pre-GWAS

Data format

plink --bfile GSE215221 --allow-extra-chr --make-bed --out GSE215221
Data structure of plink format

Alt Text

mkdir -p materials/1kgp materials/msigdb tools QC PopuStruc ASSOC PRS ANNOVAR MAGMA
bbb tools
bB B  bbb plink1.9.zip
bB B  bbb plink2.zip
bB B  bbb regenie_v4.1.gz_x86_64_Linux.zip
bB B  bbb annovar
bB B  |   bbb humandb
bB B  |   bbb annotate_variation.pl
bB B  |   bbb table_annovar.pl
b   |B B  bbb ...
bB B  bbb magma_v1.10_static.zip
bB B  bbb magma_v1.10_static
bB B      bbb g1000_eas.zip
bB B      bbb g1000_subpop.zip
bB B      bbb NCBI38.zip
b   B  B  bbb NCBI37.zip
bbb materials
bB B  bbb 1kgp
bB B  |   bbb all_hg38.pgen.zst
b   |B B  bbb all_hg38.psam
b   |B B  bbb all_hg38.pvar.zst
bB B  bbb msigdb
bbb GSE215221.bed
bbb GSE215221.bim
bbb GSE215221.fam
bbb GSE215221.phecov
bbb marker_info.txt
bbb QC
bbb PopuStruc
bbb ASSOC
bbb ANNOVAR
bbb MAGMA
bbb ...

Tools

cd tools

Main tools for quality control and association tests

curl -o plink1.9.zip https://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20241022.zip
curl -o plink2.zip https://s3.amazonaws.com/plink2-assets/alpha6/plink2_linux_avx2_20250129.zip
# curl -o plink2.zip https://s3.amazonaws.com/plink2-assets/alpha6/plink2_linux_x86_64_20250129.zip

unzip plink1.9.zip plink
unzip plink2.zip plink2

cp plink plink2 /usr/local/bin/ 
# export PATH=$PATH:$(pwd) # set the current dir to PATH

REGENIE

A computationally efficient GWAS tool for association testing using mixed-effects models to handle complex traits and large datasets.

curl -L -o regenie4.1.zip https://github.com/rgcgithub/regenie/releases/download/v4.1/regenie_v4.1.gz_x86_64_Linux.zip

unzip regenie_v4.1.zip # regenie_v4.1.gz_x86_64_Linux

cp regenie_v4.1.gz_x86_64_Linux /usr/local/bin/regenie

ANNOVAR

A bioinformatics tool for annotating genetic variants, which supports gene-based, region-based, and filter-based annotations

Please register for download

tar xvfz annovar.latest.tar.gz
cd annovar

# download reference database
./annotate_variation.pl -buildver hg38 -downdb -webfrom annovar avdblist humandb/ 
# more humandb/hg38_avdblist.txt # to check the downloadable reference database

./annotate_variation.pl -buildver hg38 -downdb -webfrom annovar refGene humandb/
./annotate_variation.pl -buildver hg38 -downdb -webfrom annovar 1000g2015aug humandb/ # take time
./annotate_variation.pl -buildver hg38 -downdb -webfrom annovar exac03 humandb/ 
./annotate_variation.pl -buildver hg38 -downdb -webfrom annovar avsnp151 humandb/ # take time
./annotate_variation.pl -buildver hg38 -downdb -webfrom annovar dbnsfp47c humandb/ # take time
./annotate_variation.pl -buildver hg38 -downdb -webfrom annovar clinvar_20240611 humandb/
./annotate_variation.pl -buildver hg38 -downdb -webfrom annovar icgc28 humandb/
./annotate_variation.pl -buildver hg38 -downdb -webfrom annovar cosmic70 humandb/

./annotate_variation.pl -buildver hg38 -downdb cytoBand humandb/
./annotate_variation.pl -buildver hg38 -downdb gwasCatalog humandb/
./annotate_variation.pl -buildver hg38 -downdb gtexGeneModelV8 humandb/

cd ..

MAGMA

A bioinformatics tool for annotating genetic variants, which supports gene-based, region-based, and filter-based annotations

# main program
curl -o magma_v1.10_static.zip  https://vu.data.surfsara.nl/index.php/s/lxDgt2dNdNr6DYt/download
unzip magma_v1.10_static.zip -d magma_v1.10_static

cd magma_v1.10_static

# LD information
curl -o g1000_eas.zip https://vu.data.surfsara.nl/index.php/s/dz6PYdKOi3xVqHn/download
curl -o g1000_subpop.zip https://vu.data.surfsara.nl/index.php/s/alUE9KnBwCeml2S/download
unzip g1000_eas.zip -d g1000_eas
unzip g1000_subpop.zip -d g1000_subpop

# gene annotation
curl -o NCBI38.zip https://vu.data.surfsara.nl/index.php/s/yj952iHqy5anYhH/download
curl -o NCBI37.zip https://vu.data.surfsara.nl/index.php/s/Pj2orwuF2JYyKxq/download
unzip NCBI38.zip NCBI38.gene.loc
unzip NCBI37.zip NCBI37.3.gene.loc

# gene-set annotation (refer to Materials > MSigDB)

cd ..

Materials

1000 Genomes Project

Download the 1000 Genome Project PLINK-formatted genotype data for subsequent use in divergent ancestry check during sample QC

cd ../materials/1kgp
# 2022-08-04 Byrska-Bishop et al. (build 38, 3202 samples, contigs unphased), uncheck all
curl -L -o all_hg38.pgen.zst https://www.dropbox.com/s/j72j6uciq5zuzii/all_hg38.pgen.zst?dl=1
curl -L -o all_hg38_noannot.pvar.zst https://www.dropbox.com/s/ngbo2xm5ojw9koy/all_hg38_noannot.pvar.zst?dl=1
curl -L -o hg38_orig.psam https://www.dropbox.com/s/gyobtdi904m9bir/hg38_orig.psam?dl=1

mv all_hg38_noannot.pvar.zst all_hg38.pvar.zst
mv hg38_orig.psam all_hg38.psam

plink2 --zst-decompress all_hg38.pgen.zst > all_hg38.

cd ..
all_hg38.psam
#IID PAT MAT SEX SuperPop Population
HG00096 0 0 1 EUR GBR
HG00097 0 0 2 EUR GBR
HG00099 0 0 2 EUR GBR
...

MSigDB

The Molecular Signatures Database (MSigDB) is a resource of tens of thousands of annotated gene sets for use with GSEA software; here is used for MAGMA to perform gene-set analysis

Please login and go to the download page download the following file Alt Text

unzip -j msigdb_v2024.1.Hs_files_to_download_locally.zip -d msigdb_v2024.1.Hs_GMTs

cd ..

Quality control (QC)

cd ../QC
plink --bfile ../GSE215221 --allow-extra-chr --chr 1-23 --make-bed --out GSE215221_wg
plink --bfile ../GSE215221 --allow-extra-chr --chr 1-22 --make-bed --out GSE215221_auto
PLINK v1.90b6.24 64-bit (6 Jun 2021)           www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_wg.log.
Options in effect:
  --allow-extra-chr
  --bfile ../GSE215221
  --chr 1-23
  --make-bed
  --out GSE215221_wg

1031497 MB RAM detected; reserving 515748 MB for main workspace. 666891 out of 681232 variants loaded from .bim file. 616 people (286 males, 330 females) loaded from .fam. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 616 founders and 0 nonfounders present. Calculating allele frequencies... done. Warning: 66611 het. haploid genotypes present (see GSE215221_wg.hh ); many commands treat these as missing. Total genotyping rate is 0.998695. 666891 variants and 616 people pass filters and QC. Note: No phenotypes present. --make-bed to GSE215221_wg.bed + GSE215221_wg.bim + GSE215221_wg.fam ... done.

PLINK v1.90b6.24 64-bit (6 Jun 2021)           www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto.log.
Options in effect:
  --allow-extra-chr
  --bfile ../GSE215221
  --chr 1-22
  --make-bed
  --out GSE215221_auto

1031497 MB RAM detected; reserving 515748 MB for main workspace. 643809 out of 681232 variants loaded from .bim file. 616 people (286 males, 330 females) loaded from .fam. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 616 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is 0.998693. 643809 variants and 616 people pass filters and QC. Note: No phenotypes present. --make-bed to GSE215221_auto.bed + GSE215221_auto.bim + GSE215221_auto.fam ... done.

plink --bfile GSE215221_wg --check-sex --out GSE215221
grep "PROBLEM" GSE215221.sexcheck > rmInd_sex.txt
# R CMD BATCH --args --process=1 QC.R 

plink2 --bfile GSE215221_auto --remove rmInd_sex.txt --write-samples --out GSE215221_auto_1
PLINK v1.90b6.24 64-bit (6 Jun 2021)           www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221.log.
Options in effect:
  --bfile GSE215221_wg
  --check-sex
  --out GSE215221

1031497 MB RAM detected; reserving 515748 MB for main workspace. 666891 variants loaded from .bim file. 616 people (286 males, 330 females) loaded from .fam. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 616 founders and 0 nonfounders present. Calculating allele frequencies... done. Warning: 66611 het. haploid genotypes present (see GSE215221.hh ); many commands treat these as missing. Total genotyping rate is 0.998695. 666891 variants and 616 people pass filters and QC. Note: No phenotypes present. --check-sex: 13879 Xchr and 0 Ychr variant(s) scanned, 6 problems detected. Report written to GSE215221.sexcheck .

GSE215221.sexcheck
FID IID PEDSEX SNPSEX STATUS F
C0031950 C0031950 2 2 OK 0.02416
C0040097 C0040097 1 1 OK 0.9088
...
C0051121 C0051121 2 1 PROBLEM 0.8952
...
rmInd_sex.txt
FID IID PEDSEX SNPSEX STATUS F
C0051121 C0051121 2 1 PROBLEM 0.8952
C0200431 C0200431 2 1 PROBLEM 0.91
C0200506 C0200506 2 1 PROBLEM 0.9138
C0200624 C0200624 2 1 PROBLEM 0.9121
C0200711 C0200711 1 2 PROBLEM -0.04655
N1100157 N1100157 1 2 PROBLEM -0.0021
sex_check.pdf

Alt Text

PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022)   www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto_1.log.
Options in effect:
  --bfile GSE215221_auto
  --out GSE215221_auto_1
  --remove rmInd_sex.txt
  --write-samples

Start time: Mon Mar 17 10:38:08 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 616 samples (330 females, 286 males; 616 founders) loaded from GSE215221_auto.fam. Note: No phenotype data present. --remove: 610 samples remaining. 610 samples (326 females, 284 males; 610 founders) remaining after main filters. --write-samples: Sample IDs written to GSE215221_auto_1.id . End time: Mon Mar 17 10:38:08 2025

GSE215221_auto_1.id
#FID IID
C0031950 C0031950
C0040097 C0040097
C0040101 C0040101
...
plink --bfile GSE215221_auto --keep GSE215221_auto_1.id --het --missing --out GSE215221_auto_1
R CMD BATCH --args --process=2 --imiss_thre=0.05 --het_thre=1 QC.R 
plink2 --bfile GSE215221_auto --keep GSE215221_auto_1.id --remove rmInd_missing_het.txt --write-samples --out GSE215221_auto_2
PLINK v1.90b6.24 64-bit (6 Jun 2021)           www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto_1.log.
Options in effect:
  --bfile GSE215221_auto
  --het
  --keep GSE215221_auto_1.id
  --missing
  --out GSE215221_auto_1

1031497 MB RAM detected; reserving 515748 MB for main workspace. 643809 variants loaded from .bim file. 616 people (286 males, 330 females) loaded from .fam. --keep: 610 people remaining. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 610 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate in remaining samples is 0.998689. --missing: Sample missing data report written to GSE215221_auto_1.imiss, and variant-based missing data report written to GSE215221_auto_1.lmiss. 643809 variants and 610 people pass filters and QC. Note: No phenotypes present. --het: 516907 variants scanned, report written to GSE215221_auto_1.het .

GSE215221_auto_1.imiss
FID IID MISS_PHENO N_MISS N_GENO F_MISS
C0031950 C0031950 Y 273 643809 0.000424
C0040097 C0040097 Y 264 643809 0.0004101
C0040101 C0040101 Y 221 643809 0.0003433
...
GSE215221_auto_1.het
FID IID O(HOM) E(HOM) N(NM) F
C0031950 C0031950 416363 4.166e+05 516655 -0.00224
C0040097 C0040097 416135 4.166e+05 516658 -0.00459
C0040101 C0040101 415233 4.166e+05 516698 -0.01386
...
cr_hr_check.pdf

Alt Text

rmInd_missing_het.txt
FID IID missing_rate heter._rate
C0200416 C0200416 0.000703625 0.184917343370625
C0200436 C0200436 0.000382101 0.181708740344473
C0200540 C0200540 0.00103757 0.18563512155061
C0200550 C0200550 0.00103913 0.182199031975038
C0200609 C0200609 0.00162626 0.186122600878434
C0200721 C0200721 0.000489276 0.182925035229261
C0200804 C0200804 0.0200681 0.208415008689993
N1000223 N1000223 0.000660134 0.186715765710823
N1100212 N1100212 0.00131871 0.186438707577366
Y2000147 Y2000147 0.000779734 0.182000379534567
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022)   www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto_2.log.
Options in effect:
  --bfile GSE215221_auto
  --keep GSE215221_auto_1.id
  --out GSE215221_auto_2
  --remove rmInd_missing_het.txt
  --write-samples

Start time: Mon Mar 17 15:50:54 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 616 samples (330 females, 286 males; 616 founders) loaded from GSE215221_auto.fam. Note: No phenotype data present. --keep: 610 samples remaining. --remove: 600 samples remaining. 600 samples (321 females, 279 males; 600 founders) remaining after main filters. --write-samples: Sample IDs written to GSE215221_auto_2.id . End time: Mon Mar 17 15:50:54 2025

plink --bfile GSE215221_auto --keep GSE215221_auto_2.id --indep-pairwise 50 5 0.2 --out GSE215221_auto_2
plink2 --bfile GSE215221_auto --keep GSE215221_auto_2.id --extract GSE215221_auto_2.prune.in --make-king-table --king-cutoff 0.0442 --out GSE215221_auto_2
plink2 --bfile GSE215221_auto --keep GSE215221_auto_2.id --remove GSE215221_auto_2.king.cutoff.out.id --write-samples --out GSE215221_auto_3
PLINK v1.90b6.24 64-bit (6 Jun 2021)           www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto_2.log.
Options in effect:
  --bfile GSE215221_auto
  --indep-pairwise 50 5 0.2
  --keep GSE215221_auto_2.id
  --out GSE215221_auto_2

1031497 MB RAM detected; reserving 515748 MB for main workspace. 643809 variants loaded from .bim file. 616 people (286 males, 330 females) loaded from .fam. --keep: 600 people remaining. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 600 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate in remaining samples is 0.998714. 643809 variants and 600 people pass filters and QC. Note: No phenotypes present. Pruned 38300 variants from chromosome 1, leaving 23190. Pruned 28467 variants from chromosome 2, leaving 22028. Pruned 23561 variants from chromosome 3, leaving 18771. Pruned 17831 variants from chromosome 4, leaving 16740. Pruned 18666 variants from chromosome 5, leaving 16384. Pruned 32273 variants from chromosome 6, leaving 17203. Pruned 17673 variants from chromosome 7, leaving 14790. Pruned 15003 variants from chromosome 8, leaving 13973. Pruned 21532 variants from chromosome 9, leaving 13285. Pruned 16440 variants from chromosome 10, leaving 13824. Pruned 18434 variants from chromosome 11, leaving 12993. Pruned 15384 variants from chromosome 12, leaving 13215. Pruned 12007 variants from chromosome 13, leaving 9597. Pruned 10153 variants from chromosome 14, leaving 9063. Pruned 11722 variants from chromosome 15, leaving 9319. Pruned 12847 variants from chromosome 16, leaving 10047. Pruned 16555 variants from chromosome 17, leaving 9512. Pruned 8703 variants from chromosome 18, leaving 8858. Pruned 13131 variants from chromosome 19, leaving 7175. Pruned 7491 variants from chromosome 20, leaving 7667. Pruned 4522 variants from chromosome 21, leaving 4327. Pruned 6381 variants from chromosome 22, leaving 4772. Pruning complete. 367076 of 643809 variants removed. Marker lists written to GSE215221_auto_2.prune.in and GSE215221_auto_2.prune.out .

PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022)   www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto_2.log.
Options in effect:
  --bfile GSE215221_auto
  --extract GSE215221_auto_2.prune.in
  --keep GSE215221_auto_2.id
  --king-cutoff 0.0442
  --out GSE215221_auto_2

Start time: Mon Mar 17 16:10:27 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 616 samples (330 females, 286 males; 616 founders) loaded from GSE215221_auto.fam. 643809 variants loaded from GSE215221_auto.bim. Note: No phenotype data present. --extract: 276733 variants remaining. --keep: 600 samples remaining. 600 samples (321 females, 279 males; 600 founders) remaining after main filters. 276733 variants remaining after main filters. --king-cutoff pass 1/1: Scanning for rare variants... done. 61011 variants handled by initial scan (215722 remaining). --king-cutoff pass 1/1: Condensing... done. --king-cutoff: 276733 variants processed. --king-cutoff: Excluded sample IDs written to GSE215221_auto_2.king.cutoff.out.id , and 558 remaining sample IDs written to GSE215221_auto_2.king.cutoff.in.id . End time: Mon Mar 17 16:10:27 2025

GSE215221_auto_2.king.cutoff.out.id
FID IID
C0070249 C0070249
C0072289 C0072289
C0072303 C0072303
...
Kinship Coefficient Range Relationship
> 0.354 Duplicate/MZ Twin
[0.177, 0.354] 1st-Degree
[0.0884, 0.177] 2nd-Degree
[0.0442, 0.0884] 3rd-Degree
cryptic_relatedness_check.pdf

Alt Text

PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022)   www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto_3.log.
Options in effect:
  --bfile GSE215221_auto
  --keep GSE215221_auto_2.id
  --out GSE215221_auto_3
  --remove GSE215221_auto_2.king.cutoff.out.id
  --write-samples

Start time: Mon Mar 17 16:14:11 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 616 samples (330 females, 286 males; 616 founders) loaded from GSE215221_auto.fam. Note: No phenotype data present. --keep: 600 samples remaining. --remove: 558 samples remaining. 558 samples (300 females, 258 males; 558 founders) remaining after main filters. --write-samples: Sample IDs written to GSE215221_auto_3.id . End time: Mon Mar 17 16:14:11 2025

extract variants in target data from 1000 genomes project data

awk '{print $1,$4,$4,$2}' GSE215221_auto.bim > lst_var.txt
plink2 --pfile ../materials/1kgp/all_hg38 vzs --allow-extra-chr --extract bed1 lst_var.txt --snps-only --max-alleles 2 --set-all-var-ids @:# --rm-dup exclude-all --make-bed --out all_hg38_extract 
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022)   www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to all_hg38_extract.log.
Options in effect:
  --allow-extra-chr
  --extract bed1 lst_var.txt
  --make-bed
  --max-alleles 2
  --out all_hg38_extract
  --pfile ../materials/1kgp/all_hg38 vzs
  --rm-dup exclude-all
  --set-all-var-ids @:#
  --snps-only

Start time: Tue Mar 18 14:09:21 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 3202 samples (1604 females, 1598 males; 2594 founders) loaded from ../materials/1kgp/all_hg38.psam. 65275489 out of 75193455 variants loaded from ../materials/1kgp/all_hg38.pvar.zst. 2 categorical phenotypes loaded. --rm-dup: 641751 duplicated IDs, 1291084 variants removed. --extract bed1: 63461330 variants excluded. 523075 variants remaining after main filters. Writing all_hg38_extract.fam ... done. Writing all_hg38_extract.bim ... done. Writing all_hg38_extract.bed ... done. End time: Tue Mar 18 14:09:30 2025

all_hg38_extract.bim
1 1:653347 0 653347 A G
1 1:784860 0 784860 C T
1 1:794577 0 794577 C T
...

change SNP id representation (PLINK merge data by SNP id)

plink2 --bfile GSE215221_auto --keep GSE215221_auto_3.id --set-all-var-ids @:# --rm-dup exclude-all --make-bed --out GSE215221_auto_3_
sed -i "s/\./0/g;" GSE215221_auto_3_.bim
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022)   www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto_3_.log.
Options in effect:
  --bfile GSE215221_auto
  --keep GSE215221_auto_3.id
  --make-bed
  --out GSE215221_auto_3_
  --rm-dup exclude-all
  --exclude-all
  --set-all-var-ids @:#

Start time: Tue Mar 18 14:28:05 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 616 samples (330 females, 286 males; 616 founders) loaded from GSE215221_auto.fam. 643809 variants loaded from GSE215221_auto.bim. Note: No phenotype data present. --rm-dup: 3377 duplicated IDs, 7377 variants removed. --keep: 558 samples remaining. 558 samples (300 females, 258 males; 558 founders) remaining after main filters. Writing GSE215221_auto_3_.fam ... done. Writing GSE215221_auto_3_.bim ... done. Writing GSE215221_auto_3_.bed ... done. End time: Tue Mar 18 14:28:05 2025

GSE215221_auto_3_.bim
1 1:13417 0 13417 0 -
1 1:30794 0 30794 0 G
1 1:46434 0 46434 0 A
...

merge target data and 1000 genomes project data

plink --bfile GSE215221_auto_3_ --bmerge all_hg38_extract --make-bed --out tmp || true
if [ -f tmp-merge.missnp ]; then
	plink --bfile all_hg38_extract --flip tmp-merge.missnp --make-bed --out all_hg38_extract_flip
	plink --bfile GSE215221_auto_3_ --bmerge all_hg38_extract_flip --make-bed --out GSE215221_auto_3_1 || true

	if [ -f GSE215221_auto_3_1-merge.missnp ]; then
		plink --bfile all_hg38_extract_flip --exclude GSE215221_auto_3_1-merge.missnp --make-bed --out all_hg38_extract_flip_2
		plink --bfile GSE215221_auto_3_ --bmerge all_hg38_extract_flip_2 --make-bed --out GSE215221_auto_3_1
	fi
else
  mv tmp.bed GSE215221_auto_3_1.bed
  mv tmp.fam GSE215221_auto_3_1.fam
  mv tmp.bim GSE215221_auto_3_1.bim
#	plink --bfile GSE215221_auto_3_ --bmerge all_hg38_extract --make-bed --out GSE215221_auto_3_1
fi

awk '($5=="A" && $6=="T") || ($5=="T" && $6=="A") || ($5=="C" && $6=="G") || ($5=="G" && $6=="C") {print $2,$5,$6}' GSE215221_auto_3_1.bim > lst_AT_CG.txt

n1=$(wc -l < GSE215221_auto_3.id)
n2=$(wc -l < all_hg38_extract.fam)
thres=$(bc -l <<< "$((n1 <= n2 ? n1 : n2))/($n1+$n2)")

plink --bfile GSE215221_auto_3_1 --geno 0.05 --exclude lst_AT_CG.txt --make-bed --out GSE215221_auto_3_2
plink2 --bfile GSE215221_auto_3_2 --geno 0.05 --maf 0.01 --mind 0.05 --pca --out GSE215221_auto_3_2

R CMD BATCH --args --process=4 --ancestry_info=../materials/1kgp/all_hg38.psam --data_name=Demo --confid=0.9999 QC.R
plink --bfile GSE215221_auto --keep GSE215221_auto_3.id --remove rmInd_divAncestry.txt --make-bed --out GSE215221_auto_qcSamp
PLINK v1.90b6.24 64-bit (6 Jun 2021)           www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to tmp.log.
Options in effect:
  --bfile GSE215221_auto_3_
  --bmerge all_hg38_extract
  --make-bed
  --out tmp

1031497 MB RAM detected; reserving 515748 MB for main workspace. 558 people loaded from GSE215221_auto_3_.fam. 3202 people to be merged from all_hg38_extract.fam. Of these, 3202 are new, while 0 are present in the base dataset. 636432 markers loaded from GSE215221_auto_3_.bim. 523075 markers to be merged from all_hg38_extract.bim. Of these, 290 are new, while 522785 are present in the base dataset. Error: 784 variants with 3+ alleles present. * If you believe this is due to strand inconsistency, try --flip with tmp-merge.missnp. (Warning: if this seems to work, strand errors involving SNPs with A/T or C/G alleles probably remain in your data. If LD between nearby SNPs is high, --flip-scan should detect them.) * If you are dealing with genuine multiallelic variants, we recommend exporting that subset of the data to VCF (via e.g. '--recode vcf'), merging with another tool/script, and then importing the result; PLINK is not yet suited to handling them. See https://www.cog-genomics.org/plink/1.9/data#merge3 for more discussion.

PLINK v1.90b6.24 64-bit (6 Jun 2021)           www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to all_hg38_extract_flip.log.
Options in effect:
  --bfile all_hg38_extract
  --flip tmp-merge.missnp
  --make-bed
  --out all_hg38_extract_flip

1031497 MB RAM detected; reserving 515748 MB for main workspace. 523075 variants loaded from .bim file. 3202 people (1598 males, 1604 females) loaded from .fam. --flip: 784 SNPs flipped. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 2594 founders and 608 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is in [0.9999995, 1). 523075 variants and 3202 people pass filters and QC. Note: No phenotypes present. --make-bed to all_hg38_extract_flip.bed + all_hg38_extract_flip.bim + all_hg38_extract_flip.fam ... done.

PLINK v1.90b6.24 64-bit (6 Jun 2021)           www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto_3_1.log.
Options in effect:
  --bfile GSE215221_auto_3_
  --bmerge all_hg38_extract_flip
  --make-bed
  --out GSE215221_auto_3_1

1031497 MB RAM detected; reserving 515748 MB for main workspace. 558 people loaded from GSE215221_auto_3_.fam. 3202 people to be merged from all_hg38_extract_flip.fam. Of these, 3202 are new, while 0 are present in the base dataset. 636432 markers loaded from GSE215221_auto_3_.bim. 523075 markers to be merged from all_hg38_extract_flip.bim. Of these, 290 are new, while 522785 are present in the base dataset. Error: 771 variants with 3+ alleles present.

PLINK v1.90b6.24 64-bit (6 Jun 2021)           www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to all_hg38_extract_flip_2.log.
Options in effect:
  --bfile all_hg38_extract_flip
  --exclude GSE215221_auto_3_1-merge.missnp
  --make-bed
  --out all_hg38_extract_flip_2

1031497 MB RAM detected; reserving 515748 MB for main workspace. 523075 variants loaded from .bim file. 3202 people (1598 males, 1604 females) loaded from .fam. --exclude: 522304 variants remaining. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 2594 founders and 608 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is in [0.9999995, 1). 522304 variants and 3202 people pass filters and QC. Note: No phenotypes present. --make-bed to all_hg38_extract_flip_2.bed + all_hg38_extract_flip_2.bim + all_hg38_extract_flip_2.fam ... done.

PLINK v1.90b6.24 64-bit (6 Jun 2021)           www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto_3_1.log.
Options in effect:
  --bfile GSE215221_auto_3_
  --bmerge all_hg38_extract_flip_2
  --make-bed
  --out GSE215221_auto_3_1

1031497 MB RAM detected; reserving 515748 MB for main workspace. 558 people loaded from GSE215221_auto_3_.fam. 3202 people to be merged from all_hg38_extract_flip_2.fam. Of these, 3202 are new, while 0 are present in the base dataset. 636432 markers loaded from GSE215221_auto_3_.bim. 522304 markers to be merged from all_hg38_extract_flip_2.bim. Of these, 290 are new, while 522014 are present in the base dataset. Performing single-pass merge (3760 people, 636722 variants). Merged fileset written to GSE215221_auto_3_1-merge.bed + GSE215221_auto_3_1-merge.bim + GSE215221_auto_3_1-merge.fam . 636722 variants loaded from .bim file. 3760 people (1856 males, 1904 females) loaded from .fam. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 3152 founders and 608 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is 0.846715. 636722 variants and 3760 people pass filters and QC. Note: No phenotypes present. --make-bed to GSE215221_auto_3_1.bed + GSE215221_auto_3_1.bim + GSE215221_auto_3_1.fam ... done.

lst_AT_CG.txt
1:819064 G C
1:835427 G C
1:858542 C G
...
PLINK v1.90b6.24 64-bit (6 Jun 2021) www.cog-genomics.org/plink/1.9/ (C) 2005-2021 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_3_2.log. Options in effect: --bfile GSE215221_auto_3_1 --exclude lst_AT_CG.txt --geno 0.05 --make-bed --out GSE215221_auto_3_2

1031497 MB RAM detected; reserving 515748 MB for main workspace. 636722 variants loaded from .bim file. 3760 people (1856 males, 1904 females) loaded from .fam. --exclude: 600773 variants remaining. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 3152 founders and 608 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is 0.83899. 113658 variants removed due to missing genotype data (--geno). 487115 variants and 3760 people pass filters and QC. Note: No phenotypes present. --make-bed to GSE215221_auto_3_2.bed + GSE215221_auto_3_2.bim + GSE215221_auto_3_2.fam ... done.

PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022)   www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto_3_2.log.
Options in effect:
  --bfile GSE215221_auto_3_2
  --geno 0.05
  --maf 0.01
  --mind 0.05
  --out GSE215221_auto_3_2
  --pca

Start time: Tue Mar 18 15:54:46 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 3760 samples (1904 females, 1856 males; 3152 founders) loaded from GSE215221_auto_3_2.fam. 487115 variants loaded from GSE215221_auto_3_2.bim. Note: No phenotype data present. Calculating sample missingness rates... done. 0 samples removed due to missing genotype data (--mind). 3760 samples (1904 females, 1856 males; 3152 founders) remaining after main filters. Calculating allele frequencies... done. --geno: 0 variants removed due to missing genotype data. 111335 variants removed due to allele frequency threshold(s) (--maf/--max-maf/--mac/--max-mac). 375780 variants remaining after main filters. Constructing GRM: done. Correcting for missingness... done. Extracting eigenvalues and eigenvectors... done. --pca: Eigenvectors written to GSE215221_auto_3_2.eigenvec , and eigenvalues written to GSE215221_auto_3_2.eigenval . End time: Tue Mar 18 15:55:08 2025

GSE215221_auto_3_2.eigenval
275.442
160.936
41.4036
30.9307
6.80762
4.7665
4.38759
4.13442
4.07024
3.82537
GSE215221_auto_3_2.eigenvec
#FID IID PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
0 HG00096 0.00182827 0.0254425 0.00250875 -0.0234405 -0.000501691 -0.0129464 0.0118278 0.000532008 -0.0223398 -0.0612923
0 HG00097 0.00131385 0.0249048 0.00346359 -0.0209927 -0.000600105 0.0148235 0.0121476 -0.00126195 -0.00914142 0.00890024
0 HG00099 0.00144112 0.0253341 0.0024523 -0.0202484 -0.00104474 -0.000259693 0.0332289 -0.00905011 -0.0387468 -0.0657115
...
C0031950 C0031950 -0.0172785 -0.0162383 0.00185633 -0.00683732 -0.00549263 0.00432971 -0.00167519 -0.0154494 0.00673437 -0.03074
C0040097 C0040097 -0.0169158 -0.0156269 0.000739937 -0.00632163 -0.00335534 0.00345126 -0.00300809 -0.0162726 0.0123476 -0.000880811
C0040101 C0040101 -0.0171392 -0.0157113 0.00160724 -0.00565268 -0.00327237 0.00708489 -0.00134046 -0.020669 0.0113995 -0.00920717
...
divAncestry_check.pdf

Alt Text

Alt Text

rmInd_divAncestry.txt
C0062984 C0062984
C0200505 C0200505
PLINK v1.90b6.24 64-bit (6 Jun 2021)           www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto_qcSamp.log.
Options in effect:
  --bfile GSE215221_auto
  --keep GSE215221_auto_3.id
  --make-bed
  --out GSE215221_auto_qcSamp
  --remove rmInd_divAncestry.txt

1031497 MB RAM detected; reserving 515748 MB for main workspace. 643809 variants loaded from .bim file. 616 people (286 males, 330 females) loaded from .fam. --keep: 558 people remaining. --remove: 556 people remaining. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 556 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate in remaining samples is 0.998744. 643809 variants and 556 people pass filters and QC. Note: No phenotypes present. --make-bed to GSE215221_auto_qcSamp.bed + GSE215221_auto_qcSamp.bim + GSE215221_auto_qcSamp.fam ... done.

plink --bfile GSE215221_auto_qcSamp --geno 0.05 --pheno ../GSE215221.phecov --pheno-name CKD --1 --test-missing --out GSE215221_auto_qcSamp

n=$(wc -l < GSE215221_auto_qcSamp.missing) # 188976 (-1)
thres=$(bc -l <<< "(0.05/($n-1))")
awk -v thres="$thres" '$5<thres {print $2}' GSE215221_auto_qcSamp.missing > rmSNP_missingTest.txt

plink --bfile GSE215221_auto_qcSamp --geno 0.05 --exclude rmSNP_missingTest.txt --write-snplist --out GSE215221_auto_5
PLINK v1.90b6.24 64-bit (6 Jun 2021)           www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto_qcSamp.log.
Options in effect:
  --1
  --bfile GSE215221_auto_qcSamp
  --geno 0.05
  --out GSE215221_auto_qcSamp
  --pheno ../GSE215221.phecov
  --pheno-name CKD
  --test-missing

1031497 MB RAM detected; reserving 515748 MB for main workspace. 643809 variants loaded from .bim file. 556 people (258 males, 298 females) loaded from .fam. 364 phenotype values present after --pheno. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 556 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is 0.998744. 14 variants removed due to missing genotype data (--geno). 643795 variants and 556 people pass filters and QC. Among remaining phenotypes, 66 are cases and 298 are controls. (192 phenotypes are missing.) Writing --test-missing report to GSE215221_auto_qcSamp.missing ... done.

GSE215221_auto_qcSamp.missing
CHR SNP F_MISS_A F_MISS_U P
1 Affx-52345991 0.01515 0 0.1813
1 Affx-116685712 0 0.003356 1
1 Affx-474297229 0 0.003356 1
...
rmSNP_missingTest.txt
PLINK v1.90b6.24 64-bit (6 Jun 2021)           www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto_5.log.
Options in effect:
  --bfile GSE215221_auto_qcSamp
  --exclude rmSNP_missingTest.txt
  --geno 0.05
  --out GSE215221_auto_5
  --write-snplist

1031497 MB RAM detected; reserving 515748 MB for main workspace. 643809 variants loaded from .bim file. 556 people (258 males, 298 females) loaded from .fam. --exclude: 643809 variants remaining. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 556 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is 0.998744. 14 variants removed due to missing genotype data (--geno). 643795 variants and 556 people pass filters and QC. Note: No phenotypes present. List of variant IDs written to GSE215221_auto_5.snplist .

GSE215221_auto_5.snplist
Affx-52345991
Affx-474295932
Affx-474295945
...
plink --bfile GSE215221_auto_qcSamp --extract GSE215221_auto_5.snplist --maf 0.01 --write-snplist --out GSE215221_auto_6
PLINK v1.90b6.24 64-bit (6 Jun 2021)           www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto_6.log.
Options in effect:
  --bfile GSE215221_auto_qcSamp
  --extract GSE215221_auto_5.snplist
  --maf 0.01
  --out GSE215221_auto_6
  --write-snplist

1031497 MB RAM detected; reserving 515748 MB for main workspace. 643809 variants loaded from .bim file. 556 people (258 males, 298 females) loaded from .fam. --extract: 643795 variants remaining. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 556 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is 0.998745. 167393 variants removed due to minor allele threshold(s) (--maf/--max-maf/--mac/--max-mac). 476402 variants and 556 people pass filters and QC. Note: No phenotypes present. List of variant IDs written to GSE215221_auto_6.snplist .

n=$(wc -l < GSE215221_auto_6.snplist) # 476402
thres=$(bc -l <<< 0.05/$n)
plink --bfile GSE215221_auto_qcSamp --extract GSE215221_auto_6.snplist --pheno ../GSE215221.phecov --pheno-name CKD --1 --hwe $thres --make-bed --out GSE215221_auto_qcSamp_qcVar

# plink2 --bfile GSE215221_auto_qcSamp --extract GSE215221_auto_6.snplist --pheno ../GSE215221.phecov --pheno-name DKD --1 --keep-if DKD == control --hwe $thres --write-snplist --out GSE215221_auto_7
# plink2 --bfile GSE215221_auto_qcSamp --extract GSE215221_auto_7.snplist --make-bed --out GSE215221_auto_qcSamp_qcVar
PLINK v1.90b6.24 64-bit (6 Jun 2021)           www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto_qcSamp_qcVar.log.
Options in effect:
  --1
  --bfile GSE215221_auto_qcSamp
  --extract GSE215221_auto_6.snplist
  --hwe .00000010495337970873
  --make-bed
  --out GSE215221_auto_qcSamp_qcVar
  --pheno ../GSE215221.phecov
  --pheno-name CKD

1031497 MB RAM detected; reserving 515748 MB for main workspace. 643809 variants loaded from .bim file. 556 people (258 males, 298 females) loaded from .fam. 364 phenotype values present after --pheno. --extract: 476402 variants remaining. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 556 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is 0.998504. --hwe: 150 variants removed due to Hardy-Weinberg exact test. 476252 variants and 556 people pass filters and QC. Among remaining phenotypes, 66 are cases and 298 are controls. (192 phenotypes are missing.) --make-bed to GSE215221_auto_qcSamp_qcVar.bed + GSE215221_auto_qcSamp_qcVar.bim

  

GWAS

Population structure

cd ../PopuStruc

No variant selection

plink2 --bfile ../QC/GSE215221_auto_qcSamp_qcVar --pca biallelic-var-wts --out GSE215221_auto_qcSamp_qcVar
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022)   www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto_qcSamp_qcVar.log.
Options in effect:
  --bfile ../QC/GSE215221_auto_qcSamp_qcVar
  --out GSE215221_auto_qcSamp_qcVar
  --pca biallelic-var-wts

Start time: Wed Mar 19 14:10:52 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 556 samples (298 females, 258 males; 556 founders) loaded from ../QC/GSE215221_auto_qcSamp_qcVar.fam. 476252 variants loaded from ../QC/GSE215221_auto_qcSamp_qcVar.bim. 1 binary phenotype loaded (95 cases, 298 controls). Calculating allele frequencies... done. Constructing GRM: done. Correcting for missingness... done. Extracting eigenvalues and eigenvectors... done. --pca: Variant weights written to GSE215221_auto_qcSamp_qcVar.eigenvec.var . --pca: Eigenvectors written to GSE215221_auto_qcSamp_qcVar.eigenvec , and eigenvalues written to GSE215221_auto_qcSamp_qcVar.eigenval . End time: Wed Mar 19 14:10:58 2025

Pruning by counts of SNPs

plink --bfile ../QC/GSE215221_auto_qcSamp_qcVar --indep-pairwise 5000 500 0.2 --out GSE215221_auto_qcSamp_qcVar.cnt
plink2 --bfile ../QC/GSE215221_auto_qcSamp_qcVar --extract GSE215221_auto_qcSamp_qcVar.cnt.prune.in --pca biallelic-var-wts --out GSE215221_auto_qcSamp_qcVar.pruningCnt
PLINK v1.90b6.24 64-bit (6 Jun 2021)           www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto_qcSamp_qcVar.cnt.log.
Options in effect:
  --bfile ../QC/GSE215221_auto_qcSamp_qcVar
  --indep-pairwise 5000 500 0.2
  --out GSE215221_auto_qcSamp_qcVar.cnt

1031497 MB RAM detected; reserving 515748 MB for main workspace. 476252 variants loaded from .bim file. 556 people (258 males, 298 females) loaded from .fam. 393 phenotype values loaded from .fam. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 556 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is 0.998506. 476252 variants and 556 people pass filters and QC. Among remaining phenotypes, 95 are cases and 298 are controls. (163 phenotypes are missing.) Pruned 27917 variants from chromosome 1, leaving 19499. Pruned 18088 variants from chromosome 2, leaving 18821. Pruned 15084 variants from chromosome 3, leaving 16183. Pruned 13425 variants from chromosome 4, leaving 14775. Pruned 12682 variants from chromosome 5, leaving 14284. Pruned 23921 variants from chromosome 6, leaving 13430. Pruned 11436 variants from chromosome 7, leaving 12828. Pruned 10699 variants from chromosome 8, leaving 12176. Pruned 16207 variants from chromosome 9, leaving 11297. Pruned 10987 variants from chromosome 10, leaving 12014. Pruned 10232 variants from chromosome 11, leaving 11061. Pruned 9762 variants from chromosome 12, leaving 11491. Pruned 6483 variants from chromosome 13, leaving 8534. Pruned 6466 variants from chromosome 14, leaving 7921. Pruned 6866 variants from chromosome 15, leaving 8068. Pruned 7243 variants from chromosome 16, leaving 8584. Pruned 6950 variants from chromosome 17, leaving 8093. Pruned 6201 variants from chromosome 18, leaving 7849. Pruned 5760 variants from chromosome 19, leaving 6082. Pruned 5119 variants from chromosome 20, leaving 6790. Pruned 3122 variants from chromosome 21, leaving 3882. Pruned 3821 variants from chromosome 22, leaving 4119. Pruning complete. 238471 of 476252 variants removed. Marker lists written to GSE215221_auto_qcSamp_qcVar.cnt.prune.in and GSE215221_auto_qcSamp_qcVar.cnt.prune.out .

PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022)   www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto_qcSamp_qcVar.pruningCnt.log.
Options in effect:
  --bfile ../QC/GSE215221_auto_qcSamp_qcVar
  --extract GSE215221_auto_qcSamp_qcVar.cnt.prune.in
  --out GSE215221_auto_qcSamp_qcVar.pruningCnt
  --pca biallelic-var-wts

Start time: Wed Mar 19 14:39:31 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 556 samples (298 females, 258 males; 556 founders) loaded from ../QC/GSE215221_auto_qcSamp_qcVar.fam. 476252 variants loaded from ../QC/GSE215221_auto_qcSamp_qcVar.bim. 1 binary phenotype loaded (95 cases, 298 controls). --extract: 237781 variants remaining. Calculating allele frequencies... done. 237781 variants remaining after main filters. Constructing GRM: done. Correcting for missingness... done. Extracting eigenvalues and eigenvectors... done. --pca: Variant weights written to GSE215221_auto_qcSamp_qcVar.pruningCnt.eigenvec.var . --pca: Eigenvectors written to GSE215221_auto_qcSamp_qcVar.pruningCnt.eigenvec , and eigenvalues written to GSE215221_auto_qcSamp_qcVar.pruningCnt.eigenval . End time: Wed Mar 19 14:39:34 2025

Pruning by distance

plink --bfile ../QC/GSE215221_auto_qcSamp_qcVar --indep-pairwise 5000kb 1 0.2 --out GSE215221_auto_qcSamp_qcVar.kb
plink2 --bfile ../QC/GSE215221_auto_qcSamp_qcVar --extract GSE215221_auto_qcSamp_qcVar.kb.prune.in --pca biallelic-var-wts --out GSE215221_auto_qcSamp_qcVar.pruningKb
PLINK v1.90b6.24 64-bit (6 Jun 2021)           www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto_qcSamp_qcVar.kb.log.
Options in effect:
  --bfile ../QC/GSE215221_auto_qcSamp_qcVar
  --indep-pairwise 5000kb 1 0.2
  --out GSE215221_auto_qcSamp_qcVar.kb

1031497 MB RAM detected; reserving 515748 MB for main workspace. 476252 variants loaded from .bim file. 556 people (258 males, 298 females) loaded from .fam. 393 phenotype values loaded from .fam. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 556 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is 0.998506. 476252 variants and 556 people pass filters and QC. Among remaining phenotypes, 95 are cases and 298 are controls. (163 phenotypes are missing.) Pruned 27611 variants from chromosome 1, leaving 19805. Pruned 17969 variants from chromosome 2, leaving 18940. Pruned 14974 variants from chromosome 3, leaving 16293. Pruned 13324 variants from chromosome 4, leaving 14876. Pruned 12553 variants from chromosome 5, leaving 14413. Pruned 23813 variants from chromosome 6, leaving 13538. Pruned 11353 variants from chromosome 7, leaving 12911. Pruned 10636 variants from chromosome 8, leaving 12239. Pruned 16042 variants from chromosome 9, leaving 11462. Pruned 10912 variants from chromosome 10, leaving 12089. Pruned 10125 variants from chromosome 11, leaving 11168. Pruned 9686 variants from chromosome 12, leaving 11567. Pruned 6440 variants from chromosome 13, leaving 8577. Pruned 6427 variants from chromosome 14, leaving 7960. Pruned 6830 variants from chromosome 15, leaving 8104. Pruned 7181 variants from chromosome 16, leaving 8646. Pruned 6905 variants from chromosome 17, leaving 8138. Pruned 6167 variants from chromosome 18, leaving 7883. Pruned 5731 variants from chromosome 19, leaving 6111. Pruned 5081 variants from chromosome 20, leaving 6828. Pruned 3111 variants from chromosome 21, leaving 3893. Pruned 3794 variants from chromosome 22, leaving 4146. Pruning complete. 236665 of 476252 variants removed. Marker lists written to GSE215221_auto_qcSamp_qcVar.kb.prune.in and GSE215221_auto_qcSamp_qcVar.kb.prune.out .

PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022)   www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto_qcSamp_qcVar.pruningKb.log.
Options in effect:
  --bfile ../QC/GSE215221_auto_qcSamp_qcVar
  --extract GSE215221_auto_qcSamp_qcVar.kb.prune.in
  --out GSE215221_auto_qcSamp_qcVar.pruningKb
  --pca biallelic-var-wts

Start time: Wed Mar 19 15:42:36 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 556 samples (298 females, 258 males; 556 founders) loaded from ../QC/GSE215221_auto_qcSamp_qcVar.fam. 476252 variants loaded from ../QC/GSE215221_auto_qcSamp_qcVar.bim. 1 binary phenotype loaded (95 cases, 298 controls). --extract: 239587 variants remaining. Calculating allele frequencies... done. 239587 variants remaining after main filters. Constructing GRM: done. Correcting for missingness... done. Extracting eigenvalues and eigenvectors... done. --pca: Variant weights written to GSE215221_auto_qcSamp_qcVar.pruningKb.eigenvec.var . --pca: Eigenvectors written to GSE215221_auto_qcSamp_qcVar.pruningKb.eigenvec , and eigenvalues written to GSE215221_auto_qcSamp_qcVar.pruningKb.eigenval . End time: Wed Mar 19 15:42:39 2025

Clumping according to MAF

plink2 --bfile ../QC/GSE215221_auto_qcSamp_qcVar --freq --out GSE215221_auto_qcSamp_qcVar

awk 'NR==1{$(NF+1)="pseudoMAF"} NR>1{$(NF+1)=($5>0.5?$5:(1-$5))}1' GSE215221_auto_qcSamp_qcVar.afreq > GSE215221_auto_qcSamp_qcVar.afreq_

plink --bfile ../QC/GSE215221_auto_qcSamp_qcVar --clump GSE215221_auto_qcSamp_qcVar.afreq_ --clump-snp-field ID --clump-field pseudoMAF --clump-p1 1 --clump-p2 1 --out GSE215221_auto_qcSamp_qcVar
awk '{print $3}' GSE215221_auto_qcSamp_qcVar.clumped > GSE215221_auto_qcSamp_qcVar.clumped.in
plink2 --bfile ../QC/GSE215221_auto_qcSamp_qcVar --extract GSE215221_auto_qcSamp_qcVar.clumped.in --pca biallelic-var-wts --out GSE215221_auto_qcSamp_qcVar.clumping
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022)   www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto_qcSamp_qcVar.log.
Options in effect:
  --bfile ../QC/GSE215221_auto_qcSamp_qcVar
  --freq
  --out GSE215221_auto_qcSamp_qcVar

Start time: Wed Mar 19 14:45:23 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 556 samples (298 females, 258 males; 556 founders) loaded from ../QC/GSE215221_auto_qcSamp_qcVar.fam. 476252 variants loaded from ../QC/GSE215221_auto_qcSamp_qcVar.bim. 1 binary phenotype loaded (95 cases, 298 controls). Calculating allele frequencies... done. --freq: Allele frequencies (founders only) written to GSE215221_auto_qcSamp_qcVar.afreq . End time: Wed Mar 19 14:45:24 2025

GSE215221_auto_qcSamp_qcVar.afreq_
#CHROM ID REF ALT ALT_FREQS OBS_CT pseudoMAF
1 Affx-116685712 A G 0.0432432 1110 0.956757
1 Affx-11739511 C T 0.0665468 1112 0.933453
1 Affx-13498574 C T 0.158228 1106 0.841772
PLINK v1.90b6.24 64-bit (6 Jun 2021)           www.cog-genomics.org/plink/1.9/
(C) 2005-2021 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto_qcSamp_qcVar.log.
Options in effect:
  --bfile ../QC/GSE215221_auto_qcSamp_qcVar
  --clump GSE215221_auto_qcSamp_qcVar.afreq_
  --clump-field pseudoMAF
  --clump-p1 1
  --clump-p2 1
  --clump-snp-field ID
  --out GSE215221_auto_qcSamp_qcVar

1031497 MB RAM detected; reserving 515748 MB for main workspace. 476252 variants loaded from .bim file. 556 people (258 males, 298 females) loaded from .fam. 393 phenotype values loaded from .fam. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 556 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is 0.998506. 476252 variants and 556 people pass filters and QC. Among remaining phenotypes, 95 are cases and 298 are controls. (163 phenotypes are missing.) --clump: 354931 clumps formed from 476252 top variants. Results written to GSE215221_auto_qcSamp_qcVar.clumped .

GSE215221_auto_qcSamp_qcVar.clumped
CHR F SNP BP P TOTAL NSIG S05 S01 S001 S0001 SP2
1 1 Affx-10839387 4723640 0.5 0 0 0 0 0 0 NONE
1 1 Affx-5149459 14183669 0.5 1 1 0 0 0 0 Affx-5150337(1)
1 1 Affx-5395008 15108230 0.5 0 0 0 0 0 0 NONE
...
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022)   www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto_qcSamp_qcVar_clumping.log.
Options in effect:
  --bfile ../QC/GSE215221_auto_qcSamp_qcVar
  --extract GSE215221_auto_qcSamp_qcVar.clumped.in
  --out GSE215221_auto_qcSamp_qcVar.clumping
  --pca biallelic-var-wts

Start time: Wed Mar 19 15:23:33 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 556 samples (298 females, 258 males; 556 founders) loaded from ../QC/GSE215221_auto_qcSamp_qcVar.fam. 476252 variants loaded from ../QC/GSE215221_auto_qcSamp_qcVar.bim. 1 binary phenotype loaded (95 cases, 298 controls). --extract: 354931 variants remaining. Calculating allele frequencies... done. 354931 variants remaining after main filters. Constructing GRM: done. Correcting for missingness... done. Extracting eigenvalues and eigenvectors... done. --pca: Variant weights written to GSE215221_auto_qcSamp_qcVar.clumping.eigenvec.var . --pca: Eigenvectors written to GSE215221_auto_qcSamp_qcVar.clumping.eigenvec , and eigenvalues written to GSE215221_auto_qcSamp_qcVar.clumping.eigenval . End time: Wed Mar 19 15:23:37 2025

Removing long-range LD (LRLD) region

curl -o LRLD_GRCh38.txt https://raw.githubusercontent.com/HsinChouYang/TPMI_Cohort/refs/heads/main/Data%20Pre-processing/Genetic%20Data/LRLD.txt

plink2 --bfile ../QC/GSE215221_auto_qcSamp_qcVar --exclude range LRLD_GRCh38.txt --pca biallelic-var-wts --out GSE215221_auto_qcSamp_qcVar.rmLRLD
LRLD_GRCh38.txt | Chromosome | Start Position | End Position | Region Name | | ---------- | -------------- | ------------ | ----------- | | 3 | 46734276 | 51834308 | chr3_1 | | 3 | 82552199 | 89185141 | chr3_2 | | 5 | 43880296 | 52063625 | chr5_1 | | 6 | 23433800 | 38969180 | chr6_1 | | 6 | 55103722 | 65888356 | chr6_2 | | 7 | 56595090 | 68032615 | chr7_1 | | 8 | 6843744 | 12873678 | chr8_1 | | 8 | 42449295 | 48374354 | chr8_2 | | 10 | 36954348 | 42730280 | chr10_1 | | 11 | 44906878 | 60416718 | chr11_1 | | 11 | 88427966 | 89565928 | chr11_2 | | 12 | 33093301 | 40436781 | chr12_1 | | 12 | 108118977 | 114040437 | chr12_2 | | 16 | 15979799 | 19179707 | chr16_1 |
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022)   www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto_qcSamp_qcVar.rmLRLD.log.
Options in effect:
  --bfile ../QC/GSE215221_auto_qcSamp_qcVar
  --exclude range LRLD_GRCh38.txt
  --out GSE215221_auto_qcSamp_qcVar.rmLRLD
  --pca biallelic-var-wts

Start time: Wed Mar 19 15:47:23 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 556 samples (298 females, 258 males; 556 founders) loaded from ../QC/GSE215221_auto_qcSamp_qcVar.fam. 476252 variants loaded from ../QC/GSE215221_auto_qcSamp_qcVar.bim. 1 binary phenotype loaded (95 cases, 298 controls). --exclude bed1: 20616 variants excluded. Calculating allele frequencies... done. 455636 variants remaining after main filters. Constructing GRM: done. Correcting for missingness... done. Extracting eigenvalues and eigenvectors... done. --pca: Variant weights written to GSE215221_auto_qcSamp_qcVar.rmLRLD.eigenvec.var . --pca: Eigenvectors written to GSE215221_auto_qcSamp_qcVar.rmLRLD.eigenvec , and eigenvalues written to GSE215221_auto_qcSamp_qcVar.rmLRLD.eigenval . End time: Wed Mar 19 15:47:28 2025

for k in {2..10}; do
  admixture --cv GSE215221_auto_qcSamp_qcVar.cnt.bed $k | tee log${k}.out
done
loading.png

Alt Text

plink2 --bfile ../QC/GSE215221_auto_qcSamp_qcVar --maf 0.1 --indep-pairwise 5000kb 1 0.2 --out GSE215221_auto_qcSamp_qcVar.maf01_pK
plink2 --bfile ../QC/GSE215221_auto_qcSamp_qcVar --extract GSE215221_auto_qcSamp_qcVar.maf01_pK.prune.in --pca biallelic-var-wts --out GSE215221_auto_qcSamp_qcVar.maf01_pruningKb
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022)   www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto_qcSamp_qcVar.maf01_pK.log.
Options in effect:
  --bfile ../QC/GSE215221_auto_qcSamp_qcVar
  --indep-pairwise 5000kb 1 0.2
  --maf 0.1
  --out GSE215221_auto_qcSamp_qcVar.maf01_pK

Start time: Mon Mar 24 14:48:14 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 556 samples (298 females, 258 males; 556 founders) loaded from ../QC/GSE215221_auto_qcSamp_qcVar.fam. 476252 variants loaded from ../QC/GSE215221_auto_qcSamp_qcVar.bim. 1 binary phenotype loaded (95 cases, 298 controls). Calculating allele frequencies... done. 258045 variants removed due to allele frequency threshold(s) (--maf/--max-maf/--mac/--max-mac). 218207 variants remaining after main filters. --indep-pairwise (12 compute threads): 135759/218207 variants removed. Variant lists written to GSE215221_auto_qcSamp_qcVar.maf01_pK.prune.in and GSE215221_auto_qcSamp_qcVar.maf01_pK.prune.out . End time: Mon Mar 24 14:48:15 2025

PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022)   www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221_auto_qcSamp_qcVar.maf01_pruningKb.log.
Options in effect:
  --bfile ../QC/GSE215221_auto_qcSamp_qcVar
  --extract GSE215221_auto_qcSamp_qcVar.maf01_pK.prune.in
  --out GSE215221_auto_qcSamp_qcVar.maf01_pruningKb
  --pca biallelic-var-wts

Start time: Mon Mar 24 14:48:47 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 556 samples (298 females, 258 males; 556 founders) loaded from ../QC/GSE215221_auto_qcSamp_qcVar.fam. 476252 variants loaded from ../QC/GSE215221_auto_qcSamp_qcVar.bim. 1 binary phenotype loaded (95 cases, 298 controls). --extract: 82448 variants remaining. Calculating allele frequencies... done. 82448 variants remaining after main filters. Constructing GRM: done. Correcting for missingness... done. Extracting eigenvalues and eigenvectors... done. --pca: Variant weights written to GSE215221_auto_qcSamp_qcVar.maf01_pruningKb.eigenvec.var . --pca: Eigenvectors written to GSE215221_auto_qcSamp_qcVar.maf01_pruningKb.eigenvec , and eigenvalues written to GSE215221_auto_qcSamp_qcVar.maf01_pruningKb.eigenval . End time: Mon Mar 24 14:48:48 2025

loading plot

Alt Text

score plot

Alt Text

Association test

cd ../ASSOC
plink2 --bfile ../QC/GSE215221_auto_qcSamp_qcVar --covar ../GSE215221.phecov --write-covar --out GSE215221

paste GSE215221.cov ../PopuStruc/GSE215221_auto_qcSamp_qcVar.maf01_pruningKb.eigenvec > GSE215221_wPC.phecov
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022)   www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221.log.
Options in effect:
  --bfile ../QC/GSE215221_auto_qcSamp_qcVar
  --covar ../GSE215221.phecov
  --out GSE215221
  --write-covar

Start time: Mon Mar 24 15:37:43 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 556 samples (298 females, 258 males; 556 founders) loaded from ../QC/GSE215221_auto_qcSamp_qcVar.fam. 1 binary phenotype loaded (95 cases, 298 controls). 4 covariates loaded from ../GSE215221.phecov. Covariates written to GSE215221.cov . End time: Mon Mar 24 15:37:43 2025

GSE215221_wPC.phecov
#FID IID sex CKD DKD DM #FID IID PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
C0031950 C0031950 2 0 0 0 C0031950 C0031950 -0.00526697 -0.0467302 0.0304832 0.0234176 0.0203419 0.0892954 -0.00586365 -0.0329808 -0.0694723 0.0102252
C0040097 C0040097 1 0 0 0 C0040097 C0040097 0.0243103 -0.0240787 0.0148346 -0.0177402 -0.00743113 0.0842174 -0.0279906 -0.0309194 -0.0671104 -0.0693914
C0040101 C0040101 1 0 0 0 C0040101 C0040101 -0.0348148 0.0201305 -0.0819504 -0.0114334 -0.071228 0.0479485 -0.00806616 0.00514522 -0.0321499 0.0280298
...

Fixed-effects model

plink2 --bfile ../QC/GSE215221_auto_qcSamp_qcVar --pheno GSE215221_wPC.phecov --pheno-name CKD --1 --covar GSE215221_wPC.phecov --covar-name sex,PC1-PC10 --covar-variance-standardize --glm no-x-sex hide-covar cols=chrom,pos,ref,alt,a1freq,a1freqcc,gcountcc,nobs,orbeta,se,tz,p,firth,err --out GSE215221
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022)   www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to GSE215221.log.
Options in effect:
  --1
  --bfile ../QC/GSE215221_auto_qcSamp_qcVar
  --covar GSE215221_wPC.phecov
  --covar-name sex,PC1-PC10
  --covar-variance-standardize
  --glm no-x-sex hide-covar cols=chrom,pos,ref,alt,a1freq,a1freqcc,gcountcc,nobs,orbeta,se,tz,p,firth,err
  --out GSE215221
  --pheno GSE215221_wPC.phecov
  --pheno-name CKD

Start time: Fri Mar 28 14:19:01 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 556 samples (298 females, 258 males; 556 founders) loaded from ../QC/GSE215221_auto_qcSamp_qcVar.fam. 476252 variants loaded from ../QC/GSE215221_auto_qcSamp_qcVar.bim. 1 binary phenotype loaded (66 cases, 298 controls). 11 covariates loaded from GSE215221_wPC.phecov. --covar-variance-standardize: 11 covariates transformed. Calculating allele frequencies... done. Warning: --glm remaining case count is less than 10x predictor count for phenotype 'CKD'. --glm logistic-Firth hybrid regression on phenotype 'CKD': done. Results written to GSE215221.CKD.glm.logistic.hybrid . End time: Fri Mar 28 14:19:02 2025

GSE215221.CKD.glm.logistic.hybrid
#CHROM POS ID REF ALT A1 CASE_NON_A1_CT CASE_HET_A1_CT CASE_HOM_A1_CT CTRL_NON_A1_CT CTRL_HET_A1_CT CTRL_HOM_A1_CT A1_FREQ A1_CASE_FREQ A1_CTRL_FREQ FIRTH? OBS_CT OR LOG(OR)_SE Z_STAT P ERRCODE
1 607880 Affx-116685712 A G G 88 7 0 270 27 0 0.0433673 0.0368421 0.0454545 N 392 0.79787 0.450643 -0.501084 0.616312 .
1 629241 Affx-11739511 C T T 82 13 0 264 34 0 0.0597964 0.0684211 0.057047 N 393 1.27972 0.363137 0.679199 0.497012 .
1 784860 Affx-13498574 C T T 63 27 4 213 82 3 0.156888 0.18617 0.147651 N 392 1.41243 0.239662 1.44083 0.149631 .
...

Mixed-effects model

plink2 --bfile ../QC/GSE215221_auto_qcSamp_qcVar --export bgen-1.2 bits=8 --out ../QC/GSE215221_auto_qcSamp_qcVar

sed -i '1s/^#//' GSE215221_wPC.phecov # regenie requires header of phenotype file start with: FID IID
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022)   www.cog-genomics.org/plink/2.0/
(C) 2005-2022 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ../QC/GSE215221_auto_qcSamp_qcVar.log.
Options in effect:
  --bfile ../QC/GSE215221_auto_qcSamp_qcVar
  --export bgen-1.2 bits=8
  --out ../QC/GSE215221_auto_qcSamp_qcVar

Start time: Fri Mar 28 14:15:45 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 556 samples (298 females, 258 males; 556 founders) loaded from ../QC/GSE215221_auto_qcSamp_qcVar.fam. 476252 variants loaded from ../QC/GSE215221_auto_qcSamp_qcVar.bim. 1 binary phenotype loaded (66 cases, 298 controls). Writing ../QC/GSE215221_auto_qcSamp_qcVar.bgen ... done. Writing ../QC/GSE215221_auto_qcSamp_qcVar.sample ... done. End time: Fri Mar 28 14:15:45 2025

# step 1
regenie \
--step 1 \
--bed ../QC/GSE215221_auto_qcSamp_qcVar \
--phenoFile GSE215221_wPC.phecov \
--phenoCol CKD \
--covarFile GSE215221_wPC.phecov \
--covarColList sex,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \
--bsize 100 \
--bt \
--out GSE215221_auto_qcSamp_qcVar.CKD.regenie.step1

# step 2
regenie \
--step 2 \
--bgen ../QC/GSE215221_auto_qcSamp_qcVar.bgen \
--phenoFile GSE215221_wPC.phecov \
--phenoCol CKD \
--covarFile GSE215221_wPC.phecov \
--covarColList sex,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \
--bsize 200 \
--bt \
--firth --approx \
--pThresh 0.01 \
--pred GSE215221_auto_qcSamp_qcVar.CKD.regenie.step1_pred.list \
--out GSE215221_auto_qcSamp_qcVar.CKD.regenie.step2
step 1
Start time: Fri Mar 28 14:35:31 2025
          |===========================|
          |      REGENIE v4.1.gz      |
          |===========================|

Copyright (c) 2020-2024 Joelle Mbatchou, Andrey Ziyatdinov and Jonathan Marchini. Distributed under the MIT License. Compiled with Boost Iostream library.

Log of output saved in file : GSE215221_auto_qcSamp_qcVar.CKD.regenie.step1.log

Options in effect: --step 1 \ --bed ../QC/GSE215221_auto_qcSamp_qcVar \ --phenoFile GSE215221_wPC.phecov \ --phenoCol CKD \ --covarFile GSE215221_wPC.phecov \ --covarColList sex,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \ --bsize 100 \ --bt \ --out GSE215221_auto_qcSamp_qcVar.CKD.regenie.step1

Fitting null model * bim : [../QC/GSE215221_auto_qcSamp_qcVar.bim] n_snps = 476252 * fam : [../QC/GSE215221_auto_qcSamp_qcVar.fam] n_samples = 556 * bed : [../QC/GSE215221_auto_qcSamp_qcVar.bed] * phenotypes : [GSE215221_wPC.phecov] n_pheno = 1 -dropping observations with missing values at any of the phenotypes -number of phenotyped individuals with no missing data = 364 * covariates : [GSE215221_wPC.phecov] n_cov = 11 -number of individuals with covariate data = 556 * number of individuals used in analysis = 364 * case-control counts for each trait: - 'CKD': 66 cases and 298 controls -fitting null logistic regression on binary phenotypes...done (0ms) -residualizing and scaling phenotypes...done (0ms) -WARNING: Sample size is less than 5,000 so using LOOCV instead of 5-fold CV. * # threads : [127] * block size : [100] * # blocks : [4775] for 476252 variants * # CV folds : [364] * ridge data_l0 : [ 5 : 0.01 0.25 0.5 0.75 0.99 ] * ridge data_l1 : [ 5 : 0.01 0.25 0.5 0.75 0.99 ] * approximate memory usage : 2GB * setting memory...done

Chromosome 1 block [1] : 100 snps (7ms) -residualizing and scaling genotypes...done (3ms) -calc working matrices...done (1ms) -calc level 0 ridge...done (3ms) block [2] : 100 snps (0ms) -residualizing and scaling genotypes...done (2ms) -calc working matrices...done (1ms) -calc level 0 ridge...done (3ms) ... block [4775] : 40 snps (8ms) -residualizing and scaling genotypes...done (3ms) -calc working matrices...done (0ms) -calc level 0 ridge...done (0ms)

Level 1 ridge with logistic regression... -on phenotype 1 (CKD)... done (8753050ms) B Output ------ phenotype 1 (CKD) : 0.01 : Rsq = 0.0709333, MSE = 0.137931, -logLik/N = 0.437073 0.25 : Rsq = 0.480708, MSE = 0.0994974, -logLik/N = 0.331347 0.5 : Rsq = 0.756199, MSE = 0.0673048, -logLik/N = 0.25897 0.75 : Rsq = 0.889101, MSE = 0.0427481, -logLik/N = 0.213876<- min value 0.99 : Rsq = 0.124539, MSE = 0.553264, -logLik/N = 1.91917 * making predictions...writing LOCO predictions...done (2258965ms)

List of blup files written to: [GSE215221_auto_qcSamp_qcVar.CKD.regenie.step1_pred.list]

Elapsed time : 11189.1s End time: Fri Mar 28 17:42:00 2025

step 2
Start time: Mon Mar 31 12:59:33 2025
          |===========================|
          |      REGENIE v4.1.gz      |
          |===========================|

Copyright (c) 2020-2024 Joelle Mbatchou, Andrey Ziyatdinov and Jonathan Marchini. Distributed under the MIT License. Compiled with Boost Iostream library.

Log of output saved in file : GSE215221_auto_qcSamp_qcVar.CKD.regenie.step2.log

Options in effect: --step 2 \ --bgen ../QC/GSE215221_auto_qcSamp_qcVar.bgen \ --phenoFile GSE215221_wPC.phecov \ --phenoCol CKD \ --covarFile GSE215221_wPC.phecov \ --covarColList sex,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \ --bsize 200 \ --bt \ --firth \ --approx \ --pThresh 0.01 \ --pred GSE215221_auto_qcSamp_qcVar.CKD.regenie.step1_pred.list \ --out GSE215221_auto_qcSamp_qcVar.CKD.regenie.step2

Association testing mode with fast multithreading using OpenMP * bgen : [../QC/GSE215221_auto_qcSamp_qcVar.bgen] -summary : bgen file (v1.2 layout, zlib compressed) with 556 named samples and 476252 variants with 8-bit encoding. * phenotypes : [GSE215221_wPC.phecov] n_pheno = 1 -dropping observations with missing values at any of the phenotypes -number of phenotyped individuals with no missing data = 364 * covariates : [GSE215221_wPC.phecov] n_cov = 11 -number of individuals with covariate data = 556 * number of individuals used in analysis = 364 * case-control counts for each trait: - 'CKD': 66 cases and 298 controls * LOCO predictions : [GSE215221_auto_qcSamp_qcVar.CKD.regenie.step1_pred.list] -file [/data/home/jiawei/Projects/demo/ASSOC/GSE215221_auto_qcSamp_qcVar.CKD.regenie.step1_1.loco] for phenotype 'CKD' * # threads : [127] * block size : [200] * # blocks : [2394] * approximate memory usage : 113MB * using minimum MAC of 5 (variants with lower MAC are ignored) * using fast Firth correction for logistic/cox regression p-values less than 0.01

Chromosome 1 [238 blocks in total] -reading loco predictions for the chromosome...done (0ms) -fitting null logistic regression on binary phenotypes...done (0ms) -fitting null Firth logistic regression on binary phenotypes...done (0ms) block [1/2394] : done (6ms) block [2/2394] : done (0ms) block [3/2394] : done (0ms) ... block [2392/2394] : done (0ms) block [2393/2394] : done (0ms) block [2394/2394] : done (0ms)

Association results stored separately for each trait in files : * [GSE215221_auto_qcSamp_qcVar.CKD.regenie.step2_CKD.regenie]

Number of tests with Firth correction : 55 Number of failed tests : (0/55) Number of ignored tests due to low MAC : 282

Elapsed time : 1.83822s End time: Mon Mar 31 12:59:35 2025

GSE215221_auto_qcSamp_qcVar.CKD.regenie.step2_CKD.regenie
CHROM GENPOS ID ALLELE0 ALLELE1 A1FREQ INFO N TEST BETA SE CHISQ LOG10P EXTRA
1 607880 Affx-116685712 A G 0.0482094 1 363 ADD 0.371931 0.512395 0.526884 0.329829 NA
1 629241 Affx-11739511 C T 0.0631868 1 364 ADD 0.41202 0.446452 0.8517 0.448462 NA
1 784860 Affx-13498574 C T 0.146978 1 364 ADD 0.0470915 0.326839 0.0207595 0.0528428 NA
...
Miami plot

Alt Text

post-GWAS

Functional mapping (ANNOVAR)

cd ../ANNOVAR

prepare input of ANNOVAR

awk '$21<1e-5 {print $1,$2,$2,$4,$5,$3}' ../ASSOC/GSE215221.CKD.glm.logistic.hybrid > GSE215221.CKD.avinput
# in GSE215221.CKD.glm.logistic.hybrid, the p-value is in the 21th column
GSE215221.CKD.avinput
1 11112428 11112428 A G Affx-4191030
2 233772770 233772770 C T Affx-19507335
2 233772898 233772898 C G Affx-19507337
2 233772999 233772999 C G Affx-19507343
6 77637820 77637820 A G Affx-75366404
13 54109045 54109045 G A Affx-9497191
../tools/annovar/table_annovar.pl GSE215221.CKD.avinput ../tools/annovar/humandb/ -buildver hg38 -out GSE215221.CKD -remove \
-protocol refGene,cytoBand,EAS.sites.2015_08,exac03,avsnp151,dbnsfp47c,gwasCatalog,clinvar_20240611,icgc28,cosmic70 -operation gx,r,f,f,f,f,r,f,f,f \
-nastring . -csvout -polish -xref ../tools/annovar/example/gene_fullxref.txt

Gene-based analysis (MAGMA)

cd ../MAGMA

select CHS (Han Chinese South) for reference data

plink --bfile ../tools/magma_v1.10_static/g1000_eas/g1000_eas --keep ../tools/magma_v1.10_static/g1000_subpop/eas_chs.keep --maf 1e-5 --make-bed --out g1000_eas_chs

create annotation file for the data

plink2 --bfile ../QC/GSE215221_auto_qcSamp_qcVar --update-name ../marker_info.txt 2 1 --make-just-bim --out GSE215221_auto_qcSamp_qcVar

grep -v NA GSE215221_auto_qcSamp_qcVar.bim > tmp && mv tmp GSE215221_auto_qcSamp_qcVar.bim

../tools/magma_v1.10_static/magma --annotate \
--snp-loc GSE215221_auto_qcSamp_qcVar.bim \
--gene-loc ../tools/magma_v1.10_static/NCBI38.gene.loc \
--out GSE215221_auto_qcSamp_qcVar
GSE215221_auto_qcSamp_qcVar.genes.annot
148398 1:925741:944581 rs139858754 rs6680268 rs145857933 rs2340588 rs2340587 rs2880024 rs76964081 rs113171913 rs2340591 rs765383278
26155 1:944203:959299 rs2272757 rs10465242 rs117770195 rs13303010
339451 1:959952:965720 rs6696971 rs3935066
...

prepare input of MAGMA

awk '{print $3,$21,$17}' ../ASSOC/GSE215221.CKD.glm.logistic.hybrid > GSE215221.CKD.txt

# MAGMA needs only 'SNP' and 'P' columns of rsnumber and p-value (can also provide 'OBS_CT' of sample size), in this case, we should convert affy id to rsnumber (Can use R/Python to do it)
awk 'NR==FNR {a[$1]=$0; next} $1 in a {print a[$1], $0}' GSE215221.CKD.txt ../marker_info.txt | awk 'BEGIN {print "ID P OBS_CT affy_snp_id SNP Ref_Allele Alt_Allele"} {print}' > GSE215221.CKD.minput
GSE215221.CKD.minput
ID P OBS_CT affy_snp_id SNP Ref_Allele Alt_Allele
Affx-2643493 0.307699 364 Affx-2643493 rs10466213 C G
Affx-7455925 0.60497 361 Affx-7455925 rs10770943 C G
Affx-3102334 0.748745 362 Affx-3102334 rs10827221 G C
...
../tools/magma_v1.10_static/magma --bfile g1000_eas_chs \
--gene-annot GSE215221_auto_qcSamp_qcVar.genes.annot \
--pval GSE215221.CKD.minput ncol=OBS_CT \
--out GSE215221.CKD
Welcome to MAGMA v1.10 (linux/s)
Using flags:
        --bfile g1000_eas_chs
        --gene-annot GSE215221_auto_qcSamp_qcVar.genes.annot
        --pval GSE215221.CKD.minput ncol=OBS_CT
        --out GSE215221.CKD

Start time is 15:58:52, Wednesday 02 Apr 2025

Loading PLINK-format data... Reading file g1000_eas_chs.fam... 105 individuals read Reading file g1000_eas_chs.bim... 11707649 SNPs read Preparing file g1000_eas_chs.bed...

Reading SNP p-values from file GSE215221.CKD.minput... detected 7 variables in file using variable: SNP (SNP id) using variable: P (p-value) using variable: OBS_CT (sample size; discarding SNPs with N < 50) read 476253 lines from file, containing valid SNP p-values for 424499 SNPs in data (89.13% of lines, 3.626% of SNPs in data) Loading gene annotation from file GSE215221_auto_qcSamp_qcVar.genes.annot... 16137 gene definitions read from file found 15956 genes containing valid SNPs in genotype data

Starting gene analysis... using model: SNPwise-mean writing gene analysis results to file GSE215221.CKD.genes.out writing intermediate output to file GSE215221.CKD.genes.raw

End time is 15:59:08, Wednesday 02 Apr 2025 (elapsed: 00:00:16)

GSE215221.CKD.genes.out
GENE CHR START STOP NSNPS NPARAM N ZSTAT P
148398 1 925741 944581 8 4 364 -0.67723 0.75087
26155 1 944203 959299 4 2 363 1.3106 0.094991
339451 1 959952 965720 2 1 364 1.085 0.13895
...