GSE215221.ped
C0031950 | C0031950 | 0 | 0 | 2 | -9 | - |
- |
G | G | A |
A |
... |
C0040097 | C0040097 | 0 | 0 | 1 | -9 | - |
- |
G | G | A |
A |
... |
C0040101 | C0040101 | 0 | 0 | 1 | -9 | - |
- |
G | G | A |
A |
... |
... |
GSE215221.map
1 | Affx-52345991 |
0 | 13417 |
1 | Affx-474295932 | 0 | 30794 |
1 | Affx-474295945 |
0 | 46434 |
... |
plink --bfile GSE215221 --allow-extra-chr --make-bed --out GSE215221
GSE215221.bed
binary file
GSE215221.fam
C0031950 | C0031950 | 0 | 0 | 2 | -9 |
C0040097 | C0040097 | 0 | 0 | 1 | -9 |
C0040101 | C0040101 | 0 | 0 | 1 | -9 |
... |
GSE215221.bim
1 | Affx-52345991 | 0 | 13417 | 0 | - |
1 | Affx-474295932 | 0 | 30794 | 0 | G |
1 | Affx-474295945 | 0 | 46434 | 0 | A |
... |
GSE215221.phecov
FID | IID | sex | CKD | DKD | DM |
---|---|---|---|---|---|
C0031950 | C0031950 | 2 | 0 | 0 | 0 |
C0040097 | C0040097 | 1 | 0 | 0 | 0 |
... | |||||
C0052956 | C0052956 | 2 | NA | NA | 1 |
C0072227 | C0072227 | 1 | NA | NA | 1 |
... | |||||
C0053169 | C0053169 | 1 | 1 | NA | NA |
C0070273 | C0070273 | 2 | 1 | NA | NA |
... | |||||
C0050500 | C0050500 | 1 | NA | 1 | NA |
C0070273 | C0070273 | 2 | NA | 1 | NA |
... |
marker_info.txt (a reference table for affy id and rsnumber)
affy_snp_id | rs_id | Ref_Allele | Alt_Allele |
---|---|---|---|
Affx-2643493 | rs10466213 | C | G |
Affx-7455925 | rs10770943 | C | G |
Affx-3102334 | rs10827221 | G | C |
... |
mkdir -p materials/1kgp materials/msigdb tools QC PopuStruc ASSOC PRS ANNOVAR MAGMA
bb b tools
bB B bb b plink1.9.zip
bB B bb b plink2.zip
bB B bb b regenie_v4.1.gz_x86_64_Linux.zip
bB B bb b annovar
bB B | bb b humandb
bB B | bb b annotate_variation.pl
bB B | bb b table_annovar.pl
b |B B bb b ...
bB B bb b magma_v1.10_static.zip
bB B bb b magma_v1.10_static
bB B bb b g1000_eas.zip
bB B bb b g1000_subpop.zip
bB B bb b NCBI38.zip
b B B bb b NCBI37.zip
bb b materials
bB B bb b 1kgp
bB B | bb b all_hg38.pgen.zst
b |B B bb b all_hg38.psam
b |B B bb b all_hg38.pvar.zst
bB B bb b msigdb
bb b GSE215221.bed
bb b GSE215221.bim
bb b GSE215221.fam
bb b GSE215221.phecov
bb b marker_info.txt
bb b QC
bb b PopuStruc
bb b ASSOC
bb b ANNOVAR
bb b MAGMA
bb b ...
cd tools
Main tools for quality control and association tests
curl -o plink1.9.zip https://s3.amazonaws.com/plink1-assets/plink_linux_x86_64_20241022.zip
curl -o plink2.zip https://s3.amazonaws.com/plink2-assets/alpha6/plink2_linux_avx2_20250129.zip
# curl -o plink2.zip https://s3.amazonaws.com/plink2-assets/alpha6/plink2_linux_x86_64_20250129.zip
unzip plink1.9.zip plink
unzip plink2.zip plink2
cp plink plink2 /usr/local/bin/
# export PATH=$PATH:$(pwd) # set the current dir to PATH
A computationally efficient GWAS tool for association testing using mixed-effects models to handle complex traits and large datasets.
curl -L -o regenie4.1.zip https://github.com/rgcgithub/regenie/releases/download/v4.1/regenie_v4.1.gz_x86_64_Linux.zip
unzip regenie_v4.1.zip # regenie_v4.1.gz_x86_64_Linux
cp regenie_v4.1.gz_x86_64_Linux /usr/local/bin/regenie
A bioinformatics tool for annotating genetic variants, which supports gene-based, region-based, and filter-based annotations
Please register for download
tar xvfz annovar.latest.tar.gz
cd annovar
# download reference database
./annotate_variation.pl -buildver hg38 -downdb -webfrom annovar avdblist humandb/
# more humandb/hg38_avdblist.txt # to check the downloadable reference database
./annotate_variation.pl -buildver hg38 -downdb -webfrom annovar refGene humandb/
./annotate_variation.pl -buildver hg38 -downdb -webfrom annovar 1000g2015aug humandb/ # take time
./annotate_variation.pl -buildver hg38 -downdb -webfrom annovar exac03 humandb/
./annotate_variation.pl -buildver hg38 -downdb -webfrom annovar avsnp151 humandb/ # take time
./annotate_variation.pl -buildver hg38 -downdb -webfrom annovar dbnsfp47c humandb/ # take time
./annotate_variation.pl -buildver hg38 -downdb -webfrom annovar clinvar_20240611 humandb/
./annotate_variation.pl -buildver hg38 -downdb -webfrom annovar icgc28 humandb/
./annotate_variation.pl -buildver hg38 -downdb -webfrom annovar cosmic70 humandb/
./annotate_variation.pl -buildver hg38 -downdb cytoBand humandb/
./annotate_variation.pl -buildver hg38 -downdb gwasCatalog humandb/
./annotate_variation.pl -buildver hg38 -downdb gtexGeneModelV8 humandb/
cd ..
A bioinformatics tool for annotating genetic variants, which supports gene-based, region-based, and filter-based annotations
# main program
curl -o magma_v1.10_static.zip https://vu.data.surfsara.nl/index.php/s/lxDgt2dNdNr6DYt/download
unzip magma_v1.10_static.zip -d magma_v1.10_static
cd magma_v1.10_static
# LD information
curl -o g1000_eas.zip https://vu.data.surfsara.nl/index.php/s/dz6PYdKOi3xVqHn/download
curl -o g1000_subpop.zip https://vu.data.surfsara.nl/index.php/s/alUE9KnBwCeml2S/download
unzip g1000_eas.zip -d g1000_eas
unzip g1000_subpop.zip -d g1000_subpop
# gene annotation
curl -o NCBI38.zip https://vu.data.surfsara.nl/index.php/s/yj952iHqy5anYhH/download
curl -o NCBI37.zip https://vu.data.surfsara.nl/index.php/s/Pj2orwuF2JYyKxq/download
unzip NCBI38.zip NCBI38.gene.loc
unzip NCBI37.zip NCBI37.3.gene.loc
# gene-set annotation (refer to Materials > MSigDB)
cd ..
Download the 1000 Genome Project PLINK-formatted genotype data for subsequent use in divergent ancestry check during sample QC
cd ../materials/1kgp
# 2022-08-04 Byrska-Bishop et al. (build 38, 3202 samples, contigs unphased), uncheck all
curl -L -o all_hg38.pgen.zst https://www.dropbox.com/s/j72j6uciq5zuzii/all_hg38.pgen.zst?dl=1
curl -L -o all_hg38_noannot.pvar.zst https://www.dropbox.com/s/ngbo2xm5ojw9koy/all_hg38_noannot.pvar.zst?dl=1
curl -L -o hg38_orig.psam https://www.dropbox.com/s/gyobtdi904m9bir/hg38_orig.psam?dl=1
mv all_hg38_noannot.pvar.zst all_hg38.pvar.zst
mv hg38_orig.psam all_hg38.psam
plink2 --zst-decompress all_hg38.pgen.zst > all_hg38.
cd ..
#IID | PAT | MAT | SEX | SuperPop | Population |
---|---|---|---|---|---|
HG00096 | 0 | 0 | 1 | EUR | GBR |
HG00097 | 0 | 0 | 2 | EUR | GBR |
HG00099 | 0 | 0 | 2 | EUR | GBR |
... |
The Molecular Signatures Database (MSigDB) is a resource of tens of thousands of annotated gene sets for use with GSEA software; here is used for MAGMA to perform gene-set analysis
Please login and go to the download page download the following file
unzip -j msigdb_v2024.1.Hs_files_to_download_locally.zip -d msigdb_v2024.1.Hs_GMTs
cd ..
cd ../QC
plink --bfile ../GSE215221 --allow-extra-chr --chr 1-23 --make-bed --out GSE215221_wg
plink --bfile ../GSE215221 --allow-extra-chr --chr 1-22 --make-bed --out GSE215221_auto
PLINK v1.90b6.24 64-bit (6 Jun 2021) www.cog-genomics.org/plink/1.9/ (C) 2005-2021 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_wg.log. Options in effect: --allow-extra-chr --bfile ../GSE215221 --chr 1-23 --make-bed --out GSE215221_wg1031497 MB RAM detected; reserving 515748 MB for main workspace. 666891 out of 681232 variants loaded from .bim file. 616 people (286 males, 330 females) loaded from .fam. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 616 founders and 0 nonfounders present. Calculating allele frequencies... done. Warning: 66611 het. haploid genotypes present (see GSE215221_wg.hh ); many commands treat these as missing. Total genotyping rate is 0.998695. 666891 variants and 616 people pass filters and QC. Note: No phenotypes present. --make-bed to GSE215221_wg.bed + GSE215221_wg.bim + GSE215221_wg.fam ... done.
PLINK v1.90b6.24 64-bit (6 Jun 2021) www.cog-genomics.org/plink/1.9/ (C) 2005-2021 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto.log. Options in effect: --allow-extra-chr --bfile ../GSE215221 --chr 1-22 --make-bed --out GSE215221_auto1031497 MB RAM detected; reserving 515748 MB for main workspace. 643809 out of 681232 variants loaded from .bim file. 616 people (286 males, 330 females) loaded from .fam. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 616 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is 0.998693. 643809 variants and 616 people pass filters and QC. Note: No phenotypes present. --make-bed to GSE215221_auto.bed + GSE215221_auto.bim + GSE215221_auto.fam ... done.
plink --bfile GSE215221_wg --check-sex --out GSE215221
grep "PROBLEM" GSE215221.sexcheck > rmInd_sex.txt
# R CMD BATCH --args --process=1 QC.R
plink2 --bfile GSE215221_auto --remove rmInd_sex.txt --write-samples --out GSE215221_auto_1
PLINK v1.90b6.24 64-bit (6 Jun 2021) www.cog-genomics.org/plink/1.9/ (C) 2005-2021 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221.log. Options in effect: --bfile GSE215221_wg --check-sex --out GSE2152211031497 MB RAM detected; reserving 515748 MB for main workspace. 666891 variants loaded from .bim file. 616 people (286 males, 330 females) loaded from .fam. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 616 founders and 0 nonfounders present. Calculating allele frequencies... done. Warning: 66611 het. haploid genotypes present (see GSE215221.hh ); many commands treat these as missing. Total genotyping rate is 0.998695. 666891 variants and 616 people pass filters and QC. Note: No phenotypes present. --check-sex: 13879 Xchr and 0 Ychr variant(s) scanned, 6 problems detected. Report written to GSE215221.sexcheck .
FID | IID | PEDSEX | SNPSEX | STATUS | F |
---|---|---|---|---|---|
C0031950 | C0031950 | 2 | 2 | OK | 0.02416 |
C0040097 | C0040097 | 1 | 1 | OK | 0.9088 |
... | |||||
C0051121 | C0051121 | 2 | 1 | PROBLEM | 0.8952 |
... |
FID | IID | PEDSEX | SNPSEX | STATUS | F |
---|---|---|---|---|---|
C0051121 |
C0051121 |
2 | 1 | PROBLEM | 0.8952 |
C0200431 |
C0200431 |
2 | 1 | PROBLEM | 0.91 |
C0200506 |
C0200506 |
2 | 1 | PROBLEM | 0.9138 |
C0200624 |
C0200624 |
2 | 1 | PROBLEM | 0.9121 |
C0200711 |
C0200711 |
1 | 2 | PROBLEM | -0.04655 |
N1100157 |
N1100157 |
1 | 2 | PROBLEM | -0.0021 |
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022) www.cog-genomics.org/plink/2.0/ (C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_1.log. Options in effect: --bfile GSE215221_auto --out GSE215221_auto_1 --remove rmInd_sex.txt --write-samplesStart time: Mon Mar 17 10:38:08 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 616 samples (330 females, 286 males; 616 founders) loaded from GSE215221_auto.fam. Note: No phenotype data present. --remove: 610 samples remaining.
610
samples (326 females, 284 males; 610 founders) remaining after main filters. --write-samples: Sample IDs written to GSE215221_auto_1.id . End time: Mon Mar 17 10:38:08 2025
#FID | IID |
---|---|
C0031950 | C0031950 |
C0040097 | C0040097 |
C0040101 | C0040101 |
... |
plink --bfile GSE215221_auto --keep GSE215221_auto_1.id --het --missing --out GSE215221_auto_1
R CMD BATCH --args --process=2 --imiss_thre=0.05 --het_thre=1 QC.R
plink2 --bfile GSE215221_auto --keep GSE215221_auto_1.id --remove rmInd_missing_het.txt --write-samples --out GSE215221_auto_2
PLINK v1.90b6.24 64-bit (6 Jun 2021) www.cog-genomics.org/plink/1.9/ (C) 2005-2021 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_1.log. Options in effect: --bfile GSE215221_auto --het --keep GSE215221_auto_1.id --missing --out GSE215221_auto_11031497 MB RAM detected; reserving 515748 MB for main workspace. 643809 variants loaded from .bim file. 616 people (286 males, 330 females) loaded from .fam. --keep: 610 people remaining. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 610 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate in remaining samples is 0.998689. --missing: Sample missing data report written to GSE215221_auto_1.imiss, and variant-based missing data report written to GSE215221_auto_1.lmiss. 643809 variants and 610 people pass filters and QC. Note: No phenotypes present. --het: 516907 variants scanned, report written to GSE215221_auto_1.het .
FID | IID | MISS_PHENO | N_MISS | N_GENO | F_MISS |
---|---|---|---|---|---|
C0031950 | C0031950 | Y | 273 | 643809 | 0.000424 |
C0040097 | C0040097 | Y | 264 | 643809 | 0.0004101 |
C0040101 | C0040101 | Y | 221 | 643809 | 0.0003433 |
... |
FID | IID | O(HOM) | E(HOM) | N(NM) | F |
---|---|---|---|---|---|
C0031950 | C0031950 | 416363 | 4.166e+05 | 516655 | -0.00224 |
C0040097 | C0040097 | 416135 | 4.166e+05 | 516658 | -0.00459 |
C0040101 | C0040101 | 415233 | 4.166e+05 | 516698 | -0.01386 |
... |
FID | IID | missing_rate | heter._rate |
---|---|---|---|
C0200416 |
C0200416 |
0.000703625 | 0.184917343370625 |
C0200436 |
C0200436 |
0.000382101 | 0.181708740344473 |
C0200540 |
C0200540 |
0.00103757 | 0.18563512155061 |
C0200550 |
C0200550 |
0.00103913 | 0.182199031975038 |
C0200609 |
C0200609 |
0.00162626 | 0.186122600878434 |
C0200721 |
C0200721 |
0.000489276 | 0.182925035229261 |
C0200804 |
C0200804 |
0.0200681 | 0.208415008689993 |
N1000223 |
N1000223 |
0.000660134 | 0.186715765710823 |
N1100212 |
N1100212 |
0.00131871 | 0.186438707577366 |
Y2000147 |
Y2000147 |
0.000779734 | 0.182000379534567 |
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022) www.cog-genomics.org/plink/2.0/ (C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_2.log. Options in effect: --bfile GSE215221_auto --keep GSE215221_auto_1.id --out GSE215221_auto_2 --remove rmInd_missing_het.txt --write-samplesStart time: Mon Mar 17 15:50:54 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 616 samples (330 females, 286 males; 616 founders) loaded from GSE215221_auto.fam. Note: No phenotype data present. --keep: 610 samples remaining. --remove: 600 samples remaining.
600
samples (321 females, 279 males; 600 founders) remaining after main filters. --write-samples: Sample IDs written to GSE215221_auto_2.id . End time: Mon Mar 17 15:50:54 2025
plink --bfile GSE215221_auto --keep GSE215221_auto_2.id --indep-pairwise 50 5 0.2 --out GSE215221_auto_2
plink2 --bfile GSE215221_auto --keep GSE215221_auto_2.id --extract GSE215221_auto_2.prune.in --make-king-table --king-cutoff 0.0442 --out GSE215221_auto_2
plink2 --bfile GSE215221_auto --keep GSE215221_auto_2.id --remove GSE215221_auto_2.king.cutoff.out.id --write-samples --out GSE215221_auto_3
PLINK v1.90b6.24 64-bit (6 Jun 2021) www.cog-genomics.org/plink/1.9/ (C) 2005-2021 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_2.log. Options in effect: --bfile GSE215221_auto --indep-pairwise 50 5 0.2 --keep GSE215221_auto_2.id --out GSE215221_auto_21031497 MB RAM detected; reserving 515748 MB for main workspace. 643809 variants loaded from .bim file. 616 people (286 males, 330 females) loaded from .fam. --keep: 600 people remaining. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 600 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate in remaining samples is 0.998714. 643809 variants and 600 people pass filters and QC. Note: No phenotypes present. Pruned 38300 variants from chromosome 1, leaving 23190. Pruned 28467 variants from chromosome 2, leaving 22028. Pruned 23561 variants from chromosome 3, leaving 18771. Pruned 17831 variants from chromosome 4, leaving 16740. Pruned 18666 variants from chromosome 5, leaving 16384. Pruned 32273 variants from chromosome 6, leaving 17203. Pruned 17673 variants from chromosome 7, leaving 14790. Pruned 15003 variants from chromosome 8, leaving 13973. Pruned 21532 variants from chromosome 9, leaving 13285. Pruned 16440 variants from chromosome 10, leaving 13824. Pruned 18434 variants from chromosome 11, leaving 12993. Pruned 15384 variants from chromosome 12, leaving 13215. Pruned 12007 variants from chromosome 13, leaving 9597. Pruned 10153 variants from chromosome 14, leaving 9063. Pruned 11722 variants from chromosome 15, leaving 9319. Pruned 12847 variants from chromosome 16, leaving 10047. Pruned 16555 variants from chromosome 17, leaving 9512. Pruned 8703 variants from chromosome 18, leaving 8858. Pruned 13131 variants from chromosome 19, leaving 7175. Pruned 7491 variants from chromosome 20, leaving 7667. Pruned 4522 variants from chromosome 21, leaving 4327. Pruned 6381 variants from chromosome 22, leaving 4772. Pruning complete. 367076 of 643809 variants removed. Marker lists written to GSE215221_auto_2.prune.in and GSE215221_auto_2.prune.out .
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022) www.cog-genomics.org/plink/2.0/ (C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_2.log. Options in effect: --bfile GSE215221_auto --extract GSE215221_auto_2.prune.in --keep GSE215221_auto_2.id --king-cutoff 0.0442 --out GSE215221_auto_2Start time: Mon Mar 17 16:10:27 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 616 samples (330 females, 286 males; 616 founders) loaded from GSE215221_auto.fam. 643809 variants loaded from GSE215221_auto.bim. Note: No phenotype data present. --extract: 276733 variants remaining. --keep: 600 samples remaining. 600 samples (321 females, 279 males; 600 founders) remaining after main filters. 276733 variants remaining after main filters. --king-cutoff pass 1/1: Scanning for rare variants... done. 61011 variants handled by initial scan (215722 remaining). --king-cutoff pass 1/1: Condensing... done. --king-cutoff: 276733 variants processed. --king-cutoff: Excluded sample IDs written to GSE215221_auto_2.king.cutoff.out.id , and 558 remaining sample IDs written to GSE215221_auto_2.king.cutoff.in.id . End time: Mon Mar 17 16:10:27 2025
FID | IID |
---|---|
C0070249 | C0070249 |
C0072289 | C0072289 |
C0072303 | C0072303 |
... |
Kinship Coefficient Range | Relationship |
---|---|
> 0.354 | Duplicate/MZ Twin |
[0.177, 0.354] | 1st-Degree |
[0.0884, 0.177] | 2nd-Degree |
[0.0442, 0.0884] | 3rd-Degree |
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022) www.cog-genomics.org/plink/2.0/ (C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_3.log. Options in effect: --bfile GSE215221_auto --keep GSE215221_auto_2.id --out GSE215221_auto_3 --remove GSE215221_auto_2.king.cutoff.out.id --write-samplesStart time: Mon Mar 17 16:14:11 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 616 samples (330 females, 286 males; 616 founders) loaded from GSE215221_auto.fam. Note: No phenotype data present. --keep: 600 samples remaining. --remove: 558 samples remaining.
558
samples (300 females, 258 males; 558 founders) remaining after main filters. --write-samples: Sample IDs written to GSE215221_auto_3.id . End time: Mon Mar 17 16:14:11 2025
extract variants in target data from 1000 genomes project data
awk '{print $1,$4,$4,$2}' GSE215221_auto.bim > lst_var.txt
plink2 --pfile ../materials/1kgp/all_hg38 vzs --allow-extra-chr --extract bed1 lst_var.txt --snps-only --max-alleles 2 --set-all-var-ids @:# --rm-dup exclude-all --make-bed --out all_hg38_extract
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022) www.cog-genomics.org/plink/2.0/ (C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to all_hg38_extract.log. Options in effect: --allow-extra-chr --extract bed1 lst_var.txt --make-bed --max-alleles 2 --out all_hg38_extract --pfile ../materials/1kgp/all_hg38 vzs --rm-dup exclude-all --set-all-var-ids @:# --snps-onlyStart time: Tue Mar 18 14:09:21 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 3202 samples (1604 females, 1598 males; 2594 founders) loaded from ../materials/1kgp/all_hg38.psam. 65275489 out of 75193455 variants loaded from ../materials/1kgp/all_hg38.pvar.zst. 2 categorical phenotypes loaded. --rm-dup: 641751 duplicated IDs, 1291084 variants removed. --extract bed1: 63461330 variants excluded. 523075 variants remaining after main filters. Writing all_hg38_extract.fam ... done. Writing all_hg38_extract.bim ... done. Writing all_hg38_extract.bed ... done. End time: Tue Mar 18 14:09:30 2025
1 | 1:653347 |
0 | 653347 | A | G |
1 | 1:784860 |
0 | 784860 | C | T |
1 | 1:794577 |
0 | 794577 | C | T |
... |
change SNP id representation (PLINK merge data by SNP id)
plink2 --bfile GSE215221_auto --keep GSE215221_auto_3.id --set-all-var-ids @:# --rm-dup exclude-all --make-bed --out GSE215221_auto_3_
sed -i "s/\./0/g;" GSE215221_auto_3_.bim
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022) www.cog-genomics.org/plink/2.0/ (C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_3_.log. Options in effect: --bfile GSE215221_auto --keep GSE215221_auto_3.id --make-bed --out GSE215221_auto_3_ --rm-dup exclude-all --exclude-all --set-all-var-ids @:#Start time: Tue Mar 18 14:28:05 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 616 samples (330 females, 286 males; 616 founders) loaded from GSE215221_auto.fam. 643809 variants loaded from GSE215221_auto.bim. Note: No phenotype data present. --rm-dup: 3377 duplicated IDs, 7377 variants removed. --keep: 558 samples remaining. 558 samples (300 females, 258 males; 558 founders) remaining after main filters. Writing GSE215221_auto_3_.fam ... done. Writing GSE215221_auto_3_.bim ... done. Writing GSE215221_auto_3_.bed ... done. End time: Tue Mar 18 14:28:05 2025
1 | 1:13417 |
0 | 13417 | 0 | - |
1 | 1:30794 |
0 | 30794 | 0 | G |
1 | 1:46434 |
0 | 46434 | 0 | A |
... |
merge target data and 1000 genomes project data
plink --bfile GSE215221_auto_3_ --bmerge all_hg38_extract --make-bed --out tmp || true
if [ -f tmp-merge.missnp ]; then
plink --bfile all_hg38_extract --flip tmp-merge.missnp --make-bed --out all_hg38_extract_flip
plink --bfile GSE215221_auto_3_ --bmerge all_hg38_extract_flip --make-bed --out GSE215221_auto_3_1 || true
if [ -f GSE215221_auto_3_1-merge.missnp ]; then
plink --bfile all_hg38_extract_flip --exclude GSE215221_auto_3_1-merge.missnp --make-bed --out all_hg38_extract_flip_2
plink --bfile GSE215221_auto_3_ --bmerge all_hg38_extract_flip_2 --make-bed --out GSE215221_auto_3_1
fi
else
mv tmp.bed GSE215221_auto_3_1.bed
mv tmp.fam GSE215221_auto_3_1.fam
mv tmp.bim GSE215221_auto_3_1.bim
# plink --bfile GSE215221_auto_3_ --bmerge all_hg38_extract --make-bed --out GSE215221_auto_3_1
fi
awk '($5=="A" && $6=="T") || ($5=="T" && $6=="A") || ($5=="C" && $6=="G") || ($5=="G" && $6=="C") {print $2,$5,$6}' GSE215221_auto_3_1.bim > lst_AT_CG.txt
n1=$(wc -l < GSE215221_auto_3.id)
n2=$(wc -l < all_hg38_extract.fam)
thres=$(bc -l <<< "$((n1 <= n2 ? n1 : n2))/($n1+$n2)")
plink --bfile GSE215221_auto_3_1 --geno 0.05 --exclude lst_AT_CG.txt --make-bed --out GSE215221_auto_3_2
plink2 --bfile GSE215221_auto_3_2 --geno 0.05 --maf 0.01 --mind 0.05 --pca --out GSE215221_auto_3_2
R CMD BATCH --args --process=4 --ancestry_info=../materials/1kgp/all_hg38.psam --data_name=Demo --confid=0.9999 QC.R
plink --bfile GSE215221_auto --keep GSE215221_auto_3.id --remove rmInd_divAncestry.txt --make-bed --out GSE215221_auto_qcSamp
PLINK v1.90b6.24 64-bit (6 Jun 2021) www.cog-genomics.org/plink/1.9/ (C) 2005-2021 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to tmp.log. Options in effect: --bfile GSE215221_auto_3_ --bmerge all_hg38_extract --make-bed --out tmp1031497 MB RAM detected; reserving 515748 MB for main workspace. 558 people loaded from GSE215221_auto_3_.fam. 3202 people to be merged from all_hg38_extract.fam. Of these, 3202 are new, while 0 are present in the base dataset. 636432 markers loaded from GSE215221_auto_3_.bim. 523075 markers to be merged from all_hg38_extract.bim. Of these, 290 are new, while 522785 are present in the base dataset. Error: 784 variants with 3+ alleles present. * If you believe this is due to strand inconsistency, try --flip with tmp-merge.missnp. (Warning: if this seems to work, strand errors involving SNPs with A/T or C/G alleles probably remain in your data. If LD between nearby SNPs is high, --flip-scan should detect them.) * If you are dealing with genuine multiallelic variants, we recommend exporting that subset of the data to VCF (via e.g. '--recode vcf'), merging with another tool/script, and then importing the result; PLINK is not yet suited to handling them. See https://www.cog-genomics.org/plink/1.9/data#merge3 for more discussion.
PLINK v1.90b6.24 64-bit (6 Jun 2021) www.cog-genomics.org/plink/1.9/ (C) 2005-2021 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to all_hg38_extract_flip.log. Options in effect: --bfile all_hg38_extract --flip tmp-merge.missnp --make-bed --out all_hg38_extract_flip1031497 MB RAM detected; reserving 515748 MB for main workspace. 523075 variants loaded from .bim file. 3202 people (1598 males, 1604 females) loaded from .fam. --flip: 784 SNPs flipped. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 2594 founders and 608 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is in [0.9999995, 1). 523075 variants and 3202 people pass filters and QC. Note: No phenotypes present. --make-bed to all_hg38_extract_flip.bed + all_hg38_extract_flip.bim + all_hg38_extract_flip.fam ... done.
PLINK v1.90b6.24 64-bit (6 Jun 2021) www.cog-genomics.org/plink/1.9/ (C) 2005-2021 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_3_1.log. Options in effect: --bfile GSE215221_auto_3_ --bmerge all_hg38_extract_flip --make-bed --out GSE215221_auto_3_11031497 MB RAM detected; reserving 515748 MB for main workspace. 558 people loaded from GSE215221_auto_3_.fam. 3202 people to be merged from all_hg38_extract_flip.fam. Of these, 3202 are new, while 0 are present in the base dataset. 636432 markers loaded from GSE215221_auto_3_.bim. 523075 markers to be merged from all_hg38_extract_flip.bim. Of these, 290 are new, while 522785 are present in the base dataset. Error: 771 variants with 3+ alleles present.
PLINK v1.90b6.24 64-bit (6 Jun 2021) www.cog-genomics.org/plink/1.9/ (C) 2005-2021 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to all_hg38_extract_flip_2.log. Options in effect: --bfile all_hg38_extract_flip --exclude GSE215221_auto_3_1-merge.missnp --make-bed --out all_hg38_extract_flip_21031497 MB RAM detected; reserving 515748 MB for main workspace. 523075 variants loaded from .bim file. 3202 people (1598 males, 1604 females) loaded from .fam. --exclude: 522304 variants remaining. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 2594 founders and 608 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is in [0.9999995, 1). 522304 variants and 3202 people pass filters and QC. Note: No phenotypes present. --make-bed to all_hg38_extract_flip_2.bed + all_hg38_extract_flip_2.bim + all_hg38_extract_flip_2.fam ... done.
PLINK v1.90b6.24 64-bit (6 Jun 2021) www.cog-genomics.org/plink/1.9/ (C) 2005-2021 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_3_1.log. Options in effect: --bfile GSE215221_auto_3_ --bmerge all_hg38_extract_flip_2 --make-bed --out GSE215221_auto_3_11031497 MB RAM detected; reserving 515748 MB for main workspace. 558 people loaded from GSE215221_auto_3_.fam. 3202 people to be merged from all_hg38_extract_flip_2.fam. Of these, 3202 are new, while 0 are present in the base dataset. 636432 markers loaded from GSE215221_auto_3_.bim. 522304 markers to be merged from all_hg38_extract_flip_2.bim. Of these, 290 are new, while 522014 are present in the base dataset. Performing single-pass merge (3760 people, 636722 variants). Merged fileset written to GSE215221_auto_3_1-merge.bed + GSE215221_auto_3_1-merge.bim + GSE215221_auto_3_1-merge.fam . 636722 variants loaded from .bim file. 3760 people (1856 males, 1904 females) loaded from .fam. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 3152 founders and 608 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is 0.846715. 636722 variants and 3760 people pass filters and QC. Note: No phenotypes present. --make-bed to GSE215221_auto_3_1.bed + GSE215221_auto_3_1.bim + GSE215221_auto_3_1.fam ... done.
1:819064 | G | C |
1:835427 | G | C |
1:858542 | C | G |
... |
1031497 MB RAM detected; reserving 515748 MB for main workspace. 636722 variants loaded from .bim file. 3760 people (1856 males, 1904 females) loaded from .fam. --exclude: 600773 variants remaining. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 3152 founders and 608 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is 0.83899. 113658 variants removed due to missing genotype data (--geno). 487115 variants and 3760 people pass filters and QC. Note: No phenotypes present. --make-bed to GSE215221_auto_3_2.bed + GSE215221_auto_3_2.bim + GSE215221_auto_3_2.fam ... done.
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022) www.cog-genomics.org/plink/2.0/ (C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_3_2.log. Options in effect: --bfile GSE215221_auto_3_2 --geno 0.05 --maf 0.01 --mind 0.05 --out GSE215221_auto_3_2 --pcaStart time: Tue Mar 18 15:54:46 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 3760 samples (1904 females, 1856 males; 3152 founders) loaded from GSE215221_auto_3_2.fam. 487115 variants loaded from GSE215221_auto_3_2.bim. Note: No phenotype data present. Calculating sample missingness rates... done. 0 samples removed due to missing genotype data (--mind). 3760 samples (1904 females, 1856 males; 3152 founders) remaining after main filters. Calculating allele frequencies... done. --geno: 0 variants removed due to missing genotype data. 111335 variants removed due to allele frequency threshold(s) (--maf/--max-maf/--mac/--max-mac). 375780 variants remaining after main filters. Constructing GRM: done. Correcting for missingness... done. Extracting eigenvalues and eigenvectors... done. --pca: Eigenvectors written to GSE215221_auto_3_2.eigenvec , and eigenvalues written to GSE215221_auto_3_2.eigenval . End time: Tue Mar 18 15:55:08 2025
275.442 |
160.936 |
41.4036 |
30.9307 |
6.80762 |
4.7665 |
4.38759 |
4.13442 |
4.07024 |
3.82537 |
#FID | IID | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | HG00096 | 0.00182827 | 0.0254425 | 0.00250875 | -0.0234405 | -0.000501691 | -0.0129464 | 0.0118278 | 0.000532008 | -0.0223398 | -0.0612923 |
0 | HG00097 | 0.00131385 | 0.0249048 | 0.00346359 | -0.0209927 | -0.000600105 | 0.0148235 | 0.0121476 | -0.00126195 | -0.00914142 | 0.00890024 |
0 | HG00099 | 0.00144112 | 0.0253341 | 0.0024523 | -0.0202484 | -0.00104474 | -0.000259693 | 0.0332289 | -0.00905011 | -0.0387468 | -0.0657115 |
... | |||||||||||
C0031950 | C0031950 | -0.0172785 | -0.0162383 | 0.00185633 | -0.00683732 | -0.00549263 | 0.00432971 | -0.00167519 | -0.0154494 | 0.00673437 | -0.03074 |
C0040097 | C0040097 | -0.0169158 | -0.0156269 | 0.000739937 | -0.00632163 | -0.00335534 | 0.00345126 | -0.00300809 | -0.0162726 | 0.0123476 | -0.000880811 |
C0040101 | C0040101 | -0.0171392 | -0.0157113 | 0.00160724 | -0.00565268 | -0.00327237 | 0.00708489 | -0.00134046 | -0.020669 | 0.0113995 | -0.00920717 |
... |
C0062984 | C0062984 |
C0200505 | C0200505 |
PLINK v1.90b6.24 64-bit (6 Jun 2021) www.cog-genomics.org/plink/1.9/ (C) 2005-2021 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_qcSamp.log. Options in effect: --bfile GSE215221_auto --keep GSE215221_auto_3.id --make-bed --out GSE215221_auto_qcSamp --remove rmInd_divAncestry.txt1031497 MB RAM detected; reserving 515748 MB for main workspace. 643809 variants loaded from .bim file. 616 people (286 males, 330 females) loaded from .fam. --keep: 558 people remaining. --remove: 556 people remaining. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 556 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate in remaining samples is 0.998744. 643809 variants and
556
people pass filters and QC. Note: No phenotypes present. --make-bed to GSE215221_auto_qcSamp.bed + GSE215221_auto_qcSamp.bim + GSE215221_auto_qcSamp.fam ... done.
plink --bfile GSE215221_auto_qcSamp --geno 0.05 --pheno ../GSE215221.phecov --pheno-name CKD --1 --test-missing --out GSE215221_auto_qcSamp
n=$(wc -l < GSE215221_auto_qcSamp.missing) # 188976 (-1)
thres=$(bc -l <<< "(0.05/($n-1))")
awk -v thres="$thres" '$5<thres {print $2}' GSE215221_auto_qcSamp.missing > rmSNP_missingTest.txt
plink --bfile GSE215221_auto_qcSamp --geno 0.05 --exclude rmSNP_missingTest.txt --write-snplist --out GSE215221_auto_5
PLINK v1.90b6.24 64-bit (6 Jun 2021) www.cog-genomics.org/plink/1.9/ (C) 2005-2021 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_qcSamp.log. Options in effect: --1 --bfile GSE215221_auto_qcSamp --geno 0.05 --out GSE215221_auto_qcSamp --pheno ../GSE215221.phecov --pheno-name CKD --test-missing1031497 MB RAM detected; reserving 515748 MB for main workspace. 643809 variants loaded from .bim file. 556 people (258 males, 298 females) loaded from .fam. 364 phenotype values present after --pheno. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 556 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is 0.998744.
14
variants removed due to missing genotype data (--geno). 643795 variants and 556 people pass filters and QC. Among remaining phenotypes, 66 are cases and 298 are controls. (192 phenotypes are missing.) Writing --test-missing report to GSE215221_auto_qcSamp.missing ... done.
CHR | SNP | F_MISS_A | F_MISS_U | P |
---|---|---|---|---|
1 | Affx-52345991 | 0.01515 | 0 | 0.1813 |
1 | Affx-116685712 | 0 | 0.003356 | 1 |
1 | Affx-474297229 | 0 | 0.003356 | 1 |
... | ||||
PLINK v1.90b6.24 64-bit (6 Jun 2021) www.cog-genomics.org/plink/1.9/ (C) 2005-2021 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_5.log. Options in effect: --bfile GSE215221_auto_qcSamp --exclude rmSNP_missingTest.txt --geno 0.05 --out GSE215221_auto_5 --write-snplist1031497 MB RAM detected; reserving 515748 MB for main workspace. 643809 variants loaded from .bim file. 556 people (258 males, 298 females) loaded from .fam. --exclude: 643809 variants remaining. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 556 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is 0.998744. 14 variants removed due to missing genotype data (--geno). 643795 variants and 556 people pass filters and QC. Note: No phenotypes present. List of variant IDs written to GSE215221_auto_5.snplist .
Affx-52345991 |
Affx-474295932 |
Affx-474295945 |
... |
plink --bfile GSE215221_auto_qcSamp --extract GSE215221_auto_5.snplist --maf 0.01 --write-snplist --out GSE215221_auto_6
PLINK v1.90b6.24 64-bit (6 Jun 2021) www.cog-genomics.org/plink/1.9/ (C) 2005-2021 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_6.log. Options in effect: --bfile GSE215221_auto_qcSamp --extract GSE215221_auto_5.snplist --maf 0.01 --out GSE215221_auto_6 --write-snplist1031497 MB RAM detected; reserving 515748 MB for main workspace. 643809 variants loaded from .bim file. 556 people (258 males, 298 females) loaded from .fam. --extract: 643795 variants remaining. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 556 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is 0.998745. 167393 variants removed due to minor allele threshold(s) (--maf/--max-maf/--mac/--max-mac). 476402 variants and 556 people pass filters and QC. Note: No phenotypes present. List of variant IDs written to GSE215221_auto_6.snplist .
n=$(wc -l < GSE215221_auto_6.snplist) # 476402
thres=$(bc -l <<< 0.05/$n)
plink --bfile GSE215221_auto_qcSamp --extract GSE215221_auto_6.snplist --pheno ../GSE215221.phecov --pheno-name CKD --1 --hwe $thres --make-bed --out GSE215221_auto_qcSamp_qcVar
# plink2 --bfile GSE215221_auto_qcSamp --extract GSE215221_auto_6.snplist --pheno ../GSE215221.phecov --pheno-name DKD --1 --keep-if DKD == control --hwe $thres --write-snplist --out GSE215221_auto_7
# plink2 --bfile GSE215221_auto_qcSamp --extract GSE215221_auto_7.snplist --make-bed --out GSE215221_auto_qcSamp_qcVar
PLINK v1.90b6.24 64-bit (6 Jun 2021) www.cog-genomics.org/plink/1.9/ (C) 2005-2021 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_qcSamp_qcVar.log. Options in effect: --1 --bfile GSE215221_auto_qcSamp --extract GSE215221_auto_6.snplist --hwe .00000010495337970873 --make-bed --out GSE215221_auto_qcSamp_qcVar --pheno ../GSE215221.phecov --pheno-name CKD1031497 MB RAM detected; reserving 515748 MB for main workspace. 643809 variants loaded from .bim file. 556 people (258 males, 298 females) loaded from .fam. 364 phenotype values present after --pheno. --extract: 476402 variants remaining. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 556 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is 0.998504. --hwe: 150 variants removed due to Hardy-Weinberg exact test. 476252 variants and 556 people pass filters and QC. Among remaining phenotypes, 66 are cases and 298 are controls. (192 phenotypes are missing.) --make-bed to GSE215221_auto_qcSamp_qcVar.bed + GSE215221_auto_qcSamp_qcVar.bim
cd ../PopuStruc
plink2 --bfile ../QC/GSE215221_auto_qcSamp_qcVar --pca biallelic-var-wts --out GSE215221_auto_qcSamp_qcVar
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022) www.cog-genomics.org/plink/2.0/ (C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_qcSamp_qcVar.log. Options in effect: --bfile ../QC/GSE215221_auto_qcSamp_qcVar --out GSE215221_auto_qcSamp_qcVar --pca biallelic-var-wtsStart time: Wed Mar 19 14:10:52 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 556 samples (298 females, 258 males; 556 founders) loaded from ../QC/GSE215221_auto_qcSamp_qcVar.fam. 476252 variants loaded from ../QC/GSE215221_auto_qcSamp_qcVar.bim. 1 binary phenotype loaded (95 cases, 298 controls). Calculating allele frequencies... done. Constructing GRM: done. Correcting for missingness... done. Extracting eigenvalues and eigenvectors... done. --pca: Variant weights written to GSE215221_auto_qcSamp_qcVar.eigenvec.var . --pca: Eigenvectors written to GSE215221_auto_qcSamp_qcVar.eigenvec , and eigenvalues written to GSE215221_auto_qcSamp_qcVar.eigenval . End time: Wed Mar 19 14:10:58 2025
plink --bfile ../QC/GSE215221_auto_qcSamp_qcVar --indep-pairwise 5000 500 0.2 --out GSE215221_auto_qcSamp_qcVar.cnt
plink2 --bfile ../QC/GSE215221_auto_qcSamp_qcVar --extract GSE215221_auto_qcSamp_qcVar.cnt.prune.in --pca biallelic-var-wts --out GSE215221_auto_qcSamp_qcVar.pruningCnt
PLINK v1.90b6.24 64-bit (6 Jun 2021) www.cog-genomics.org/plink/1.9/ (C) 2005-2021 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_qcSamp_qcVar.cnt.log. Options in effect: --bfile ../QC/GSE215221_auto_qcSamp_qcVar --indep-pairwise 5000 500 0.2 --out GSE215221_auto_qcSamp_qcVar.cnt1031497 MB RAM detected; reserving 515748 MB for main workspace. 476252 variants loaded from .bim file. 556 people (258 males, 298 females) loaded from .fam. 393 phenotype values loaded from .fam. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 556 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is 0.998506. 476252 variants and 556 people pass filters and QC. Among remaining phenotypes, 95 are cases and 298 are controls. (163 phenotypes are missing.) Pruned 27917 variants from chromosome 1, leaving 19499. Pruned 18088 variants from chromosome 2, leaving 18821. Pruned 15084 variants from chromosome 3, leaving 16183. Pruned 13425 variants from chromosome 4, leaving 14775. Pruned 12682 variants from chromosome 5, leaving 14284. Pruned 23921 variants from chromosome 6, leaving 13430. Pruned 11436 variants from chromosome 7, leaving 12828. Pruned 10699 variants from chromosome 8, leaving 12176. Pruned 16207 variants from chromosome 9, leaving 11297. Pruned 10987 variants from chromosome 10, leaving 12014. Pruned 10232 variants from chromosome 11, leaving 11061. Pruned 9762 variants from chromosome 12, leaving 11491. Pruned 6483 variants from chromosome 13, leaving 8534. Pruned 6466 variants from chromosome 14, leaving 7921. Pruned 6866 variants from chromosome 15, leaving 8068. Pruned 7243 variants from chromosome 16, leaving 8584. Pruned 6950 variants from chromosome 17, leaving 8093. Pruned 6201 variants from chromosome 18, leaving 7849. Pruned 5760 variants from chromosome 19, leaving 6082. Pruned 5119 variants from chromosome 20, leaving 6790. Pruned 3122 variants from chromosome 21, leaving 3882. Pruned 3821 variants from chromosome 22, leaving 4119. Pruning complete. 238471 of 476252 variants removed. Marker lists written to GSE215221_auto_qcSamp_qcVar.cnt.prune.in and GSE215221_auto_qcSamp_qcVar.cnt.prune.out .
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022) www.cog-genomics.org/plink/2.0/ (C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_qcSamp_qcVar.pruningCnt.log. Options in effect: --bfile ../QC/GSE215221_auto_qcSamp_qcVar --extract GSE215221_auto_qcSamp_qcVar.cnt.prune.in --out GSE215221_auto_qcSamp_qcVar.pruningCnt --pca biallelic-var-wtsStart time: Wed Mar 19 14:39:31 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 556 samples (298 females, 258 males; 556 founders) loaded from ../QC/GSE215221_auto_qcSamp_qcVar.fam. 476252 variants loaded from ../QC/GSE215221_auto_qcSamp_qcVar.bim. 1 binary phenotype loaded (95 cases, 298 controls). --extract: 237781 variants remaining. Calculating allele frequencies... done. 237781 variants remaining after main filters. Constructing GRM: done. Correcting for missingness... done. Extracting eigenvalues and eigenvectors... done. --pca: Variant weights written to GSE215221_auto_qcSamp_qcVar.pruningCnt.eigenvec.var . --pca: Eigenvectors written to GSE215221_auto_qcSamp_qcVar.pruningCnt.eigenvec , and eigenvalues written to GSE215221_auto_qcSamp_qcVar.pruningCnt.eigenval . End time: Wed Mar 19 14:39:34 2025
plink --bfile ../QC/GSE215221_auto_qcSamp_qcVar --indep-pairwise 5000kb 1 0.2 --out GSE215221_auto_qcSamp_qcVar.kb
plink2 --bfile ../QC/GSE215221_auto_qcSamp_qcVar --extract GSE215221_auto_qcSamp_qcVar.kb.prune.in --pca biallelic-var-wts --out GSE215221_auto_qcSamp_qcVar.pruningKb
PLINK v1.90b6.24 64-bit (6 Jun 2021) www.cog-genomics.org/plink/1.9/ (C) 2005-2021 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_qcSamp_qcVar.kb.log. Options in effect: --bfile ../QC/GSE215221_auto_qcSamp_qcVar --indep-pairwise 5000kb 1 0.2 --out GSE215221_auto_qcSamp_qcVar.kb1031497 MB RAM detected; reserving 515748 MB for main workspace. 476252 variants loaded from .bim file. 556 people (258 males, 298 females) loaded from .fam. 393 phenotype values loaded from .fam. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 556 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is 0.998506. 476252 variants and 556 people pass filters and QC. Among remaining phenotypes, 95 are cases and 298 are controls. (163 phenotypes are missing.) Pruned 27611 variants from chromosome 1, leaving 19805. Pruned 17969 variants from chromosome 2, leaving 18940. Pruned 14974 variants from chromosome 3, leaving 16293. Pruned 13324 variants from chromosome 4, leaving 14876. Pruned 12553 variants from chromosome 5, leaving 14413. Pruned 23813 variants from chromosome 6, leaving 13538. Pruned 11353 variants from chromosome 7, leaving 12911. Pruned 10636 variants from chromosome 8, leaving 12239. Pruned 16042 variants from chromosome 9, leaving 11462. Pruned 10912 variants from chromosome 10, leaving 12089. Pruned 10125 variants from chromosome 11, leaving 11168. Pruned 9686 variants from chromosome 12, leaving 11567. Pruned 6440 variants from chromosome 13, leaving 8577. Pruned 6427 variants from chromosome 14, leaving 7960. Pruned 6830 variants from chromosome 15, leaving 8104. Pruned 7181 variants from chromosome 16, leaving 8646. Pruned 6905 variants from chromosome 17, leaving 8138. Pruned 6167 variants from chromosome 18, leaving 7883. Pruned 5731 variants from chromosome 19, leaving 6111. Pruned 5081 variants from chromosome 20, leaving 6828. Pruned 3111 variants from chromosome 21, leaving 3893. Pruned 3794 variants from chromosome 22, leaving 4146. Pruning complete. 236665 of 476252 variants removed. Marker lists written to GSE215221_auto_qcSamp_qcVar.kb.prune.in and GSE215221_auto_qcSamp_qcVar.kb.prune.out .
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022) www.cog-genomics.org/plink/2.0/ (C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_qcSamp_qcVar.pruningKb.log. Options in effect: --bfile ../QC/GSE215221_auto_qcSamp_qcVar --extract GSE215221_auto_qcSamp_qcVar.kb.prune.in --out GSE215221_auto_qcSamp_qcVar.pruningKb --pca biallelic-var-wtsStart time: Wed Mar 19 15:42:36 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 556 samples (298 females, 258 males; 556 founders) loaded from ../QC/GSE215221_auto_qcSamp_qcVar.fam. 476252 variants loaded from ../QC/GSE215221_auto_qcSamp_qcVar.bim. 1 binary phenotype loaded (95 cases, 298 controls). --extract: 239587 variants remaining. Calculating allele frequencies... done. 239587 variants remaining after main filters. Constructing GRM: done. Correcting for missingness... done. Extracting eigenvalues and eigenvectors... done. --pca: Variant weights written to GSE215221_auto_qcSamp_qcVar.pruningKb.eigenvec.var . --pca: Eigenvectors written to GSE215221_auto_qcSamp_qcVar.pruningKb.eigenvec , and eigenvalues written to GSE215221_auto_qcSamp_qcVar.pruningKb.eigenval . End time: Wed Mar 19 15:42:39 2025
plink2 --bfile ../QC/GSE215221_auto_qcSamp_qcVar --freq --out GSE215221_auto_qcSamp_qcVar
awk 'NR==1{$(NF+1)="pseudoMAF"} NR>1{$(NF+1)=($5>0.5?$5:(1-$5))}1' GSE215221_auto_qcSamp_qcVar.afreq > GSE215221_auto_qcSamp_qcVar.afreq_
plink --bfile ../QC/GSE215221_auto_qcSamp_qcVar --clump GSE215221_auto_qcSamp_qcVar.afreq_ --clump-snp-field ID --clump-field pseudoMAF --clump-p1 1 --clump-p2 1 --out GSE215221_auto_qcSamp_qcVar
awk '{print $3}' GSE215221_auto_qcSamp_qcVar.clumped > GSE215221_auto_qcSamp_qcVar.clumped.in
plink2 --bfile ../QC/GSE215221_auto_qcSamp_qcVar --extract GSE215221_auto_qcSamp_qcVar.clumped.in --pca biallelic-var-wts --out GSE215221_auto_qcSamp_qcVar.clumping
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022) www.cog-genomics.org/plink/2.0/ (C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_qcSamp_qcVar.log. Options in effect: --bfile ../QC/GSE215221_auto_qcSamp_qcVar --freq --out GSE215221_auto_qcSamp_qcVarStart time: Wed Mar 19 14:45:23 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 556 samples (298 females, 258 males; 556 founders) loaded from ../QC/GSE215221_auto_qcSamp_qcVar.fam. 476252 variants loaded from ../QC/GSE215221_auto_qcSamp_qcVar.bim. 1 binary phenotype loaded (95 cases, 298 controls). Calculating allele frequencies... done. --freq: Allele frequencies (founders only) written to GSE215221_auto_qcSamp_qcVar.afreq . End time: Wed Mar 19 14:45:24 2025
#CHROM | ID | REF | ALT | ALT_FREQS | OBS_CT | pseudoMAF |
---|---|---|---|---|---|---|
1 | Affx-116685712 | A | G | 0.0432432 | 1110 | 0.956757 |
1 | Affx-11739511 | C | T | 0.0665468 | 1112 | 0.933453 |
1 | Affx-13498574 | C | T | 0.158228 | 1106 | 0.841772 |
PLINK v1.90b6.24 64-bit (6 Jun 2021) www.cog-genomics.org/plink/1.9/ (C) 2005-2021 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_qcSamp_qcVar.log. Options in effect: --bfile ../QC/GSE215221_auto_qcSamp_qcVar --clump GSE215221_auto_qcSamp_qcVar.afreq_ --clump-field pseudoMAF --clump-p1 1 --clump-p2 1 --clump-snp-field ID --out GSE215221_auto_qcSamp_qcVar1031497 MB RAM detected; reserving 515748 MB for main workspace. 476252 variants loaded from .bim file. 556 people (258 males, 298 females) loaded from .fam. 393 phenotype values loaded from .fam. Using 1 thread (no multithreaded calculations invoked). Before main variant filters, 556 founders and 0 nonfounders present. Calculating allele frequencies... done. Total genotyping rate is 0.998506. 476252 variants and 556 people pass filters and QC. Among remaining phenotypes, 95 are cases and 298 are controls. (163 phenotypes are missing.) --clump: 354931 clumps formed from 476252 top variants. Results written to GSE215221_auto_qcSamp_qcVar.clumped .
CHR | F | SNP | BP | P | TOTAL | NSIG | S05 | S01 | S001 | S0001 | SP2 |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | Affx-10839387 |
4723640 | 0.5 | 0 | 0 | 0 | 0 | 0 | 0 | NONE |
1 | 1 | Affx-5149459 |
14183669 | 0.5 | 1 | 1 | 0 | 0 | 0 | 0 | Affx-5150337(1) |
1 | 1 | Affx-5395008 |
15108230 | 0.5 | 0 | 0 | 0 | 0 | 0 | 0 | NONE |
... |
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022) www.cog-genomics.org/plink/2.0/ (C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_qcSamp_qcVar_clumping.log. Options in effect: --bfile ../QC/GSE215221_auto_qcSamp_qcVar --extract GSE215221_auto_qcSamp_qcVar.clumped.in --out GSE215221_auto_qcSamp_qcVar.clumping --pca biallelic-var-wtsStart time: Wed Mar 19 15:23:33 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 556 samples (298 females, 258 males; 556 founders) loaded from ../QC/GSE215221_auto_qcSamp_qcVar.fam. 476252 variants loaded from ../QC/GSE215221_auto_qcSamp_qcVar.bim. 1 binary phenotype loaded (95 cases, 298 controls). --extract: 354931 variants remaining. Calculating allele frequencies... done. 354931 variants remaining after main filters. Constructing GRM: done. Correcting for missingness... done. Extracting eigenvalues and eigenvectors... done. --pca: Variant weights written to GSE215221_auto_qcSamp_qcVar.clumping.eigenvec.var . --pca: Eigenvectors written to GSE215221_auto_qcSamp_qcVar.clumping.eigenvec , and eigenvalues written to GSE215221_auto_qcSamp_qcVar.clumping.eigenval . End time: Wed Mar 19 15:23:37 2025
curl -o LRLD_GRCh38.txt https://raw.githubusercontent.com/HsinChouYang/TPMI_Cohort/refs/heads/main/Data%20Pre-processing/Genetic%20Data/LRLD.txt
plink2 --bfile ../QC/GSE215221_auto_qcSamp_qcVar --exclude range LRLD_GRCh38.txt --pca biallelic-var-wts --out GSE215221_auto_qcSamp_qcVar.rmLRLD
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022) www.cog-genomics.org/plink/2.0/ (C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_qcSamp_qcVar.rmLRLD.log. Options in effect: --bfile ../QC/GSE215221_auto_qcSamp_qcVar --exclude range LRLD_GRCh38.txt --out GSE215221_auto_qcSamp_qcVar.rmLRLD --pca biallelic-var-wtsStart time: Wed Mar 19 15:47:23 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 556 samples (298 females, 258 males; 556 founders) loaded from ../QC/GSE215221_auto_qcSamp_qcVar.fam. 476252 variants loaded from ../QC/GSE215221_auto_qcSamp_qcVar.bim. 1 binary phenotype loaded (95 cases, 298 controls). --exclude bed1: 20616 variants excluded. Calculating allele frequencies... done. 455636 variants remaining after main filters. Constructing GRM: done. Correcting for missingness... done. Extracting eigenvalues and eigenvectors... done. --pca: Variant weights written to GSE215221_auto_qcSamp_qcVar.rmLRLD.eigenvec.var . --pca: Eigenvectors written to GSE215221_auto_qcSamp_qcVar.rmLRLD.eigenvec , and eigenvalues written to GSE215221_auto_qcSamp_qcVar.rmLRLD.eigenval . End time: Wed Mar 19 15:47:28 2025
for k in {2..10}; do
admixture --cv GSE215221_auto_qcSamp_qcVar.cnt.bed $k | tee log${k}.out
done
plink2 --bfile ../QC/GSE215221_auto_qcSamp_qcVar --maf 0.1 --indep-pairwise 5000kb 1 0.2 --out GSE215221_auto_qcSamp_qcVar.maf01_pK
plink2 --bfile ../QC/GSE215221_auto_qcSamp_qcVar --extract GSE215221_auto_qcSamp_qcVar.maf01_pK.prune.in --pca biallelic-var-wts --out GSE215221_auto_qcSamp_qcVar.maf01_pruningKb
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022) www.cog-genomics.org/plink/2.0/ (C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_qcSamp_qcVar.maf01_pK.log. Options in effect: --bfile ../QC/GSE215221_auto_qcSamp_qcVar --indep-pairwise 5000kb 1 0.2 --maf 0.1 --out GSE215221_auto_qcSamp_qcVar.maf01_pKStart time: Mon Mar 24 14:48:14 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 556 samples (298 females, 258 males; 556 founders) loaded from ../QC/GSE215221_auto_qcSamp_qcVar.fam. 476252 variants loaded from ../QC/GSE215221_auto_qcSamp_qcVar.bim. 1 binary phenotype loaded (95 cases, 298 controls). Calculating allele frequencies... done. 258045 variants removed due to allele frequency threshold(s) (--maf/--max-maf/--mac/--max-mac). 218207 variants remaining after main filters. --indep-pairwise (12 compute threads): 135759/218207 variants removed. Variant lists written to GSE215221_auto_qcSamp_qcVar.maf01_pK.prune.in and GSE215221_auto_qcSamp_qcVar.maf01_pK.prune.out . End time: Mon Mar 24 14:48:15 2025
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022) www.cog-genomics.org/plink/2.0/ (C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221_auto_qcSamp_qcVar.maf01_pruningKb.log. Options in effect: --bfile ../QC/GSE215221_auto_qcSamp_qcVar --extract GSE215221_auto_qcSamp_qcVar.maf01_pK.prune.in --out GSE215221_auto_qcSamp_qcVar.maf01_pruningKb --pca biallelic-var-wtsStart time: Mon Mar 24 14:48:47 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 556 samples (298 females, 258 males; 556 founders) loaded from ../QC/GSE215221_auto_qcSamp_qcVar.fam. 476252 variants loaded from ../QC/GSE215221_auto_qcSamp_qcVar.bim. 1 binary phenotype loaded (95 cases, 298 controls). --extract: 82448 variants remaining. Calculating allele frequencies... done. 82448 variants remaining after main filters. Constructing GRM: done. Correcting for missingness... done. Extracting eigenvalues and eigenvectors... done. --pca: Variant weights written to GSE215221_auto_qcSamp_qcVar.maf01_pruningKb.eigenvec.var . --pca: Eigenvectors written to GSE215221_auto_qcSamp_qcVar.maf01_pruningKb.eigenvec , and eigenvalues written to GSE215221_auto_qcSamp_qcVar.maf01_pruningKb.eigenval . End time: Mon Mar 24 14:48:48 2025
cd ../ASSOC
plink2 --bfile ../QC/GSE215221_auto_qcSamp_qcVar --covar ../GSE215221.phecov --write-covar --out GSE215221
paste GSE215221.cov ../PopuStruc/GSE215221_auto_qcSamp_qcVar.maf01_pruningKb.eigenvec > GSE215221_wPC.phecov
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022) www.cog-genomics.org/plink/2.0/ (C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221.log. Options in effect: --bfile ../QC/GSE215221_auto_qcSamp_qcVar --covar ../GSE215221.phecov --out GSE215221 --write-covarStart time: Mon Mar 24 15:37:43 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 556 samples (298 females, 258 males; 556 founders) loaded from ../QC/GSE215221_auto_qcSamp_qcVar.fam. 1 binary phenotype loaded (95 cases, 298 controls). 4 covariates loaded from ../GSE215221.phecov. Covariates written to GSE215221.cov . End time: Mon Mar 24 15:37:43 2025
#FID |
IID | sex | CKD | DKD | DM | #FID | IID | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
C0031950 | C0031950 | 2 | 0 | 0 | 0 | C0031950 | C0031950 | -0.00526697 | -0.0467302 | 0.0304832 | 0.0234176 | 0.0203419 | 0.0892954 | -0.00586365 | -0.0329808 | -0.0694723 | 0.0102252 |
C0040097 | C0040097 | 1 | 0 | 0 | 0 | C0040097 | C0040097 | 0.0243103 | -0.0240787 | 0.0148346 | -0.0177402 | -0.00743113 | 0.0842174 | -0.0279906 | -0.0309194 | -0.0671104 | -0.0693914 |
C0040101 | C0040101 | 1 | 0 | 0 | 0 | C0040101 | C0040101 | -0.0348148 | 0.0201305 | -0.0819504 | -0.0114334 | -0.071228 | 0.0479485 | -0.00806616 | 0.00514522 | -0.0321499 | 0.0280298 |
... |
plink2 --bfile ../QC/GSE215221_auto_qcSamp_qcVar --pheno GSE215221_wPC.phecov --pheno-name CKD --1 --covar GSE215221_wPC.phecov --covar-name sex,PC1-PC10 --covar-variance-standardize --glm no-x-sex hide-covar cols=chrom,pos,ref,alt,a1freq,a1freqcc,gcountcc,nobs,orbeta,se,tz,p,firth,err --out GSE215221
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022) www.cog-genomics.org/plink/2.0/ (C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to GSE215221.log. Options in effect: --1 --bfile ../QC/GSE215221_auto_qcSamp_qcVar --covar GSE215221_wPC.phecov --covar-name sex,PC1-PC10 --covar-variance-standardize --glm no-x-sex hide-covar cols=chrom,pos,ref,alt,a1freq,a1freqcc,gcountcc,nobs,orbeta,se,tz,p,firth,err --out GSE215221 --pheno GSE215221_wPC.phecov --pheno-name CKDStart time: Fri Mar 28 14:19:01 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 556 samples (298 females, 258 males; 556 founders) loaded from ../QC/GSE215221_auto_qcSamp_qcVar.fam. 476252 variants loaded from ../QC/GSE215221_auto_qcSamp_qcVar.bim. 1 binary phenotype loaded (66 cases, 298 controls). 11 covariates loaded from GSE215221_wPC.phecov. --covar-variance-standardize: 11 covariates transformed. Calculating allele frequencies... done. Warning: --glm remaining case count is less than 10x predictor count for phenotype 'CKD'. --glm logistic-Firth hybrid regression on phenotype 'CKD': done. Results written to GSE215221.CKD.glm.logistic.hybrid . End time: Fri Mar 28 14:19:02 2025
#CHROM | POS | ID | REF | ALT | A1 | CASE_NON_A1_CT | CASE_HET_A1_CT | CASE_HOM_A1_CT | CTRL_NON_A1_CT | CTRL_HET_A1_CT | CTRL_HOM_A1_CT | A1_FREQ | A1_CASE_FREQ | A1_CTRL_FREQ | FIRTH? | OBS_CT | OR | LOG(OR)_SE | Z_STAT | P | ERRCODE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 607880 | Affx-116685712 | A | G | G | 88 | 7 | 0 | 270 | 27 | 0 | 0.0433673 | 0.0368421 | 0.0454545 | N | 392 | 0.79787 | 0.450643 | -0.501084 | 0.616312 | . |
1 | 629241 | Affx-11739511 | C | T | T | 82 | 13 | 0 | 264 | 34 | 0 | 0.0597964 | 0.0684211 | 0.057047 | N | 393 | 1.27972 | 0.363137 | 0.679199 | 0.497012 | . |
1 | 784860 | Affx-13498574 | C | T | T | 63 | 27 | 4 | 213 | 82 | 3 | 0.156888 | 0.18617 | 0.147651 | N | 392 | 1.41243 | 0.239662 | 1.44083 | 0.149631 | . |
... |
plink2 --bfile ../QC/GSE215221_auto_qcSamp_qcVar --export bgen-1.2 bits=8 --out ../QC/GSE215221_auto_qcSamp_qcVar
sed -i '1s/^#//' GSE215221_wPC.phecov # regenie requires header of phenotype file start with: FID IID
PLINK v2.00a3.7LM 64-bit Intel (24 Oct 2022) www.cog-genomics.org/plink/2.0/ (C) 2005-2022 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to ../QC/GSE215221_auto_qcSamp_qcVar.log. Options in effect: --bfile ../QC/GSE215221_auto_qcSamp_qcVar --export bgen-1.2 bits=8 --out ../QC/GSE215221_auto_qcSamp_qcVarStart time: Fri Mar 28 14:15:45 2025 1031497 MiB RAM detected; reserving 515748 MiB for main workspace. Using up to 128 threads (change this with --threads). 556 samples (298 females, 258 males; 556 founders) loaded from ../QC/GSE215221_auto_qcSamp_qcVar.fam. 476252 variants loaded from ../QC/GSE215221_auto_qcSamp_qcVar.bim. 1 binary phenotype loaded (66 cases, 298 controls). Writing ../QC/GSE215221_auto_qcSamp_qcVar.bgen ... done. Writing ../QC/GSE215221_auto_qcSamp_qcVar.sample ... done. End time: Fri Mar 28 14:15:45 2025
# step 1
regenie \
--step 1 \
--bed ../QC/GSE215221_auto_qcSamp_qcVar \
--phenoFile GSE215221_wPC.phecov \
--phenoCol CKD \
--covarFile GSE215221_wPC.phecov \
--covarColList sex,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \
--bsize 100 \
--bt \
--out GSE215221_auto_qcSamp_qcVar.CKD.regenie.step1
# step 2
regenie \
--step 2 \
--bgen ../QC/GSE215221_auto_qcSamp_qcVar.bgen \
--phenoFile GSE215221_wPC.phecov \
--phenoCol CKD \
--covarFile GSE215221_wPC.phecov \
--covarColList sex,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \
--bsize 200 \
--bt \
--firth --approx \
--pThresh 0.01 \
--pred GSE215221_auto_qcSamp_qcVar.CKD.regenie.step1_pred.list \
--out GSE215221_auto_qcSamp_qcVar.CKD.regenie.step2
Start time: Fri Mar 28 14:35:31 2025|===========================| | REGENIE v4.1.gz | |===========================|
Copyright (c) 2020-2024 Joelle Mbatchou, Andrey Ziyatdinov and Jonathan Marchini. Distributed under the MIT License. Compiled with Boost Iostream library.
Log of output saved in file : GSE215221_auto_qcSamp_qcVar.CKD.regenie.step1.log
Options in effect: --step 1 \ --bed ../QC/GSE215221_auto_qcSamp_qcVar \ --phenoFile GSE215221_wPC.phecov \ --phenoCol CKD \ --covarFile GSE215221_wPC.phecov \ --covarColList sex,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \ --bsize 100 \ --bt \ --out GSE215221_auto_qcSamp_qcVar.CKD.regenie.step1
Fitting null model * bim : [../QC/GSE215221_auto_qcSamp_qcVar.bim] n_snps = 476252 * fam : [../QC/GSE215221_auto_qcSamp_qcVar.fam] n_samples = 556 * bed : [../QC/GSE215221_auto_qcSamp_qcVar.bed] * phenotypes : [GSE215221_wPC.phecov] n_pheno = 1 -dropping observations with missing values at any of the phenotypes -number of phenotyped individuals with no missing data = 364 * covariates : [GSE215221_wPC.phecov] n_cov = 11 -number of individuals with covariate data = 556 * number of individuals used in analysis = 364 * case-control counts for each trait: - 'CKD': 66 cases and 298 controls -fitting null logistic regression on binary phenotypes...done (0ms) -residualizing and scaling phenotypes...done (0ms) -WARNING: Sample size is less than 5,000 so using LOOCV instead of 5-fold CV. * # threads : [127] * block size : [100] * # blocks : [4775] for 476252 variants * # CV folds : [364] * ridge data_l0 : [ 5 : 0.01 0.25 0.5 0.75 0.99 ] * ridge data_l1 : [ 5 : 0.01 0.25 0.5 0.75 0.99 ] * approximate memory usage : 2GB * setting memory...done
Chromosome 1 block [1] : 100 snps (7ms) -residualizing and scaling genotypes...done (3ms) -calc working matrices...done (1ms) -calc level 0 ridge...done (3ms) block [2] : 100 snps (0ms) -residualizing and scaling genotypes...done (2ms) -calc working matrices...done (1ms) -calc level 0 ridge...done (3ms) ... block [4775] : 40 snps (8ms) -residualizing and scaling genotypes...done (3ms) -calc working matrices...done (0ms) -calc level 0 ridge...done (0ms)
Level 1 ridge with logistic regression... -on phenotype 1 (CKD)... done (8753050ms) B Output ------ phenotype 1 (CKD) : 0.01 : Rsq = 0.0709333, MSE = 0.137931, -logLik/N = 0.437073 0.25 : Rsq = 0.480708, MSE = 0.0994974, -logLik/N = 0.331347 0.5 : Rsq = 0.756199, MSE = 0.0673048, -logLik/N = 0.25897 0.75 : Rsq = 0.889101, MSE = 0.0427481, -logLik/N = 0.213876<- min value 0.99 : Rsq = 0.124539, MSE = 0.553264, -logLik/N = 1.91917 * making predictions...writing LOCO predictions...done (2258965ms)
List of blup files written to: [GSE215221_auto_qcSamp_qcVar.CKD.regenie.step1_pred.list]
Elapsed time : 11189.1s End time: Fri Mar 28 17:42:00 2025
Start time: Mon Mar 31 12:59:33 2025|===========================| | REGENIE v4.1.gz | |===========================|
Copyright (c) 2020-2024 Joelle Mbatchou, Andrey Ziyatdinov and Jonathan Marchini. Distributed under the MIT License. Compiled with Boost Iostream library.
Log of output saved in file : GSE215221_auto_qcSamp_qcVar.CKD.regenie.step2.log
Options in effect: --step 2 \ --bgen ../QC/GSE215221_auto_qcSamp_qcVar.bgen \ --phenoFile GSE215221_wPC.phecov \ --phenoCol CKD \ --covarFile GSE215221_wPC.phecov \ --covarColList sex,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10 \ --bsize 200 \ --bt \ --firth \ --approx \ --pThresh 0.01 \ --pred GSE215221_auto_qcSamp_qcVar.CKD.regenie.step1_pred.list \ --out GSE215221_auto_qcSamp_qcVar.CKD.regenie.step2
Association testing mode with fast multithreading using OpenMP * bgen : [../QC/GSE215221_auto_qcSamp_qcVar.bgen] -summary : bgen file (v1.2 layout, zlib compressed) with 556 named samples and 476252 variants with 8-bit encoding. * phenotypes : [GSE215221_wPC.phecov] n_pheno = 1 -dropping observations with missing values at any of the phenotypes -number of phenotyped individuals with no missing data = 364 * covariates : [GSE215221_wPC.phecov] n_cov = 11 -number of individuals with covariate data = 556 * number of individuals used in analysis = 364 * case-control counts for each trait: - 'CKD': 66 cases and 298 controls * LOCO predictions : [GSE215221_auto_qcSamp_qcVar.CKD.regenie.step1_pred.list] -file [/data/home/jiawei/Projects/demo/ASSOC/GSE215221_auto_qcSamp_qcVar.CKD.regenie.step1_1.loco] for phenotype 'CKD' * # threads : [127] * block size : [200] * # blocks : [2394] * approximate memory usage : 113MB * using minimum MAC of 5 (variants with lower MAC are ignored) * using fast Firth correction for logistic/cox regression p-values less than 0.01
Chromosome 1 [238 blocks in total] -reading loco predictions for the chromosome...done (0ms) -fitting null logistic regression on binary phenotypes...done (0ms) -fitting null Firth logistic regression on binary phenotypes...done (0ms) block [1/2394] : done (6ms) block [2/2394] : done (0ms) block [3/2394] : done (0ms) ... block [2392/2394] : done (0ms) block [2393/2394] : done (0ms) block [2394/2394] : done (0ms)
Association results stored separately for each trait in files : * [GSE215221_auto_qcSamp_qcVar.CKD.regenie.step2_CKD.regenie]
Number of tests with Firth correction : 55 Number of failed tests : (0/55) Number of ignored tests due to low MAC : 282
Elapsed time : 1.83822s End time: Mon Mar 31 12:59:35 2025
CHROM | GENPOS | ID | ALLELE0 | ALLELE1 | A1FREQ | INFO | N | TEST | BETA | SE | CHISQ | LOG10P | EXTRA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 607880 | Affx-116685712 | A | G | 0.0482094 | 1 | 363 | ADD | 0.371931 | 0.512395 | 0.526884 | 0.329829 | NA |
1 | 629241 | Affx-11739511 | C | T | 0.0631868 | 1 | 364 | ADD | 0.41202 | 0.446452 | 0.8517 | 0.448462 | NA |
1 | 784860 | Affx-13498574 | C | T | 0.146978 | 1 | 364 | ADD | 0.0470915 | 0.326839 | 0.0207595 | 0.0528428 | NA |
... |
cd ../ANNOVAR
prepare input of ANNOVAR
awk '$21<1e-5 {print $1,$2,$2,$4,$5,$3}' ../ASSOC/GSE215221.CKD.glm.logistic.hybrid > GSE215221.CKD.avinput
# in GSE215221.CKD.glm.logistic.hybrid, the p-value is in the 21th column
1 | 11112428 | 11112428 | A | G | Affx-4191030 |
2 | 233772770 | 233772770 | C | T | Affx-19507335 |
2 | 233772898 | 233772898 | C | G | Affx-19507337 |
2 | 233772999 | 233772999 | C | G | Affx-19507343 |
6 | 77637820 | 77637820 | A | G | Affx-75366404 |
13 | 54109045 | 54109045 | G | A | Affx-9497191 |
../tools/annovar/table_annovar.pl GSE215221.CKD.avinput ../tools/annovar/humandb/ -buildver hg38 -out GSE215221.CKD -remove \
-protocol refGene,cytoBand,EAS.sites.2015_08,exac03,avsnp151,dbnsfp47c,gwasCatalog,clinvar_20240611,icgc28,cosmic70 -operation gx,r,f,f,f,f,r,f,f,f \
-nastring . -csvout -polish -xref ../tools/annovar/example/gene_fullxref.txt
cd ../MAGMA
select CHS (Han Chinese South) for reference data
plink --bfile ../tools/magma_v1.10_static/g1000_eas/g1000_eas --keep ../tools/magma_v1.10_static/g1000_subpop/eas_chs.keep --maf 1e-5 --make-bed --out g1000_eas_chs
create annotation file for the data
plink2 --bfile ../QC/GSE215221_auto_qcSamp_qcVar --update-name ../marker_info.txt 2 1 --make-just-bim --out GSE215221_auto_qcSamp_qcVar
grep -v NA GSE215221_auto_qcSamp_qcVar.bim > tmp && mv tmp GSE215221_auto_qcSamp_qcVar.bim
../tools/magma_v1.10_static/magma --annotate \
--snp-loc GSE215221_auto_qcSamp_qcVar.bim \
--gene-loc ../tools/magma_v1.10_static/NCBI38.gene.loc \
--out GSE215221_auto_qcSamp_qcVar
148398 | 1:925741:944581 | rs139858754 rs6680268 rs145857933 rs2340588 rs2340587 rs2880024 rs76964081 rs113171913 rs2340591 rs765383278 |
26155 | 1:944203:959299 | rs2272757 rs10465242 rs117770195 rs13303010 |
339451 | 1:959952:965720 | rs6696971 rs3935066 |
... |
prepare input of MAGMA
awk '{print $3,$21,$17}' ../ASSOC/GSE215221.CKD.glm.logistic.hybrid > GSE215221.CKD.txt
# MAGMA needs only 'SNP' and 'P' columns of rsnumber and p-value (can also provide 'OBS_CT' of sample size), in this case, we should convert affy id to rsnumber (Can use R/Python to do it)
awk 'NR==FNR {a[$1]=$0; next} $1 in a {print a[$1], $0}' GSE215221.CKD.txt ../marker_info.txt | awk 'BEGIN {print "ID P OBS_CT affy_snp_id SNP Ref_Allele Alt_Allele"} {print}' > GSE215221.CKD.minput
ID | P | OBS_CT | affy_snp_id | SNP | Ref_Allele | Alt_Allele |
---|---|---|---|---|---|---|
Affx-2643493 | 0.307699 | 364 | Affx-2643493 | rs10466213 | C | G |
Affx-7455925 | 0.60497 | 361 | Affx-7455925 | rs10770943 | C | G |
Affx-3102334 | 0.748745 | 362 | Affx-3102334 | rs10827221 | G | C |
... |
../tools/magma_v1.10_static/magma --bfile g1000_eas_chs \
--gene-annot GSE215221_auto_qcSamp_qcVar.genes.annot \
--pval GSE215221.CKD.minput ncol=OBS_CT \
--out GSE215221.CKD
Welcome to MAGMA v1.10 (linux/s) Using flags: --bfile g1000_eas_chs --gene-annot GSE215221_auto_qcSamp_qcVar.genes.annot --pval GSE215221.CKD.minput ncol=OBS_CT --out GSE215221.CKDStart time is 15:58:52, Wednesday 02 Apr 2025
Loading PLINK-format data... Reading file g1000_eas_chs.fam... 105 individuals read Reading file g1000_eas_chs.bim... 11707649 SNPs read Preparing file g1000_eas_chs.bed...
Reading SNP p-values from file GSE215221.CKD.minput... detected 7 variables in file using variable: SNP (SNP id) using variable: P (p-value) using variable: OBS_CT (sample size; discarding SNPs with N < 50) read 476253 lines from file, containing valid SNP p-values for 424499 SNPs in data (89.13% of lines, 3.626% of SNPs in data) Loading gene annotation from file GSE215221_auto_qcSamp_qcVar.genes.annot... 16137 gene definitions read from file found 15956 genes containing valid SNPs in genotype data
Starting gene analysis... using model: SNPwise-mean writing gene analysis results to file GSE215221.CKD.genes.out writing intermediate output to file GSE215221.CKD.genes.raw
End time is 15:59:08, Wednesday 02 Apr 2025 (elapsed: 00:00:16)
GENE | CHR | START | STOP | NSNPS | NPARAM | N | ZSTAT | P |
---|---|---|---|---|---|---|---|---|
148398 | 1 | 925741 | 944581 | 8 | 4 | 364 | -0.67723 | 0.75087 |
26155 | 1 | 944203 | 959299 | 4 | 2 | 363 | 1.3106 | 0.094991 |
339451 | 1 | 959952 | 965720 | 2 | 1 | 364 | 1.085 | 0.13895 |
... |