Quality control of Base and Target Data #4

Open
jackosullivanoxford opened this issue Apr 2, 2019 · 11 comments

Comments

@jackosullivanoxford
Collaborator

No description provided.

@jackosullivanoxford
Collaborator Author

jackosullivanoxford commented Apr 2, 2019

This issue describes the quality control measures needed before computing polygenic risk scores (PRS). I have followed this bioRxiv guide.

The issues to consider are as follows:

  1. Standard GWAS quality control measures (e.g. removing SNPs with a low genotyping rate, low minor allele frequency or low imputation ‘info score’, and individuals with a low genotyping rate) for both the GWAS summary statistics (GWAS SS) and the target data (in my case UKBB).
  2. File transfer: Ensure that files have not been corrupted during transfer. Use md5sum to do this.
  3. Genome Build: Ensure that the base and target data SNPs have genomic positions assigned on the same genome build [32]. LiftOver (PMID: 20959295) is an excellent tool for standardizing genome build across different data sets.
  4. Effect allele: Determine which allele in the GWAS SS is the effect allele.
  5. Ambiguous SNPs: If the base and target data were generated using different genotyping chips and the chromosome strand (+/-) for either is unknown, then it is not possible to match ambiguous SNPs (i.e. those with complementary alleles, either C/G or A/T) across the data sets, because it will be unknown whether the base and target data are referring to the same allele or not. While allele frequencies can be used to infer which alleles match [34], we recommend removing all ambiguous SNPs.
  6. Duplicate SNPs: Ensure that there are no duplicated SNPs in either the base or target data.
  7. Sex chromosomes: Do not include sex chromosomes.
  8. Sample overlap: Do any of the individuals in the GWAS SS overlap with individuals in the UKBB?
  9. Relatedness: As per the LDpred wiki (https://github.com/bvilhjal/ldpred/wiki/Q-and-A): "Relatedness in the validation/target sample is not a concern, however it is a concern for the LD reference panel." Related individuals have been removed from the LDpred reference panel.
  10. Heritability check: A critical factor in the accuracy and predictive power of PRS is the power of the base GWAS data [4], so to avoid reaching misleading conclusions from the application of PRS we recommend first performing a heritability check of the base GWAS data. We suggest using software such as LD Score regression [8] or LDAK [37] to estimate chip heritability from the GWAS summary statistics, and recommend caution in interpreting PRS analyses that are performed on GWAS with a low chip-heritability estimate (e.g. h²_SNP < 0.05).

@jackosullivanoxford
Collaborator Author

Quality control issue 1: Standard GWAS quality control

-The GWAS SS from Malik et al (PMID: 29531354) underwent standard quality control as outlined by Winkler et al (PMID: 24762786); note that this is quality control at the meta-analysis level. Malik et al also applied quality filters at the individual study level: "Individual study-level filters were set to remove extreme effect values (β > 5 or β <−5), rare SNPs (MAF <0.01) and variants with low imputation accuracy (oevar_imp or info score <0.5). The effective allele count was defined as twice the product of the MAF, imputation accuracy (r2, info score or oevar_imp), and number of cases. Variants with an effective allele count <10 were excluded."

-Our target data (UKBB) were created in PLINK, which, as per the bioRxiv guide, is an appropriate and standard tool for quality control.
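
For reference, the study-level filters quoted above can be sketched in Python as follows. This only illustrates the thresholds stated in the Malik et al quote and is not their actual pipeline; the function and argument names (`effective_allele_count`, `passes_study_level_filters`, `beta`, `maf`, `info`, `n_cases`) are hypothetical.

```python
def effective_allele_count(maf, info, n_cases):
    """Twice the product of MAF, imputation accuracy and number of cases."""
    return 2 * maf * info * n_cases


def passes_study_level_filters(beta, maf, info, n_cases):
    """Return True if a variant survives the filters in the quoted passage."""
    if beta > 5 or beta < -5:   # extreme effect values
        return False
    if maf < 0.01:              # rare SNPs
        return False
    if info < 0.5:              # low imputation accuracy
        return False
    if effective_allele_count(maf, info, n_cases) < 10:
        return False
    return True
```

For example, a variant with MAF 0.02 and info 0.5 in a study with 100 cases has an effective allele count of 2 and would be excluded.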

@jackosullivanoxford
Collaborator Author

Quality control issue 2: File Transfer

-I have done this using md5sum and the file was not corrupted during transfer.
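
For anyone who wants to script the same check, here is a minimal Python sketch that should give the same digest as `md5sum`. The function names and the idea of passing in the checksum published by the data provider are assumptions for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_ok(path, expected_md5):
    """Compare a local file's checksum with the checksum from the source."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Reading in chunks keeps memory use flat, which matters for summary statistics files with millions of rows.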

@jackosullivanoxford
Collaborator Author

Quality control issue 3: Genome build

-To do

@jackosullivanoxford
Collaborator Author

Quality control issue 4: Effect allele

-Make sure that the effect allele in the GWAS SS is clear. Done: we have arranged the GWAS SS into the columns that LDpred requires. The list below gives the format required for step 1 of LDpred (left) and the corresponding column in the MEGASTROKE GWAS SS (right):

Required format for LDpred - MEGASTROKE
chr - (Not present)
pos - (Not present)
ref - Allele2
alt - Allele1 (this is the effect allele)
reffrq (frequency of the ref allele) - (1 - Freq1) (*Freq1 is the frequency of the effect allele)
info - (Not present), but info is a dummy variable that can be set to 1
rs - Markername
pval - P-value
effalt - Effect (effect of Allele1)
(Not present) - StdErr

*The original location of the above table is here.
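
The mapping in the table can be sketched as a small Python function. This is an illustration only, not the script we actually used; the function name is hypothetical, the MEGASTROKE column names are copied from the table (their exact spelling in the file may differ), and using `None` for the missing `chr` and `pos` columns is a placeholder assumption.

```python
def megastroke_to_ldpred(row):
    """Map one MEGASTROKE row (a dict keyed by column name) to LDpred columns."""
    return {
        "chr": None,                           # not present in the GWAS SS
        "pos": None,                           # not present in the GWAS SS
        "ref": row["Allele2"],
        "alt": row["Allele1"],                 # effect allele
        "reffrq": 1.0 - float(row["Freq1"]),   # Freq1 is the effect-allele frequency
        "info": 1,                             # dummy value, as noted in the table
        "rs": row["Markername"],
        "pval": float(row["P-value"]),
        "effalt": float(row["Effect"]),        # effect of Allele1
    }
```

StdErr has no counterpart in the required format, so it is dropped here.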

@jackosullivanoxford
Collaborator Author

Quality control issue 5: Ambiguous SNPs

-We removed all SNPs that didn’t have identical rsIDs (see /oak/stanford/groups/euan/projects/ukbb/code/anna_code/risk_scores/Merging_bim_GWAS_SS.py).
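
For comparison, the definition of ambiguous SNPs in item 5 of the checklist (complementary alleles, either C/G or A/T) can be sketched as below. This is not taken from Merging_bim_GWAS_SS.py; the function names and the `(rsid, allele1, allele2)` tuple layout are assumptions for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_ambiguous(allele1, allele2):
    """True for strand-ambiguous pairs, i.e. A/T or C/G SNPs."""
    a1, a2 = allele1.upper(), allele2.upper()
    # Multi-base alleles (indels) are not in COMPLEMENT and return False.
    return COMPLEMENT.get(a1) == a2


def drop_ambiguous(snps):
    """Keep only (rsid, allele1, allele2) tuples that are not ambiguous."""
    return [snp for snp in snps if not is_ambiguous(snp[1], snp[2])]
```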

@jackosullivanoxford
Collaborator Author

Quality control issue 6: Duplicate SNPs

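-I have checked and there are no duplicate SNPs.
Code to do this (R):
dup <- duplicated(ldpred_ss$rs)  # TRUE for every repeat of an rsID after its first occurrence
table(dup)["TRUE"]   # Gives NA, i.e. no duplicates
table(dup)["FALSE"]  # Gives the total number of SNPs: 7,640,175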
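
If duplicates ever do turn up, it helps to know which rsIDs are affected rather than only how many. A minimal Python sketch, assuming the rsIDs have been read into a list (the function name is hypothetical):

```python
from collections import Counter


def duplicated_ids(rsids):
    """Return the rsIDs that occur more than once, with their counts."""
    counts = Counter(rsids)
    return {rsid: n for rsid, n in counts.items() if n > 1}
```

An empty dictionary corresponds to the `NA` result from the R check.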

@jackosullivanoxford
Collaborator Author

Quality control issue 7: Sex chromosomes

-I have checked our PLINK files and we have not included chromosome 23. The relevant PLINK files are located here: /oak/stanford/groups/euan/projects/ukbb/code/anna_code/risk_scores/step1_inputs
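
One way to script this check is to scan the first column of each .bim file, which holds the chromosome code. The sketch below is an illustration rather than the check I actually ran; the function names are hypothetical, and the set of codes assumes PLINK's human defaults (23/X, 24/Y, 25/XY, 26/MT).

```python
from collections import Counter

# PLINK's numeric and letter codes for non-autosomal chromosomes.
NON_AUTOSOMAL = {"23", "24", "25", "26", "X", "Y", "XY", "MT"}


def chromosome_counts(bim_lines):
    """Count variants per chromosome code; the code is the first .bim column."""
    return Counter(line.split()[0] for line in bim_lines if line.strip())


def non_autosomal_codes(bim_lines):
    """Return the sorted non-autosomal chromosome codes found in the lines."""
    return sorted(set(chromosome_counts(bim_lines)) & NON_AUTOSOMAL)
```

Passing an open .bim file handle to `non_autosomal_codes` should return an empty list when only autosomes are present.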

@jackosullivanoxford
Collaborator Author

Quality control issue 8: Sample overlap

-I have checked this and there is no overlap.

@jackosullivanoxford
Collaborator Author

jackosullivanoxford commented Apr 2, 2019

Quality control issue 9: Relatedness

-As per the LDpred wiki (https://github.com/bvilhjal/ldpred/wiki/Q-and-A): "Relatedness in the validation/target sample is not a concern, however it is a concern for the LD reference panel." Related individuals have been removed from the LDpred reference panel.

@jackosullivanoxford
Collaborator Author

jackosullivanoxford commented Apr 2, 2019

Quality control issue 10: Heritability check

-To do
