Calculating Population Frequency for Structural Variants in Whole Exome Sequencing Data from 147 Trios. #53

Open
poddarharsh15 opened this issue Dec 5, 2024 · 2 comments

Comments

@poddarharsh15

Hi @jasonbhn

I am working with Whole Exome Sequencing (WES) data from 147 trios (441 samples total). I have merged VCF files containing structural variants (SVs) for these samples and now need to calculate population frequencies to identify common and rare variants. Specifically, I want to focus on variants present in 10% or less of the population. Can I use your pipeline to estimate this frequency and later rank my variants?
What tools and methods are best suited for calculating SV frequencies in WES data? (I tried BCFtools, without success.)
What filtering criteria should be applied to ensure high-quality SV calls before frequency calculation?

Thanks in advance,
HP.

@jasonbhn
Owner

Hi,
Because the data contain trios, I would not recommend estimating AF directly from all 441 individuals: children share alleles with their parents, so the samples are not independent. Rather, I would use something like SVAFotate to look up frequencies in public databases such as gnomAD, 1KGP, or CCDG. I think filtering on FILTER=PASS is important. May I also ask what types of SVs are in your WES data? Are they primarily inferred CNVs, or coding DUP/DEL calls with breakpoints? It would also be helpful to know which tool you used to generate the SV calls.

Best
Bohan
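
For readers following along, the two points in the reply above (keep only PASS records, and avoid counting related individuals) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the maintainer's pipeline: restricting the count to the parents (founders) is a workaround that was not suggested in this thread, the function name `founder_allele_frequency` and the sample names in the usage example are hypothetical, and the merged VCF is assumed to carry a `GT` entry in its FORMAT column.

```python
def founder_allele_frequency(vcf_lines, founders):
    """Yield (chrom, pos, sv_id, af) for PASS records, counting founders only.

    vcf_lines: iterable of text lines from a multi-sample VCF.
    founders:  set of sample names treated as unrelated (e.g. the parents).
    """
    founder_columns = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Sample columns start at index 9 of the header line.
            founder_columns = [
                i for i, name in enumerate(fields) if i >= 9 and name in founders
            ]
            continue
        if fields[6] != "PASS":
            continue  # column 7 of a VCF record is FILTER
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue  # no genotypes to count for this record
        gt_pos = format_keys.index("GT")
        alt_alleles = 0
        called_alleles = 0
        for i in founder_columns:
            genotype = fields[i].split(":")[gt_pos]
            for allele in genotype.replace("|", "/").split("/"):
                if allele == ".":
                    continue  # missing allele: not counted either way
                called_alleles += 1
                if allele != "0":
                    alt_alleles += 1
        if called_alleles:
            yield fields[0], int(fields[1]), fields[2], alt_alleles / called_alleles
```

A possible call, assuming the merged file is named `merged.vcf` and the parental sample IDs are known from the pedigree:

```python
with open("merged.vcf") as handle:
    for chrom, pos, sv_id, af in founder_allele_frequency(handle, {"father_01", "mother_01"}):
        print(chrom, pos, sv_id, round(af, 4))
```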

@poddarharsh15
Author

Hi @jasonbhn,

Thanks so much for your response!

To clarify, I’m using the structural variant caller dysgu (https://github.com/kcleal/dysgu), and my VCF files contain DUP/DEL/INS/TRA/INV variants with breakpoints. For reference, I’ve attached a test.vcf file to give you a clearer picture of the data I’m handling.

I’ve been trying to estimate population frequency within my cohort of 441 samples to retain the variants common to at least 90% of the population, which is why I didn’t use SVAFotate to estimate the AF. I opted for this approach instead of calculating allele frequency (AF) because, after merging the VCF files, I seem to lose a lot of information, particularly the GT fields. You can see this issue in the attached test file.

Let me know if there’s any additional info you need or if you have suggestions to handle this better. I really appreciate your insights and look forward to your advice!

Thanks again! 😊

top_300_rows_with_header.txt
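
The cohort-level frequency described above (the share of samples carrying a variant, rather than an allele frequency) is sensitive to how genotypes lost during merging are treated. The sketch below is a minimal illustration, not a tested workflow for dysgu output: the helper names `carrier_fraction` and `is_rare` are hypothetical, and the 10% cutoff is taken from the original question. It shows the two obvious choices, excluding missing genotypes from the denominator or counting them as non-carriers.

```python
def carrier_fraction(genotypes, missing_as_reference=False):
    """Return the fraction of samples carrying at least one non-reference allele.

    genotypes: list of GT strings for one variant, e.g. ["0/1", "./.", "0/0"].
    missing_as_reference: if True, a fully missing genotype counts as a
        non-carrier; if False, it is left out of the denominator.
    Returns None when no sample can be counted.
    """
    carriers = 0
    counted = 0
    for genotype in genotypes:
        alleles = genotype.replace("|", "/").split("/")
        if all(allele == "." for allele in alleles):
            if missing_as_reference:
                counted += 1
            continue
        counted += 1
        if any(allele not in ("0", ".") for allele in alleles):
            carriers += 1
    return carriers / counted if counted else None


def is_rare(genotypes, max_fraction=0.10, missing_as_reference=False):
    """True if the carrier fraction is known and at most max_fraction."""
    fraction = carrier_fraction(genotypes, missing_as_reference)
    return fraction is not None and fraction <= max_fraction
```

When many GT fields are missing, the two settings can give quite different fractions for the same variant, so it is worth checking how much the rare/common split changes between them before ranking anything.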
