Calculating Population Frequency for Structural Variants in Whole Exome Sequencing Data from 147 Trios. #53

poddarharsh15 · 2024-12-05T10:49:43Z

I am working with Whole Exome Sequencing (WES) data from 147 trios (441 samples total). I have merged VCF files containing structural variants (SVs) for these samples and now need to calculate population frequencies to distinguish common from rare variants. Specifically, I want to focus on variants present in 10% or less of the population. Can I use your pipeline to estimate this frequency and later rank my variants?
What tools and methods are best suited for calculating SV frequencies in WES data? (I tried BCFtools without success.)
What filtering criteria should be applied to ensure high-quality SV calls before frequency calculation?

Thanks in advance,
HP.

jasonbhn · 2024-12-12T17:49:48Z

Hi,
Because the data contain trios, I would not recommend estimating AF directly from the 441 individuals. Rather, I would use something like SVAFotate to look up frequencies in public databases such as gnomAD, 1KGP, or CCDG. I think filtering on FILTER=PASS is important. May I also ask what types of SVs are in your WES data? Are they primarily inferred CNVs, or coding DUP/DEL calls with breakpoints? It would also be helpful to know which tool you used to generate the SV calls.

Best
Bohan
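
To illustrate why trios complicate a direct estimate: a child's alleles are copies of the parents' alleles, so counting all 441 samples counts some alleles twice. The sketch below is not the SVAFotate lookup suggested above. It is a minimal illustration, assuming the GT field is available, that the parents are unrelated to each other, and that sample IDs (`father`, `mother`, `child`, all hypothetical) can be mapped to family roles. It compares the allele frequency over all samples with the frequency over founders (parents) only.

```python
def allele_frequency(genotypes, samples):
    """ALT allele frequency over the given samples.

    genotypes: dict mapping sample ID -> GT string ("0/1", "1|1", "./.")
    samples:   iterable of sample IDs to include in the estimate
    Returns None when no allele is called in the selected samples.
    """
    alt = 0
    called = 0
    for sample in samples:
        gt = genotypes.get(sample, "./.")
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":          # missing allele: skip it
                continue
            called += 1
            if allele != "0":          # any non-reference allele
                alt += 1
    return alt / called if called else None


# One hypothetical trio in which the child inherited the father's ALT allele.
genotypes = {"father": "0/1", "mother": "0/0", "child": "0/1"}

af_all = allele_frequency(genotypes, ["father", "mother", "child"])
af_founders = allele_frequency(genotypes, ["father", "mother"])
print(af_all, af_founders)
```

Here the inherited allele is counted twice when the child is included, so the all-sample estimate is higher than the founder-only estimate.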

poddarharsh15 · 2024-12-13T09:56:49Z

Hi @jasonbhn,

Thanks so much for your response!

To clarify, I’m using the structural variant caller DYSGU [https://github.com/kcleal/dysgu] and my VCF files contain DUP/DEL/INS/TRA/INV variants with breakpoints. For reference, I’ve attached a test.vcf file to give you a clearer picture of the data I’m handling.

I’ve been trying to estimate population frequency within my cohort of 441 samples to retain the variants common to at least 90% of the population, which is why I didn’t use SVAFotate to estimate the AF. I opted for this approach instead of calculating allele frequency (AF) because, after merging the VCF files, I seem to lose a lot of information, particularly the GT fields. You can see this issue in the attached test file.

Let me know if there’s any additional info you need or if you have suggestions to handle this better. I really appreciate your insights and look forward to your advice!

Thanks again! 😊

top_300_rows_with_header.txt
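
The cohort-level "population frequency" described in this comment (the fraction of samples carrying a variant, as opposed to an allele frequency) can be sketched as follows. This is a minimal sketch under stated assumptions, not dysgu's or any pipeline's method: it assumes a tab-separated multi-sample VCF record whose FORMAT column starts with `GT`, treats samples with a missing GT as non-carriers, and reports how many GTs were missing so the information loss after merging is visible. The example record and the function name `carrier_fraction` are hypothetical.

```python
def carrier_fraction(vcf_line):
    """Return (fraction of samples carrying the variant, number of missing GTs).

    Assumes a standard VCF data line: 8 fixed columns, FORMAT, then one
    column per sample, with GT as the first FORMAT key.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    fmt_keys = fields[8].split(":")
    if fmt_keys[0] != "GT":
        raise ValueError("record has no GT field; carrier status is unknown")

    sample_cols = fields[9:]
    carriers = 0
    missing = 0
    for col in sample_cols:
        gt = col.split(":")[0]
        alleles = gt.replace("|", "/").split("/")
        if all(a == "." for a in alleles):
            missing += 1               # assumption: counted as non-carrier
        elif any(a not in ("0", ".") for a in alleles):
            carriers += 1
    return carriers / len(sample_cols), missing


# Hypothetical record with four samples, one of them with a missing genotype.
line = ("chr1\t1000\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL\tGT:GQ"
        "\t0/1:30\t0/0:40\t./.:.\t1/1:20")

fraction, n_missing = carrier_fraction(line)
is_rare = fraction <= 0.10             # the 10% cutoff from the first post
print(fraction, n_missing, is_rare)
```

If many genotypes come back missing, the fraction is a lower bound rather than an estimate, which is worth checking before applying any cutoff.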
