
SvPileup should ignore reads marked as duplicates #41

Open
ameynert opened this issue Jul 15, 2024 · 5 comments · May be fixed by #42 or #45
@ameynert

The number of pileups changes when duplicate reads are removed.

Test examples went from 575 to 283 pileups and from 2030 to 550 pileups.

@nh13
Member

nh13 commented Jul 15, 2024

I would add an option to the tool to ignore duplicates, with default false for backwards compatibility.

@tfenne
Member

tfenne commented Jul 15, 2024

I'm going to disagree slightly and suggest that we add:

    val includeDuplicates: Boolean = false

and bump the major version number. It seems like a big footgun to leave in place unless we think there's a good reason the majority of use cases would call for also looking at the duplicates?
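Such an option would gate filtering on the duplicate bit (0x400) of the SAM FLAG field. A minimal Python sketch of that check, with the parameter name mirroring the proposal above and the reads invented for illustration (fgsv itself is Scala):

```python
# SAM FLAG bit for "PCR or optical duplicate" (0x400), per the SAM spec.
DUPLICATE = 0x400

def keep_read(flag: int, include_duplicates: bool = False) -> bool:
    """When include_duplicates is False (the proposed default), skip any
    read whose FLAG has the duplicate bit set."""
    return include_duplicates or not (flag & DUPLICATE)

# Hypothetical (read_name, FLAG) pairs: 99/147 are plain proper-pair
# reads; 1123 and 1171 are the same flags with 0x400 OR-ed in.
reads = [("q1", 99), ("q1", 147), ("q2", 1123), ("q2", 1171)]

kept_default = [name for name, flag in reads if keep_read(flag)]
kept_all = [name for name, flag in reads if keep_read(flag, include_duplicates=True)]
print(kept_default)  # ['q1', 'q1'] -- duplicate-flagged reads dropped
print(kept_all)      # ['q1', 'q1', 'q2', 'q2'] -- everything retained
```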

@nh13
Member

nh13 commented Jul 15, 2024

A major bump would be fine by me.

@clintval clintval self-assigned this Dec 10, 2024
@clintval
Member

I didn't see #42 until after also attempting a solution.

@clintval clintval linked a pull request Dec 11, 2024 that will close this issue
@ameynert
Author

ameynert commented Jan 8, 2025

I tested the two branches #42 and #45 that were written to address this issue and they had identical results (yay!). I used a sample with 71,184 read pairs (with UMIs) that could potentially support a rearrangement, i.e. they either spanned an expected breakpoint or had a split read around an expected breakpoint.

| Input file | Read pairs | `--include-duplicates=false` | `--include-duplicates=true` |
|---|---|---|---|
| No duplicate marking or removal | 71,184 | 771 | 771 |
| Picard MarkDuplicates only | 71,184 | 262 | 771 |
| Umitools group & dedup | 7,903 | 296 | 296 |

The behaviour appears to be as expected. If duplicate reads are allowed, the first two input files produce identical output: they have the same content, with duplicate reads marked only in the "Picard MarkDuplicates only" file. If duplicate reads are not allowed, the number of SV pileups drops from 771 to 262, but only for the file with marked duplicates. I confirmed that the coordinates of all 262 pileups are found in the set of 771.
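The subset check described above can be sketched as a set comparison over pileup coordinates; the column names and coordinates below are invented stand-ins, not the actual SvPileup output schema:

```python
import csv
import io

# Toy stand-ins for the two SvPileup outputs; the real files are TSVs
# with many more columns, and these coordinates are made up.
pileups_all = "chrom\tpos\nchr1\t100\nchr1\t250\nchr2\t500\n"
pileups_dedup = "chrom\tpos\nchr1\t100\nchr2\t500\n"

def coords(text: str) -> set:
    """Collect (chrom, pos) pairs from a TSV with a header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {(row["chrom"], row["pos"]) for row in reader}

# Every coordinate from the duplicate-filtered run should also appear
# in the unfiltered run, as observed for the 262-vs-771 comparison.
print(coords(pileups_dedup) <= coords(pileups_all))  # True
```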

If UMI de-duplication is used, duplicates are removed entirely because umitools selects a single representative read for each UMI group, so the --include-duplicates flag for fgsv SvPileup has no effect. Running fgsv SvPileup on this file returns 296 pileups, more than the 262 returned for the input file with only duplicates marked. This is expected because Picard marks duplicates based only on the aligned coordinates, whereas umitools has the additional information of the UMI tagging the original fragment. However, only 128 of the pileups have the same coordinates between the Picard-marked and umitools de-duplicated input files, so the Picard set of pileups is not a strict subset of the umitools set.

Note: the use of the --include-qc-fails flag had no effect on the output, as the process used to generate the input files removed low-quality reads and did not set the "read fails platform/vendor quality checks" (0x200) SAM flag.
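For reference, the flag in question is bit 0x200 of the SAM FLAG field. A minimal stdlib sketch of how a read would be recognised as QC-failed; the FLAG values below are invented for illustration, not taken from these input files:

```python
# SAM FLAG bits, per the SAM specification.
QC_FAIL = 0x200    # read fails platform/vendor quality checks
DUPLICATE = 0x400  # PCR or optical duplicate

def is_qc_fail(flag: int) -> bool:
    return bool(flag & QC_FAIL)

# Toy FLAG values: 99 and 147 are plain proper-pair reads; OR-ing in
# the bits above simulates flagged reads. A read can carry both bits.
flags = [99, 147, 99 | DUPLICATE, 99 | QC_FAIL | DUPLICATE]
qc_fail_count = sum(is_qc_fail(f) for f in flags)
print(qc_fail_count)  # 1: only the last read carries the 0x200 bit
```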
