
SvPileup should ignore reads marked as duplicates #41

Open
ameynert opened this issue Jul 15, 2024 · 5 comments · May be fixed by #42 or #45
@ameynert

The number of pileups changes when duplicate reads are removed.

Test examples went from 575 to 283 pileups and from 2030 to 550 pileups.

@nh13
Member

nh13 commented Jul 15, 2024

I would add an option to the tool to ignore duplicates, with default false for backwards compatibility.

@tfenne
Member

tfenne commented Jul 15, 2024

I'm going to disagree slightly and suggest that we add:

    val includeDuplicates: Boolean = false

and bump the major version number. It seems like a big footgun to leave in place unless we think there's a good reason the majority of use cases would call for also looking at the duplicates?
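Such an option would gate filtering on the duplicate bit (0x400) of the SAM FLAG field. A minimal Python sketch of that check, with the parameter name mirroring the proposal above and the reads invented for illustration (fgsv itself is Scala):

```python
# SAM FLAG bit for "PCR or optical duplicate" (0x400), per the SAM spec.
DUPLICATE = 0x400

def keep_read(flag: int, include_duplicates: bool = False) -> bool:
    """When include_duplicates is False (the proposed default), skip any
    read whose FLAG has the duplicate bit set."""
    return include_duplicates or not (flag & DUPLICATE)

# Hypothetical (read_name, FLAG) pairs: 99/147 are plain proper-pair
# reads; 1123 and 1171 are the same flags with 0x400 OR-ed in.
reads = [("q1", 99), ("q1", 147), ("q2", 1123), ("q2", 1171)]

kept_default = [name for name, flag in reads if keep_read(flag)]
kept_all = [name for name, flag in reads if keep_read(flag, include_duplicates=True)]
print(kept_default)  # ['q1', 'q1'] -- duplicate-flagged reads dropped
print(kept_all)      # ['q1', 'q1', 'q2', 'q2'] -- everything retained
```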

@nh13
Member

nh13 commented Jul 15, 2024

A major bump would be fine by me.

@clintval clintval self-assigned this Dec 10, 2024
@clintval
Member

I didn't see #42 until after also attempting a solution.

@clintval clintval linked a pull request Dec 11, 2024 that will close this issue
@ameynert
Author

ameynert commented Jan 8, 2025

I tested the two branches #42 and #45 that were written to address this issue and they had identical results (yay!). I used a sample with 71,184 read pairs (with UMIs) that could potentially support a rearrangement, i.e. they either spanned an expected breakpoint or had a split read around an expected breakpoint.

| Input file | Read pairs | `--include-duplicates=false` | `--include-duplicates=true` |
|---|---|---|---|
| No duplicate marking or removal | 71,184 | 771 | 771 |
| Picard MarkDuplicates only | 71,184 | 262 | 771 |
| Umitools group & dedup | 7,903 | 296 | 296 |

The behaviour appears to be as expected. If duplicate reads are allowed, the first two input files produce identical output: they have the same content, with duplicate reads marked only in the "Picard MarkDuplicates only" file. If duplicate reads are not allowed, the number of SV pileups drops from 771 to 262, but only for the file with marked duplicates. I confirmed that the coordinates of all 262 pileups are found in the set of 771.
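The subset check described above can be sketched as a set comparison over pileup coordinates; the column names and coordinates below are invented stand-ins, not the actual SvPileup output schema:

```python
import csv
import io

# Toy stand-ins for the two SvPileup outputs; the real files are TSVs
# with many more columns, and these coordinates are made up.
pileups_all = "chrom\tpos\nchr1\t100\nchr1\t250\nchr2\t500\n"
pileups_dedup = "chrom\tpos\nchr1\t100\nchr2\t500\n"

def coords(text: str) -> set:
    """Collect (chrom, pos) pairs from a TSV with a header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {(row["chrom"], row["pos"]) for row in reader}

# Every coordinate from the duplicate-filtered run should also appear
# in the unfiltered run, as observed for the 262-vs-771 comparison.
print(coords(pileups_dedup) <= coords(pileups_all))  # True
```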

If UMI de-duplication is used, duplicates are removed entirely because umitools selects a single representative read for each UMI group, so the --include-duplicates flag for fgsv SvPileup has no effect. Running fgsv SvPileup on this file returns 296 pileups, more than the 262 returned for the input file with only duplicates marked. This is expected because Picard marks duplicates based only on the aligned coordinates, whereas umitools has the additional information of the UMI tagging the original fragment. However, only 128 of the pileups have the same coordinates between the Picard-marked and umitools de-duplicated input files, so the Picard set of pileups is not a strict subset of the umitools set.

Note: the use of the --include-qc-fails flag had no effect on the output, as the process used to generate the input files removed low-quality reads and did not set the "read fails platform/vendor quality checks" (0x200) SAM flag.
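For reference, the flag in question is bit 0x200 of the SAM FLAG field. A minimal stdlib sketch of how a read would be recognised as QC-failed; the FLAG values below are invented for illustration, not taken from these input files:

```python
# SAM FLAG bits, per the SAM specification.
QC_FAIL = 0x200    # read fails platform/vendor quality checks
DUPLICATE = 0x400  # PCR or optical duplicate

def is_qc_fail(flag: int) -> bool:
    return bool(flag & QC_FAIL)

# Toy FLAG values: 99 and 147 are plain proper-pair reads; OR-ing in
# the bits above simulates flagged reads. A read can carry both bits.
flags = [99, 147, 99 | DUPLICATE, 99 | QC_FAIL | DUPLICATE]
qc_fail_count = sum(is_qc_fail(f) for f in flags)
print(qc_fail_count)  # 1: only the last read carries the 0x200 bit
```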
