Insert size filtering #214

LudvigOlsen · 2023-11-01T20:40:11Z

Since I needed the insert size filtering (#213, #122) for my current analysis, I figured out the nim thing and made it work! :-)

It adds two arguments to the CLI: --min_len and --max_len. I chose the letters l for lower and u for upper.

I tested it on a file and it seemed to work. I asked for fragments of size 110-115 and looked in the *per-base.bed file. All the intervals with a count of 1 had a length within 110-115 (the majority) or smaller, which I guess happens when fragments overlap.

I used: cat between_110_115.per-base.bed | awk '$4 == 1 {print $1,$2,$3,$3-$2,$4}' | head -n500
and got:

(Second last column is the interval length)

I've run with each of them separately, which also works.
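The manual check above could also be scripted. The following is only a sketch, assuming the per-base BED file has four whitespace-separated columns (chrom, start, end, depth); the function name `depth_one_lengths` and the sample rows are hypothetical and not taken from the PR.

```python
import io


def depth_one_lengths(bed_lines):
    """Yield end - start for every interval whose depth column equals 1."""
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        start, end, depth = int(fields[1]), int(fields[2]), int(fields[3])
        if depth == 1:
            yield end - start


# Hypothetical sample rows standing in for between_110_115.per-base.bed.
sample = io.StringIO(
    "chr1\t0\t100\t0\n"
    "chr1\t100\t212\t1\n"
    "chr1\t212\t230\t2\n"
    "chr1\t230\t325\t1\n"
)

lengths = list(depth_one_lengths(sample))
# No depth-1 interval should be longer than the requested maximum (115 here).
too_long = [n for n in lengths if n > 115]
print(lengths, too_long)
```

Intervals shorter than the minimum are expected wherever fragments overlap, so only the upper bound is checked here.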

LudvigOlsen · 2023-11-01T20:51:27Z

mosdepth.nim

@@ -870,7 +878,7 @@ Other options:
    stderr.write_line("[mosdepth] error alignment file must be indexed")
    quit(2)

-  var opts = SamField.SAM_FLAG.int or SamField.SAM_RNAME.int or SamField.SAM_POS.int or SamField.SAM_MAPQ.int or SamField.SAM_CIGAR.int
+  var opts = SamField.SAM_FLAG.int or SamField.SAM_RNAME.int or SamField.SAM_POS.int or SamField.SAM_MAPQ.int or SamField.SAM_CIGAR.int or SamField.SAM_TLEN.int


I assumed this was necessary but I haven't actually tested if it's the case.

And if it is correct, we can remove the comment in line 883

Otherwise, this PR might cause reads with insert size 0 to be excluded, which might be okay but is not the intended change I want to make
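To illustrate the concern about insert size 0, here is a minimal sketch of an inclusive length filter. It is not the Nim code from this PR: the function name `keep_fragment`, the use of the absolute value of TLEN, the inclusive bounds, and the large default for the upper bound are all assumptions made for illustration. Only the lower default of -1 comes from the commit list.

```python
def keep_fragment(tlen, min_len=-1, max_len=2**31 - 1):
    """Return True when |tlen| lies within [min_len, max_len].

    tlen is the signed template length from the alignment record;
    the SAM format stores it as negative for the rightmost segment.
    """
    size = abs(tlen)
    return min_len <= size <= max_len


# With the defaults, a record whose template length is 0 is still kept.
print(keep_fragment(0))
# With an explicit range, only fragments inside it pass.
print(keep_fragment(-112, min_len=110, max_len=115))
print(keep_fragment(200, min_len=110, max_len=115))
```

With a negative default for the lower bound, records reporting a template length of 0 pass the filter unless the user explicitly asks for a range.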

LudvigOlsen · 2023-11-02T12:01:21Z

Since I use paired-end sequencing, insert size is the right way to get the fragment length. But for single-read sequencing it might not be? What's the best approach to handling this?

brentp · 2023-11-02T12:16:11Z

Hi @LudvigOlsen , nice job figuring this out. Indeed this looks like you've handled it cleanly and correctly.

isize is actually fragment length, so I think this should work as you have it for single end.

As to the PR, and the general reason I'm hesitant to add features: this feature makes sense, and now you've implemented it, but now I need to maintain, test, and document it. And it adds extra load to the user to have 2 additional parameters that will very rarely be used. That load translates into additional questions.
So, that's my concern with this.

Let me think about what to do here.

LudvigOlsen · 2023-11-02T12:52:09Z

Hi @brentp

Good to hear! I think I've seen template lengths of 0 before, so I was worried you might get an insert size of 0 in single-read data.

I understand it's more complex than just the implementation I've suggested. And I really appreciate your work on mosdepth. It's such a great tool!

My colleagues and I, across multiple research groups, tend to need this functionality. And the current approach is to subset large bam files for every experiment, which is a bit of a hassle. mosdepth is amazing and this is the main functionality that I'm missing. For context, I work on cancer detection via whole genome sequenced cell-free DNA. Here, it's common to look at ranges of fragment lengths, as some tumor fragments have been found to be smaller. And in my work specifically, I wish to only look at the common range of fragment sizes across multiple datasets, in order to increase generalization across the datasets.

If I can help with the testing and documentation of these features, do let me know. I'm very new to nim, but do have extensive experience with unit/regression testing.

odinokov · 2023-11-05T05:25:42Z

@LudvigOlsen, it could be a very useful feature. Thank you!

brentp · 2023-11-09T07:27:35Z

@LudvigOlsen would you write a test for this? You can add it to functional-tests.sh. There should be an error when max-frag-len <= min-frag-len, and a test for that. Also a test that runs successfully with those options.
And I think the options should be named --min-frag-len and --max-frag-len. What do you think?

LudvigOlsen · 2023-11-09T12:29:09Z

Hi @brentp

Sure! I added the error and updated the argument names.

If you want fragments with a specific fragment length, I guess min and max should be allowed to be equal?

I will look at the testing soon

brentp · 2023-11-09T14:00:55Z

> If you want fragments with a specific fragment length, I guess min and max should be allowed to be equal?

Ah, yes, that's right. For testing, you can just use one of the current test bams and manually find, for example an exact fragment length, then add that as the test.
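The rule agreed on here (equal bounds are allowed, and a maximum below the minimum is an error) can be sketched as follows. This is an illustration only: the function name `validate_frag_len_range` and the error text are hypothetical, and the real check lives in the Nim CLI code.

```python
def validate_frag_len_range(min_frag_len, max_frag_len):
    """Reject a range whose maximum is below its minimum; equal bounds are fine."""
    if max_frag_len < min_frag_len:
        raise ValueError(
            "--max-frag-len must be greater than or equal to --min-frag-len"
        )
    return min_frag_len, max_frag_len


# Equal bounds select a single exact fragment length.
print(validate_frag_len_range(112, 112))

# A reversed range is rejected.
try:
    validate_frag_len_range(115, 110)
except ValueError as err:
    print("error:", err)
```

A functional test for an exact fragment length would then pass the same value for both bounds.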

brentp · 2023-11-09T14:01:04Z

thank you!

LudvigOlsen · 2023-11-10T18:20:39Z

@brentp I have added some tests. It's my first time with shell-based tests, but I think I figured it out. Perhaps try running them on your side as well, though. Any more tests you want added? :-)

brentp · 2023-11-11T06:21:31Z

Thanks very much @LudvigOlsen! I'll get out a new release next week.

Ludvig added 6 commits November 1, 2023 19:04

0f305c9 Adds length filter in coverage
08f186b Adds min_len and max_len to CLI
6f8bc26 tries isize instead of tlen
a6c030b Passes min_len and max_len to coverage()
a4a160f replaces template length with insert size in docs
53ba593 Removes bad let

LudvigOlsen marked this pull request as ready for review November 1, 2023 20:45

Ludvig added 2 commits November 1, 2023 22:00

1bf2bbd replaces fragment length with insert size in docs
d35fe31 Sets min_len default to -1

5df473f Adheres to argument naming standard ("-" not "_")
39f73bd Updates argnames with "-frag-". Adds error when max < min.
4bd8f0c Adds tests of fragment length filtering

brentp merged commit 8696b5b into brentp:master Nov 11, 2023
0 of 4 checks passed
