Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Insert size filtering #214

Merged
merged 11 commits into from
Nov 11, 2023
Merged

Insert size filtering #214

merged 11 commits into from
Nov 11, 2023

Conversation

LudvigOlsen
Copy link
Contributor

@LudvigOlsen LudvigOlsen commented Nov 1, 2023

Since I needed the insert size filtering (#213, #122) for my current analysis, I figured out the nim thing and made it work! :-)

It adds two arguments to the CLI: --min_len and --max_len. I chose the letters l for lower and u for upper.

billede

I tested it on a file and it seemed to work. I asked for fragments of size 110-115 and looked in the *per-base.bed file. All the intervals with a 1 count had a length between 110-115 (majority) or smaller which I guess is in case of overlapping fragments.

I used: cat between_110_115.per-base.bed | awk '$4 == 1 {print $1,$2,$3,$3-$2,$4}' | head -n500
and got:

billede
(Second last column is the interval length)

I've run with each of them separately, which also works.

@LudvigOlsen LudvigOlsen marked this pull request as ready for review November 1, 2023 20:45
@@ -870,7 +878,7 @@ Other options:
stderr.write_line("[mosdepth] error alignment file must be indexed")
quit(2)

var opts = SamField.SAM_FLAG.int or SamField.SAM_RNAME.int or SamField.SAM_POS.int or SamField.SAM_MAPQ.int or SamField.SAM_CIGAR.int
var opts = SamField.SAM_FLAG.int or SamField.SAM_RNAME.int or SamField.SAM_POS.int or SamField.SAM_MAPQ.int or SamField.SAM_CIGAR.int or SamField.SAM_TLEN.int
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assumed this was necessary but I haven't actually tested if it's the case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And if it is correct, we can remove the comment in line 883

Ludvig added 2 commits November 1, 2023 22:00
Otherwise, this PR might cause reads with insert size 0 to be excluded, which might be okay but is not the intended change I want to make
@LudvigOlsen
Copy link
Contributor Author

Since I use paired-end sequencing, insert size is the right way to get the fragment length. But for single-read sequencing it might not be? What's the best approach to handling this?

@brentp
Copy link
Owner

brentp commented Nov 2, 2023

Hi @LudvigOlsen , nice job figuring this out. Indeed this looks like you've handled it cleanly and correctly.

isize is actually fragment length, so I think this should work as you have it for single end.

as to the PR, and the general reason I'm hesitant to add features. This feature makes sense, and now you've implemented it, but now I need to maintain, test, and document this. And it adds extra load to the user to have 2 additional parameters that will very rarely be used. That load translates into additional questions.
So, that's my concern with this.

Let me think about what to do here.

@LudvigOlsen
Copy link
Contributor Author

LudvigOlsen commented Nov 2, 2023

Hi @brentp

Good to hear! I think I've seen template lengths of 0 before, so I was worried you might get an insert size of 0 in single-read data.

I understand it's more complex than just the implementation I've suggested. And I really appreciate your work on mosdepth. It's such a great tool!

My colleagues and I, across multiple research groups, tend to need this functionality. And the current approach is to subset large bam files for every experiment, which is a bit of a hassle. mosdepth is amazing and this is the main functionality that I'm missing. For context, I work on cancer detection via whole genome sequenced cell-free DNA. Here, it's common to look at ranges of fragment lengths, as some tumor fragments have been found to be smaller. And in my work specifically, I wish to only look at the common range of fragment sizes across multiple datasets, in order to increase generalization across the datasets.

If I can help with the testing and documentation of these features, do let me know. I'm very new to nim, but do have extensive experience with unit/regression testing.

@odinokov
Copy link

odinokov commented Nov 5, 2023

@LudvigOlsen, it could be a very useful feature. Thank you!

@brentp
Copy link
Owner

brentp commented Nov 9, 2023

@LudvigOlsen would you write a test for this? you can add to functional-tests.sh there should be an error when max-frag-len <= min-frag-len and a test for that. Also a test that runs successfully with those options.
And I think --min-frag-len and --max-frag-len, what do you think?

@LudvigOlsen
Copy link
Contributor Author

Hi @brentp

Sure! I added the error and updated the argument names.

If you want fragments with a specific fragment length, I guess min and max should be allowed to be equal?

I will look at the testing soon

@brentp
Copy link
Owner

brentp commented Nov 9, 2023

If you want fragments with a specific fragment length, I guess min and max should be allowed to be equal?

Ah, yes, that's right. For testing, you can just use one of the current test bams and manually find, for example an exact fragment length, then add that as the test.

@brentp
Copy link
Owner

brentp commented Nov 9, 2023

thank you!

@LudvigOlsen
Copy link
Contributor Author

@brentp I have added some tests. First time with shell-based tests but I think, I figured it out. Perhaps try running them on your side as well though. Any more tests you want added? :-)

@brentp brentp merged commit 8696b5b into brentp:master Nov 11, 2023
0 of 4 checks passed
@brentp
Copy link
Owner

brentp commented Nov 11, 2023

Thanks very much @LudvigOlsen! I'll get out a new release next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants