-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add methods to fix mate info on non-primaries and templates #204
base: main
Are you sure you want to change the base?
Conversation
30dbc8a
to
1c4f647
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #204 +/- ##
==========================================
+ Coverage 91.06% 91.44% +0.37%
==========================================
Files 18 18
Lines 2283 2314 +31
Branches 337 349 +12
==========================================
+ Hits 2079 2116 +37
+ Misses 133 124 -9
- Partials 71 74 +3 ☔ View full report in Codecov by Sentry. |
491777b
to
c755f6b
Compare
1c4f647
to
b441d30
Compare
c755f6b
to
56d9692
Compare
b441d30
to
32024b9
Compare
If early review is favorable, I'll go and finish the PR with unit tests! |
32024b9
to
8c69815
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the code for setting values on supplementaries is a good addition. I'm still a little wary of the choices for secondaries ... if anyone is using an aligner that is better than bwa at reporting secondary alignments, this might override more useful information with arguably incorrect values.
WalkthroughThe changes enhance the handling of mate information in SAM (Sequence Alignment/Map) records. A new private function Finishing Touches
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
fgpyo/sam/__init__.py (2)
867-895
: Good extraction of common logic into_set_common_mate_fields
.This helper function is clean and well-documented. One minor point: consider clarifying in the docstring whether
source
may be unmapped or not, given that the function still sets mate-score tags unconditionally regardless of mapping status.
1239-1263
: Template-levelset_mate_info
usage is well structured.The stepwise approach for applying mate info to R1, R2, secondaries, and supplementals is logical. If you add additional flags or references in the future, consider factoring out repeated sections for clarity.
tests/fgpyo/sam/test_sam.py (1)
713-714
: Minor spelling nitpick in the test name.It should be "mismatched" rather than "mimatched" in
test_set_mate_info_raises_mimatched_query_names
.-def test_set_mate_info_raises_mimatched_query_names() -> None: +def test_set_mate_info_raises_mismatched_query_names() -> None:
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
fgpyo/sam/__init__.py
(4 hunks)tests/fgpyo/sam/test_sam.py
(3 hunks)tests/fgpyo/sam/test_template_iterator.py
(1 hunks)
🔇 Additional comments (16)
fgpyo/sam/__init__.py (3)
181-181
: Imports approved.
Using Self
and deprecated
here is acceptable to maintain compatibility with older versions of Python while offering newer features.
897-935
: Solid pair info update in set_mate_info
.
The approach of calling _set_common_mate_fields
twice (once per direction) and then setting TLEN and proper-pair bits is clear. No issues found with error handling and logic flow.
927-971
: Supplementary and secondary info updates look good.
Both set_mate_info_on_secondary
and set_mate_info_on_supplementary
handle relevant checks thoroughly. The approach of disallowing a supplementary mate source is consistent with your docstring. No direct code issues spotted.
tests/fgpyo/sam/test_template_iterator.py (2)
222-224
: No issues at these lines.
225-389
: Thorough testing in test_template_set_mate_info
.
Your test covers both primary and secondary, supplementary, and secondary-supp combos. The coverage looks comprehensive and confirms correctness of the new methods.
tests/fgpyo/sam/test_sam.py (11)
24-26
: Import statements are fine.
689-698
: Correct check for read ordinal mismatch in test_set_mate_info_raises_not_opposite_read_ordinals
.
This test correctly ensures error-raising if both reads are flagged R1 or both R2.
700-708
: Proper error test for supplementary second record.
832-860
: test_set_mate_info_on_secondary
successfully verifies mate info for a secondary record.
862-889
: Robust negative tests in test_set_mate_info_on_secondary_raises_for_secondary_or_supp_rec2
.
These do help confirm that we only set mate info from a primary mate. No issues spotted.
891-919
: test_set_mate_info_on_supplementary
logic mirrors the secondary approach.
921-948
: Validation checks for a primary mate are consistent in test_set_mate_info_on_supplementary_raises_for_secondary_or_supp_rec2
.
950-964
: Additional field-setting for primary supplements is well-tested.
966-981
: Secondary + supplementary check is thorough.
Ensures that no primary-based fields are overwritten on secondaries. Good.
983-994
: test_set_pair_info_raises_exception_for_mismatched_query_names
is a fine negative test.
996-1028
: Verifies legacy set_pair_info
for two mapped records.
Nicely ensures backward compatibility for older usage patterns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/fgpyo/sam/test_sam.py (1)
832-889
: Add test for mapped secondary alignment.Current tests only cover unmapped case. Add test with mapped secondary alignment to ensure complete coverage.
def test_set_mate_info_on_secondary_mapped(): """Test set_mate_info_on_secondary with mapped secondary alignment.""" builder = SamBuilder() secondary, primary = builder.add_pair(chrom="chr1", start1=100, start2=200) secondary.is_secondary = True set_mate_info_on_secondary(secondary, primary) assert secondary.next_reference_start == primary.reference_start assert secondary.template_length == primary.template_length
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/fgpyo/sam/test_sam.py
(3 hunks)
🔇 Additional comments (2)
tests/fgpyo/sam/test_sam.py (2)
689-708
: LGTM: Validation tests for set_mate_info are thorough.
Tests cover essential validation cases for read ordinals and supplementary alignments.
891-981
: LGTM: Comprehensive supplementary alignment tests.
Tests cover both primary and secondary supplementary cases with proper validation of mate info fields.
@nh13 can you take a look through this one and give your opinions please? |
fgpyo/sam/__init__.py
Outdated
dest.mate_is_mapped = source.is_mapped | ||
dest.set_tag("MC", source.cigarstring) | ||
dest.set_tag("MQ", source.mapping_quality) | ||
dest.set_tag("ms", sum_of_base_qualities(source)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank-you!
if not supp.is_secondary: | ||
supp.is_proper_pair = mate_primary.is_proper_pair | ||
supp.template_length = -mate_primary.template_length | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think template-length is correct, since the mate_primary.template_length
is calculated from the primary
relative to supp
. Should we recalculate it, or should we set it to zero since it's not the primary-primary alignment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I'm having a hard time seeing how this is wrong. Really wish we could whiteboard!
Here, "mate primary" is the primary alignment of the supplemental's mate. What we want to do is have the supplement's template length be equal to the primary of the supplement (which we unfortunately do not have access to). However, the implementation in fgpyo
(and elsewhere) flips the sign of the template length for mates. So, if we flip the sign of the supplemental's mate primary, then we should have the value that would have been set upon the supplemental's primary.
For secondary supplementals (and for secondaries in general) I am not setting template length anywhere because I feel the specification leaves this ambiguous, and a smarter aligner (or custom code) might set specific template lengths on the secondaries and I don't want to override that info. I feel the spec. is ambiguous for secondaries because I don't see reference to RNAME
/RNEXT
in the definition of TLEN
.
if self.r1 is not None and self.r2 is not None: | ||
set_mate_info(self.r1, self.r2, is_proper_pair=is_proper_pair, isize=isize) | ||
if self.r1 is not None: | ||
for rec in self.r2_secondaries: | ||
set_mate_info_on_secondary(secondary=rec, mate_primary=self.r1) | ||
for rec in self.r2_supplementals: | ||
set_mate_info_on_supplementary(supp=rec, mate_primary=self.r1) | ||
if self.r2 is not None: | ||
for rec in self.r1_secondaries: | ||
set_mate_info_on_secondary(secondary=rec, mate_primary=self.r2) | ||
for rec in self.r1_supplementals: | ||
set_mate_info_on_supplementary(supp=rec, mate_primary=self.r2) | ||
return self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was expecting this:
if self.r1 is not None and self.r2 is not None: | |
set_mate_info(self.r1, self.r2, is_proper_pair=is_proper_pair, isize=isize) | |
if self.r1 is not None: | |
for rec in self.r2_secondaries: | |
set_mate_info_on_secondary(secondary=rec, mate_primary=self.r1) | |
for rec in self.r2_supplementals: | |
set_mate_info_on_supplementary(supp=rec, mate_primary=self.r1) | |
if self.r2 is not None: | |
for rec in self.r1_secondaries: | |
set_mate_info_on_secondary(secondary=rec, mate_primary=self.r2) | |
for rec in self.r1_supplementals: | |
set_mate_info_on_supplementary(supp=rec, mate_primary=self.r2) | |
return self | |
if self.r1 is not None: | |
for rec in self.all_r2s: | |
_set_common_mate_fields(dest=rec, source=self.r1) | |
if self.r2 is not None: | |
for rec in self.all_r1s: | |
_set_common_mate_fields(dest=rec, source=self.r2) | |
return self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With my last commit, I still have two separate functions for secondaries and supplementaries. They aren't that different from _set_common_mate_fields
except for:
- Additional checking for if the record is secondary or supplementary for safety sake
- Proper pair and template length settings on supplementaries (but only for non-secondary supplementaries).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (5)
tests/fgpyo/sam/test_sam.py (4)
694-696
: Make error message more descriptive.Current message could be clearer about what "different read ordinals" means.
- ValueError, match="mate_primary and dest records must be of different read ordinals!" + ValueError, match="mate_primary and dest records must be from different reads (R1/R2)!"
834-862
: Add test cases for mapped records.Current test only covers unmapped records. Add cases for mapped records to ensure proper handling of reference coordinates and flags.
894-922
: Add test cases for mapped records.Similar to secondary test, add cases for mapped records to verify reference coordinates and flags.
Line range hint
1246-1258
: Consider alternative implementation.Per PR comments, consider restructuring to:
- Process all R2s with R1 primary
- Process all R1s with R2 primary
This approach might be more maintainable.
- if self.r1 is not None and self.r2 is not None: - set_mate_info(self.r1, self.r2, is_proper_pair=is_proper_pair, isize=isize) - if self.r1 is not None: - for rec in self.r2_secondaries: - set_mate_info_on_secondary(secondary=rec, mate_primary=self.r1) - for rec in self.r2_supplementals: - set_mate_info_on_supplementary(supp=rec, mate_primary=self.r1) - if self.r2 is not None: - for rec in self.r1_secondaries: - set_mate_info_on_secondary(secondary=rec, mate_primary=self.r2) - for rec in self.r1_supplementals: - set_mate_info_on_supplementary(supp=rec, mate_primary=self.r2) + if self.r1 is not None: + for rec in self.all_r2s(): + _set_common_mate_fields(dest=rec, source=self.r1) + if self.r2 is not None: + for rec in self.all_r1s(): + _set_common_mate_fields(dest=rec, source=self.r2)fgpyo/sam/__init__.py (1)
899-900
: Add type hints for callback parameters.Add return type hints for is_proper_pair and isize callbacks.
- is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair, - isize: Callable[[AlignedSegment, AlignedSegment], int] = isize, + is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair, # type: -> bool + isize: Callable[[AlignedSegment, AlignedSegment], int] = isize, # type: -> int
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
fgpyo/sam/__init__.py
(4 hunks)tests/fgpyo/sam/test_sam.py
(3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: testing (3.10)
- GitHub Check: testing (3.9)
🔇 Additional comments (2)
tests/fgpyo/sam/test_sam.py (1)
965-967
: Verify template length calculation.Per discussion in PR comments, template length might need recalculation or zeroing since it's not a primary-primary alignment.
fgpyo/sam/__init__.py (1)
965-967
: Review template length handling.Verify if negating mate's template length is correct for supplementary alignments. Consider:
- Recalculating based on alignment positions
- Setting to zero for non-primary alignments
Closes #83
Closes #182
I need a set of functions to make it easier to set mate info on secondary, supplemental, "secondary supplemental", and templates. The work in this PR builds on the work in:
This PR introduces:
set_mate_info_on_secondary
set_mate_info_on_supplementary
set_mate_info_for_template
Template.set_mate_info()
All "mate" fields are set using the primary of the opposite read ordinal.
However, I am seeking advice on how we should set:
is_proper_pair
template_length
For Secondary Alignments
For secondary alignments, I personally am curious about those values not as they are computed on the two primary alignments, but between the secondary alignment and the primary of the opposite ordinal (e.g. if this secondary was primary, then what would it look like paired with the primary of the opposite ordinal).
For Supplementary Alignments
For supplementary alignments of the primary alignment, it makes sense to have these two values point to the primary alignment of the mate. However, for secondary supplemental alignments, I think these values are undefined without knowing more about the specific alignment situation, so I am choosing to leave them as-is.