Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add methods to fix mate info on non-primaries and templates #204

Open
wants to merge 42 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
42 commits
Select commit Hold shift + click to select a range
8a4e218
Add a function for determining read pair orientation
clintval Dec 24, 2024
cb734da
chore: remove a newline
clintval Dec 24, 2024
3dd8bc4
docs: tweak an arg description
clintval Dec 24, 2024
e4d076d
docs: make hyperlinks for makedocs
clintval Dec 24, 2024
4ee5878
chore: update name of fn to is_properly_paired
clintval Dec 27, 2024
8bca675
chore: update name of fn to is_proper_pair
clintval Dec 27, 2024
b880b65
Deprecate set_pair_info and _set_mate_info for set_mate_info
clintval Dec 27, 2024
56722b0
chore: fix copy-paste formatting issue
clintval Dec 27, 2024
c74834b
Allow insert size calculation to work on 1 read only
clintval Dec 27, 2024
8c69815
Add methods to fix mate info on non-primaries and templates
clintval Dec 27, 2024
386d058
chore: remove set_as_pairs, unit tests for set_mate_info
clintval Dec 27, 2024
d8b2c42
chore: fixup a tiny regression
clintval Dec 27, 2024
a53f735
chore: deprecation after instead of since
clintval Dec 27, 2024
6f3c514
chore: change param names
clintval Dec 27, 2024
4e64644
chore: make arg more generically typed
clintval Dec 27, 2024
8c4ce6f
feat: only grab the mate cigar if absolutely necessary
clintval Dec 27, 2024
dc17e0d
chore: ensure 100% test coverage on diff
clintval Dec 27, 2024
935184b
chore: remove reference to R2
clintval Dec 27, 2024
0f5200f
Merge remote-tracking branch 'origin/main' into cv_properly_paired
clintval Dec 27, 2024
f7e81c4
Merge branch 'cv_properly_paired' into cv_better_mate_info
clintval Dec 27, 2024
e328f45
Merge remote-tracking branch 'origin/cv_better_mate_info' into cv_isize
clintval Dec 27, 2024
8a4973b
chore: fixup based on review
clintval Dec 27, 2024
f858479
chore: remove extra newline
clintval Dec 27, 2024
06938d7
chore: remove small change in tests
clintval Dec 27, 2024
6dea881
chore: add another test
clintval Dec 27, 2024
0d6ab25
Merge remote-tracking branch 'origin/cv_isize' into cv_fix_mates_seco…
clintval Dec 27, 2024
ce520e1
chore: extract common function to private func
clintval Dec 27, 2024
4c7f8b3
chore: further de-duplicate shared code
clintval Dec 27, 2024
8baaa48
chore: remove deprecated code
clintval Dec 27, 2024
206cec5
Merge remote-tracking branch 'origin/main' into cv_fix_mates_secondar…
clintval Dec 28, 2024
f95cc9e
feat: move the function only to Template
clintval Dec 28, 2024
7c4a938
feat: better specify ValueErrors in funcs
clintval Dec 28, 2024
7a2ca2c
chore: fixup docs a bit more
clintval Dec 28, 2024
dd7dba5
chore: fixup docs even more
clintval Dec 28, 2024
841a44a
fix: fixup remaining remaining loose ends
clintval Dec 28, 2024
7140063
chore: finish adding unit tests
clintval Dec 30, 2024
4f2fbea
docs: fix docstring typo
clintval Dec 30, 2024
ec56350
chore: add unit tests for Template.set_mate_info()
clintval Dec 30, 2024
7116964
chore: fix test typo
clintval Dec 30, 2024
716de79
chore: tweak phrasing in docstrings slightly
clintval Jan 7, 2025
8d2617c
Merge remote-tracking branch 'origin/main' into cv_fix_mates_secondar…
clintval Jan 7, 2025
0649c67
feat: mate info can only come from a primary record
clintval Jan 10, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
141 changes: 117 additions & 24 deletions fgpyo/sam/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -178,6 +178,7 @@
from pysam import AlignedSegment
from pysam import AlignmentFile as SamFile
from pysam import AlignmentHeader as SamHeader
from typing_extensions import Self
from typing_extensions import deprecated

import fgpyo.io
Expand Down Expand Up @@ -863,40 +864,107 @@ def sum_of_base_qualities(rec: AlignedSegment, min_quality_score: int = 15) -> i
return score


def _set_common_mate_fields(dest: AlignedSegment, mate_primary: AlignedSegment) -> None:
"""Set common mate info on a destination alignment from its mate's primary alignment.

Args:
dest: The alignment to set the mate info upon.
mate_primary: The primary alignment to use as a mate reference.

Raises:
ValueError: If dest and mate_primary are of the same read ordinal.
ValueError: If mate_primary is secondary or supplementary.
ValueError: If dest and mate_primary do not share the same query name.
"""
if mate_primary.is_read1 is dest.is_read1:
raise ValueError("mate_primary and dest records must be of different read ordinals!")
if mate_primary.is_supplementary or mate_primary.is_secondary:
raise ValueError("Mate info must be set from a non-supplementary non-secondary record!")
if mate_primary.query_name != dest.query_name:
raise ValueError("Cannot set mate info on alignments with different query names!")

dest.next_reference_id = mate_primary.reference_id
dest.next_reference_name = mate_primary.reference_name
dest.next_reference_start = mate_primary.reference_start
dest.mate_is_forward = mate_primary.is_forward
dest.mate_is_mapped = mate_primary.is_mapped
dest.set_tag("MC", mate_primary.cigarstring)
dest.set_tag("MQ", mate_primary.mapping_quality)
dest.set_tag("ms", sum_of_base_qualities(mate_primary))


def set_mate_info(
r1: AlignedSegment,
r2: AlignedSegment,
rec1: AlignedSegment,
rec2: AlignedSegment,
clintval marked this conversation as resolved.
Show resolved Hide resolved
is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair,
isize: Callable[[AlignedSegment, AlignedSegment], int] = isize,
) -> None:
"""Resets mate pair information between reads in a pair.
"""Resets mate pair information between two primary alignments that share a query name.

Args:
r1: Read 1 (first read in the template).
r2: Read 2 with the same query name as r1 (second read in the template).
rec1: The first record in the pair.
rec2: The second record in the pair.
is_proper_pair: A function that takes the two alignments and determines proper pair status.
isize: A function that takes the two alignments and calculates their isize.

Raises:
ValueError: If rec1 and rec2 are of the same read ordinal.
ValueError: If either rec1 or rec2 is secondary or supplementary.
ValueError: If rec1 and rec2 do not share the same query name.
"""
if r1.query_name != r2.query_name:
raise ValueError("Cannot set mate info on alignments with different query names!")
for dest, source in [(rec1, rec2), (rec2, rec1)]:
_set_common_mate_fields(dest=dest, mate_primary=source)

for src, dest in [(r1, r2), (r2, r1)]:
dest.next_reference_id = src.reference_id
dest.next_reference_name = src.reference_name
dest.next_reference_start = src.reference_start
dest.mate_is_forward = src.is_forward
dest.mate_is_mapped = src.is_mapped
dest.set_tag("MC", src.cigarstring)
dest.set_tag("MQ", src.mapping_quality)
template_length = isize(rec1, rec2)
rec1.template_length = template_length
rec2.template_length = -template_length

r1.set_tag("ms", sum_of_base_qualities(r2))
r2.set_tag("ms", sum_of_base_qualities(r1))
proper_pair = is_proper_pair(rec1, rec2)
rec1.is_proper_pair = proper_pair
rec2.is_proper_pair = proper_pair

template_length = isize(r1, r2)
r1.template_length = template_length
r2.template_length = -template_length

proper_pair = is_proper_pair(r1, r2)
r1.is_proper_pair = proper_pair
r2.is_proper_pair = proper_pair
def set_mate_info_on_secondary(secondary: AlignedSegment, mate_primary: AlignedSegment) -> None:
"""Set mate info on a secondary alignment from its mate's primary alignment.

Args:
secondary: The secondary alignment to set mate information upon.
mate_primary: The primary alignment of the secondary's mate.

Raises:
ValueError: If secondary and mate_primary are of the same read ordinal.
ValueError: If secondary and mate_primary do not share the same query name.
ValueError: If mate_primary is secondary or supplementary.
ValueError: If secondary is not marked as a secondary alignment.
"""
if not secondary.is_secondary:
raise ValueError("Cannot set mate info on an alignment not marked as secondary!")

_set_common_mate_fields(dest=secondary, mate_primary=mate_primary)


def set_mate_info_on_supplementary(supp: AlignedSegment, mate_primary: AlignedSegment) -> None:
"""Set mate info on a supplementary alignment from its mate's primary alignment.

Args:
supp: The supplementary alignment to set mate information upon.
mate_primary: The primary alignment of the supplementary's mate.

Raises:
ValueError: If supp and mate_primary are of the same read ordinal.
ValueError: If supp and mate_primary do not share the same query name.
ValueError: If mate_primary is secondary or supplementary.
ValueError: If supp is not marked as a supplementary alignment.
"""
if not supp.is_supplementary:
raise ValueError("Cannot set mate info on an alignment not marked as supplementary!")

_set_common_mate_fields(dest=supp, mate_primary=mate_primary)

# NB: for a non-secondary supplemental alignment, set the following to the same as the primary.
if not supp.is_secondary:
supp.is_proper_pair = mate_primary.is_proper_pair
supp.template_length = -mate_primary.template_length

Comment on lines +965 to 968
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think template-length is correct, since the mate_primary.template_length is calculated from the primary relative to supp. Should we recalculate it, or should we set it to zero since it's not the primary-primary alignment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm having a hard time seeing how this is wrong. Really wish we could whiteboard!

Here, "mate primary" is the primary alignment of the supplemental's mate. What we want to do is have the supplement's template length be equal to the primary of the supplement (which we unfortunately do not have access to). However, the implementation in fgpyo (and elsewhere) flips the sign of the template length for mates. So, if we flip the sign of the supplemental's mate primary, then we should have the value that would have been set upon the supplemental's primary.

For secondary supplementals (and for secondaries in general) I am not setting template length anywhere because I feel the specification leaves this ambiguous, and a smarter aligner (or custom code) might set specific template lengths on the secondaries and I don't want to override that info. I feel the spec. is ambiguous for secondaries because I don't see reference to RNAME/RNEXT in the definition of TLEN.


@deprecated("Use `set_mate_info()` instead. Deprecated after fgpyo 0.8.0.")
Expand All @@ -922,7 +990,7 @@ def set_pair_info(r1: AlignedSegment, r2: AlignedSegment, proper_pair: bool = Tr
r2.is_read2 = True
r2.is_read1 = False

set_mate_info(r1=r1, r2=r2, is_proper_pair=lambda a, b: proper_pair)
set_mate_info(rec1=r1, rec2=r2, is_proper_pair=lambda a, b: proper_pair)


@attr.s(frozen=True, auto_attribs=True)
Expand Down Expand Up @@ -1164,6 +1232,31 @@ def all_recs(self) -> Iterator[AlignedSegment]:
for rec in recs:
yield rec

def set_mate_info(
self,
is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair,
isize: Callable[[AlignedSegment, AlignedSegment], int] = isize,
) -> Self:
"""Reset all mate information on every alignment in the template.

Args:
is_proper_pair: A function that takes two alignments and determines proper pair status.
isize: A function that takes the two alignments and calculates their isize.
"""
if self.r1 is not None and self.r2 is not None:
set_mate_info(self.r1, self.r2, is_proper_pair=is_proper_pair, isize=isize)
if self.r1 is not None:
for rec in self.r2_secondaries:
set_mate_info_on_secondary(secondary=rec, mate_primary=self.r1)
for rec in self.r2_supplementals:
set_mate_info_on_supplementary(supp=rec, mate_primary=self.r1)
if self.r2 is not None:
for rec in self.r1_secondaries:
set_mate_info_on_secondary(secondary=rec, mate_primary=self.r2)
for rec in self.r1_supplementals:
set_mate_info_on_supplementary(supp=rec, mate_primary=self.r2)
return self
Comment on lines +1246 to +1258
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was expecting this:

Suggested change
if self.r1 is not None and self.r2 is not None:
set_mate_info(self.r1, self.r2, is_proper_pair=is_proper_pair, isize=isize)
if self.r1 is not None:
for rec in self.r2_secondaries:
set_mate_info_on_secondary(secondary=rec, mate_primary=self.r1)
for rec in self.r2_supplementals:
set_mate_info_on_supplementary(supp=rec, mate_primary=self.r1)
if self.r2 is not None:
for rec in self.r1_secondaries:
set_mate_info_on_secondary(secondary=rec, mate_primary=self.r2)
for rec in self.r1_supplementals:
set_mate_info_on_supplementary(supp=rec, mate_primary=self.r2)
return self
if self.r1 is not None:
for rec in self.all_r2s:
_set_common_mate_fields(dest=rec, source=self.r1)
if self.r2 is not None:
for rec in self.all_r1s:
_set_common_mate_fields(dest=rec, source=self.r2)
return self

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With my last commit, I still have two separate functions for secondaries and supplementaries. They aren't that different from _set_common_mate_fields except for:

  • Additional checking for if the record is secondary or supplementary for safety sake
  • Proper pair and template length settings on supplementaries (but only for non-secondary supplementaries).


def write_to(
self,
writer: SamFile,
Expand Down
Loading
Loading