Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add methods to fix mate info on non-primaries and templates #204

Open
wants to merge 42 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 29 commits
Commits
Show all changes
42 commits
Select commit Hold shift + click to select a range
8a4e218
Add a function for determining read pair orientation
clintval Dec 24, 2024
cb734da
chore: remove a newline
clintval Dec 24, 2024
3dd8bc4
docs: tweak an arg description
clintval Dec 24, 2024
e4d076d
docs: make hyperlinks for makedocs
clintval Dec 24, 2024
4ee5878
chore: update name of fn to is_properly_paired
clintval Dec 27, 2024
8bca675
chore: update name of fn to is_proper_pair
clintval Dec 27, 2024
b880b65
Deprecate set_pair_info and _set_mate_info for set_mate_info
clintval Dec 27, 2024
56722b0
chore: fix copy-paste formatting issue
clintval Dec 27, 2024
c74834b
Allow insert size calculation to work on 1 read only
clintval Dec 27, 2024
8c69815
Add methods to fix mate info on non-primaries and templates
clintval Dec 27, 2024
386d058
chore: remove set_as_pairs, unit tests for set_mate_info
clintval Dec 27, 2024
d8b2c42
chore: fixup a tiny regression
clintval Dec 27, 2024
a53f735
chore: deprecation after instead of since
clintval Dec 27, 2024
6f3c514
chore: change param names
clintval Dec 27, 2024
4e64644
chore: make arg more generically typed
clintval Dec 27, 2024
8c4ce6f
feat: only grab the mate cigar if absolutely necessary
clintval Dec 27, 2024
dc17e0d
chore: ensure 100% test coverage on diff
clintval Dec 27, 2024
935184b
chore: remove reference to R2
clintval Dec 27, 2024
0f5200f
Merge remote-tracking branch 'origin/main' into cv_properly_paired
clintval Dec 27, 2024
f7e81c4
Merge branch 'cv_properly_paired' into cv_better_mate_info
clintval Dec 27, 2024
e328f45
Merge remote-tracking branch 'origin/cv_better_mate_info' into cv_isize
clintval Dec 27, 2024
8a4973b
chore: fixup based on review
clintval Dec 27, 2024
f858479
chore: remove extra newline
clintval Dec 27, 2024
06938d7
chore: remove small change in tests
clintval Dec 27, 2024
6dea881
chore: add another test
clintval Dec 27, 2024
0d6ab25
Merge remote-tracking branch 'origin/cv_isize' into cv_fix_mates_seco…
clintval Dec 27, 2024
ce520e1
chore: extract common function to private func
clintval Dec 27, 2024
4c7f8b3
chore: further de-duplicate shared code
clintval Dec 27, 2024
8baaa48
chore: remove deprecated code
clintval Dec 27, 2024
206cec5
Merge remote-tracking branch 'origin/main' into cv_fix_mates_secondar…
clintval Dec 28, 2024
f95cc9e
feat: move the function only to Template
clintval Dec 28, 2024
7c4a938
feat: better specify ValueErrors in funcs
clintval Dec 28, 2024
7a2ca2c
chore: fixup docs a bit more
clintval Dec 28, 2024
dd7dba5
chore: fixup docs even more
clintval Dec 28, 2024
841a44a
fix: fixup remaining remaining loose ends
clintval Dec 28, 2024
7140063
chore: finish adding unit tests
clintval Dec 30, 2024
4f2fbea
docs: fix docstring typo
clintval Dec 30, 2024
ec56350
chore: add unit tests for Template.set_mate_info()
clintval Dec 30, 2024
7116964
chore: fix test typo
clintval Dec 30, 2024
716de79
chore: tweak phrasing in docstrings slightly
clintval Jan 7, 2025
8d2617c
Merge remote-tracking branch 'origin/main' into cv_fix_mates_secondar…
clintval Jan 7, 2025
0649c67
feat: mate info can only come from a primary record
clintval Jan 10, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
340 changes: 312 additions & 28 deletions fgpyo/sam/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -158,10 +158,12 @@
import enum
import io
import sys
from collections.abc import Collection
from itertools import chain
from pathlib import Path
from typing import IO
from typing import Any
from typing import Callable
from typing import Dict
from typing import Iterable
from typing import Iterator
Expand All @@ -176,6 +178,7 @@
from pysam import AlignedSegment
from pysam import AlignmentFile as SamFile
from pysam import AlignmentHeader as SamHeader
from typing_extensions import deprecated

import fgpyo.io
from fgpyo.collections import PeekableIterator
Expand Down Expand Up @@ -607,6 +610,161 @@
return self.elements[index]


@enum.unique
class PairOrientation(enum.Enum):
"""Enumerations of read pair orientations."""

FR = "FR"
"""A pair orientation for forward-reverse reads ("innie")."""

RF = "RF"
"""A pair orientation for reverse-forward reads ("outie")."""

TANDEM = "TANDEM"
"""A pair orientation for tandem (forward-forward or reverse-reverse) reads."""

@classmethod
def from_recs( # noqa: C901 # `from_recs` is too complex (11 > 10)
cls, rec1: AlignedSegment, rec2: Optional[AlignedSegment] = None
) -> Optional["PairOrientation"]:
"""Returns the pair orientation if both reads are mapped to the same reference sequence.

Args:
rec1: The first record in the pair.
rec2: The second record in the pair. If None, then mate info on `rec1` will be used.

See:
[`htsjdk.samtools.SamPairUtil.getPairOrientation()`](https://github.com/samtools/htsjdk/blob/c31bc92c24bc4e9552b2a913e52286edf8f8ab96/src/main/java/htsjdk/samtools/SamPairUtil.java#L71-L102)
"""

if rec2 is None:
rec2_is_unmapped = rec1.mate_is_unmapped
rec2_reference_id = rec1.next_reference_id
else:
rec2_is_unmapped = rec2.is_unmapped
rec2_reference_id = rec2.reference_id

if rec1.is_unmapped or rec2_is_unmapped or rec1.reference_id != rec2_reference_id:
return None

if rec2 is None:
rec2_is_forward = rec1.mate_is_forward
rec2_reference_start = rec1.next_reference_start
else:
rec2_is_forward = rec2.is_forward
rec2_reference_start = rec2.reference_start

if rec1.is_forward is rec2_is_forward:
return PairOrientation.TANDEM
if rec1.is_forward and rec1.reference_start <= rec2_reference_start:
return PairOrientation.FR
if rec1.is_reverse and rec2_reference_start < rec1.reference_end:
return PairOrientation.FR
if rec1.is_reverse and rec2_reference_start >= rec1.reference_end:
return PairOrientation.RF

if rec2 is None:
if not rec1.has_tag("MC"):
raise ValueError('Cannot determine proper pair status without a mate cigar ("MC")!')
rec2_cigar = Cigar.from_cigarstring(str(rec1.get_tag("MC")))
rec2_reference_end = rec1.next_reference_start + rec2_cigar.length_on_target()
else:
rec2_reference_end = rec2.reference_end

if rec1.reference_start < rec2_reference_end:
return PairOrientation.FR
else:
return PairOrientation.RF


def isize(rec1: AlignedSegment, rec2: Optional[AlignedSegment] = None) -> int:
"""Computes the insert size ("template length" or "TLEN") for a pair of records.

Args:
rec1: The first record in the pair.
rec2: The second record in the pair. If None, then mate info on `rec1` will be used.
"""
if rec2 is None:
rec2_is_unmapped = rec1.mate_is_unmapped
rec2_reference_id = rec1.next_reference_id
else:
rec2_is_unmapped = rec2.is_unmapped
rec2_reference_id = rec2.reference_id

if rec1.is_unmapped or rec2_is_unmapped or rec1.reference_id != rec2_reference_id:
return 0

if rec2 is None:
rec2_is_forward = rec1.mate_is_forward
rec2_reference_start = rec1.next_reference_start
else:
rec2_is_forward = rec2.is_forward
rec2_reference_start = rec2.reference_start

if rec1.is_forward and rec2_is_forward:
return rec2_reference_start - rec1.reference_start
if rec1.is_reverse and rec2_is_forward:
return rec2_reference_start - rec1.reference_end

if rec2 is None:
if not rec1.has_tag("MC"):
raise ValueError('Cannot determine proper pair status without a mate cigar ("MC")!')
rec2_cigar = Cigar.from_cigarstring(str(rec1.get_tag("MC")))
rec2_reference_end = rec1.next_reference_start + rec2_cigar.length_on_target()
else:
rec2_reference_end = rec2.reference_end

if rec1.is_forward:
return rec2_reference_end - rec1.reference_start
else:
return rec2_reference_end - rec1.reference_end


DefaultProperlyPairedOrientations: set[PairOrientation] = {PairOrientation.FR}
"""The default orientations for properly paired reads."""


def is_proper_pair(
rec1: AlignedSegment,
rec2: Optional[AlignedSegment] = None,
max_insert_size: int = 1000,
orientations: Collection[PairOrientation] = DefaultProperlyPairedOrientations,
isize: Callable[[AlignedSegment, AlignedSegment], int] = isize,
) -> bool:
"""Determines if a pair of records are properly paired or not.

Criteria for records in a proper pair are:
- Both records are aligned
- Both records are aligned to the same reference sequence
- The pair orientation of the records is one of the valid pair orientations (default "FR")
- The inferred insert size is not more than a maximum length (default 1000)

Args:
rec1: The first record in the pair.
rec2: The second record in the pair. If None, then mate info on `rec1` will be used.
max_insert_size: The maximum insert size to consider a pair "proper".
orientations: The valid set of orientations to consider a pair "proper".
isize: A function that takes the two alignments and calculates their isize.

See:
[`htsjdk.samtools.SamPairUtil.isProperPair()`](https://github.com/samtools/htsjdk/blob/c31bc92c24bc4e9552b2a913e52286edf8f8ab96/src/main/java/htsjdk/samtools/SamPairUtil.java#L106-L125)
"""
if rec2 is None:
rec2_is_mapped = rec1.mate_is_mapped
rec2_reference_id = rec1.next_reference_id
else:
rec2_is_mapped = rec2.is_mapped
rec2_reference_id = rec2.reference_id

return (
rec1.is_mapped
and rec2_is_mapped
and rec1.reference_id == rec2_reference_id
and PairOrientation.from_recs(rec1=rec1, rec2=rec2) in orientations
and 0 < abs(isize(rec1, rec2)) <= max_insert_size
)


@attr.s(frozen=True, auto_attribs=True)
class SupplementaryAlignment:
"""Stores a supplementary alignment record produced by BWA and stored in the SA SAM tag.
Expand Down Expand Up @@ -688,48 +846,145 @@
return []


def isize(r1: AlignedSegment, r2: AlignedSegment) -> int:
"""Computes the insert size for a pair of records."""
if r1.is_unmapped or r2.is_unmapped or r1.reference_id != r2.reference_id:
return 0
else:
r1_pos = r1.reference_end if r1.is_reverse else r1.reference_start
r2_pos = r2.reference_end if r2.is_reverse else r2.reference_start
return r2_pos - r1_pos
def sum_of_base_qualities(rec: AlignedSegment, min_quality_score: int = 15) -> int:
"""Calculate the sum of base qualities score for an alignment record.

This function is useful for calculating the "mate score" as implemented in samtools fixmate.

Args:
rec: The alignment record to calculate the sum of base qualities from.
min_quality_score: The minimum base quality score to use for summation.

See:
[`calc_sum_of_base_qualities()`](https://github.com/samtools/samtools/blob/4f3a7397a1f841020074c0048c503a01a52d5fa2/bam_mate.c#L227-L238)
[`MD_MIN_QUALITY`](https://github.com/samtools/samtools/blob/4f3a7397a1f841020074c0048c503a01a52d5fa2/bam_mate.c#L42)
"""
score: int = sum(qual for qual in rec.query_qualities if qual >= min_quality_score)
return score


def _set_common_mate_fields(dest: AlignedSegment, source: AlignedSegment) -> None:
"""Set common mate info an alignment to its mate's primary alignment.

Args:
dest: The alignment to set the mate info upon.
source: The primary alignment to use as a mate reference.
"""
if source.is_read1 is dest.is_read1 or source.is_secondary or source.is_supplementary:
clintval marked this conversation as resolved.
Show resolved Hide resolved
raise ValueError("Mate info must be set from a primary of the next ordinal!")

Check warning on line 874 in fgpyo/sam/__init__.py

View check run for this annotation

Codecov / codecov/patch

fgpyo/sam/__init__.py#L874

Added line #L874 was not covered by tests
if source.query_name != dest.query_name:
raise ValueError("Cannot set mate info on alignments with different query names!")

Check warning on line 876 in fgpyo/sam/__init__.py

View check run for this annotation

Codecov / codecov/patch

fgpyo/sam/__init__.py#L876

Added line #L876 was not covered by tests

dest.next_reference_id = source.reference_id
dest.next_reference_name = source.reference_name
dest.next_reference_start = source.reference_start
dest.mate_is_forward = source.is_forward
dest.mate_is_mapped = source.is_mapped
dest.set_tag("MC", source.cigarstring)
dest.set_tag("MQ", source.mapping_quality)
dest.set_tag("ms", sum_of_base_qualities(source))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank-you!



def set_mate_info(
r1: AlignedSegment,
r2: AlignedSegment,
is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair,
) -> None:
"""Resets mate pair information between reads in a pair.

Args:
r1: Read 1 (first read in the template).
r2: Read 2 with the same query name as r1 (second read in the template).
is_proper_pair: A function that takes the two alignments and determines proper pair status.
"""
if r1.query_name != r2.query_name:
raise ValueError("Cannot set mate info on alignments with different query names!")

for dest, source in [(r1, r2), (r2, r1)]:
_set_common_mate_fields(dest=dest, source=source)

template_length = isize(r1, r2)
r1.template_length = template_length
r2.template_length = -template_length

proper_pair = is_proper_pair(r1, r2)
r1.is_proper_pair = proper_pair
r2.is_proper_pair = proper_pair


def set_mate_info_on_secondary(
secondary: AlignedSegment,
mate_primary: AlignedSegment,
is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair,
) -> None:
"""Set mate info on a secondary alignment to its mate's primary alignment.

Args:
secondary: The secondary alignment to set mate information upon.
mate_primary: The primary alignment of the secondary's mate.
is_proper_pair: A function that takes the two alignments and determines proper pair status.

Raises:
ValueError: If primary and secondary are of the same read ordinal.
ValueError: If the primary is marked as either secondary or supplementary.
ValueError: If the secondary is not marked as secondary.
"""
if not secondary.is_secondary:
raise ValueError("Cannot set mate info on an alignment not marked as secondary!")

Check warning on line 933 in fgpyo/sam/__init__.py

View check run for this annotation

Codecov / codecov/patch

fgpyo/sam/__init__.py#L933

Added line #L933 was not covered by tests

_set_common_mate_fields(dest=secondary, source=mate_primary)

Check warning on line 935 in fgpyo/sam/__init__.py

View check run for this annotation

Codecov / codecov/patch

fgpyo/sam/__init__.py#L935

Added line #L935 was not covered by tests

# NB: calculate isize and proper pair as if this secondary alignment was the primary alignment.
secondary.is_proper_pair = is_proper_pair(mate_primary, secondary)
secondary.template_length = isize(mate_primary, secondary)

Check warning on line 939 in fgpyo/sam/__init__.py

View check run for this annotation

Codecov / codecov/patch

fgpyo/sam/__init__.py#L938-L939

Added lines #L938 - L939 were not covered by tests


def set_mate_info_on_supplementary(supp: AlignedSegment, mate_primary: AlignedSegment) -> None:
"""Set mate info on a supplementary alignment to its mate's primary alignment.

Args:
supp: The supplementary alignment to set mate information upon.
mate_primary: The primary alignment of the supplementary's mate.

Raises:
ValueError: If primary and secondary are of the same read ordinal.
ValueError: If the primary is marked as either secondary or supplementary.
ValueError: If the secondary is not marked as secondary.
"""
if not supp.is_supplementary:
raise ValueError("Cannot set mate info on an alignment not marked as supplementary!")

Check warning on line 955 in fgpyo/sam/__init__.py

View check run for this annotation

Codecov / codecov/patch

fgpyo/sam/__init__.py#L955

Added line #L955 was not covered by tests

_set_common_mate_fields(dest=supp, source=mate_primary)

Check warning on line 957 in fgpyo/sam/__init__.py

View check run for this annotation

Codecov / codecov/patch

fgpyo/sam/__init__.py#L957

Added line #L957 was not covered by tests

# NB: for a non-secondary supplemental alignment, set the following to the same as the primary.
if not supp.is_secondary:
supp.is_proper_pair = mate_primary.is_proper_pair
supp.template_length = -mate_primary.template_length

Check warning on line 962 in fgpyo/sam/__init__.py

View check run for this annotation

Codecov / codecov/patch

fgpyo/sam/__init__.py#L961-L962

Added lines #L961 - L962 were not covered by tests

Comment on lines +965 to 968
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think template-length is correct, since the mate_primary.template_length is calculated from the primary relative to supp. Should we recalculate it, or should we set it to zero since it's not the primary-primary alignment?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm having a hard time seeing how this is wrong. Really wish we could whiteboard!

Here, "mate primary" is the primary alignment of the supplemental's mate. What we want to do is have the supplement's template length be equal to the primary of the supplement (which we unfortunately do not have access to). However, the implementation in fgpyo (and elsewhere) flips the sign of the template length for mates. So, if we flip the sign of the supplemental's mate primary, then we should have the value that would have been set upon the supplemental's primary.

For secondary supplementals (and for secondaries in general) I am not setting template length anywhere because I feel the specification leaves this ambiguous, and a smarter aligner (or custom code) might set specific template lengths on the secondaries and I don't want to override that info. I feel the spec. is ambiguous for secondaries because I don't see reference to RNAME/RNEXT in the definition of TLEN.


@deprecated("Use `set_mate_info()` instead. Deprecated after fgpyo 0.8.0.")
def set_pair_info(r1: AlignedSegment, r2: AlignedSegment, proper_pair: bool = True) -> None:
"""Resets mate pair information between reads in a pair. Requires that both r1
and r2 are mapped. Can be handed reads that already have pairing flags setup or
independent R1 and R2 records that are currently flagged as SE reads.
"""Resets mate pair information between reads in a pair.

Requires that both r1 and r2 are mapped. Can be handed reads that already have pairing flags
setup or independent R1 and R2 records that are currently flagged as SE reads.

Args:
r1: read 1
r2: read 2 with the same queryname as r1
r1: Read 1 (first read in the template).
r2: Read 2 with the same query name as r1 (second read in the template).
proper_pair: whether the pair is proper or not.
"""
assert not r1.is_unmapped, f"Cannot process unmapped mate {r1.query_name}/1"
assert not r2.is_unmapped, f"Cannot process unmapped mate {r2.query_name}/2"
assert r1.query_name == r2.query_name, "Attempting to pair reads with different qnames."
if r1.query_name != r2.query_name:
raise ValueError("Cannot pair reads with different query names!")

for r in [r1, r2]:
r.is_paired = True
r.is_proper_pair = proper_pair

r1.is_read1 = True
r1.is_read2 = False
r2.is_read2 = True
r2.is_read1 = False

for src, dest in [(r1, r2), (r2, r1)]:
dest.next_reference_id = src.reference_id
dest.next_reference_start = src.reference_start
dest.mate_is_reverse = src.is_reverse
dest.mate_is_unmapped = False
dest.set_tag("MC", src.cigarstring)

insert_size = isize(r1, r2)
r1.template_length = insert_size
r2.template_length = -insert_size
set_mate_info(r1=r1, r2=r2, is_proper_pair=lambda a, b: proper_pair)


@attr.s(frozen=True, auto_attribs=True)
Expand Down Expand Up @@ -971,6 +1226,10 @@
for rec in recs:
yield rec

def set_mate_info(self) -> "Template":
"""Reset all mate information on every record in a template."""
return set_mate_info_for_template(self)

Check warning on line 1231 in fgpyo/sam/__init__.py

View check run for this annotation

Codecov / codecov/patch

fgpyo/sam/__init__.py#L1231

Added line #L1231 was not covered by tests

def write_to(
self,
writer: SamFile,
Expand Down Expand Up @@ -1029,6 +1288,31 @@
return Template.build(recs, validate=False)


def set_mate_info_for_template(
clintval marked this conversation as resolved.
Show resolved Hide resolved
template: Template,
is_proper_pair: Callable[[AlignedSegment, AlignedSegment], bool] = is_proper_pair,
) -> Template:
"""Reset all mate information on every record in a template.

Args:
template: The template of alignments to reset all mate information on.
is_proper_pair: A function that takes two alignments and determines proper pair status.
"""
if template.r1 is not None and template.r2 is not None:
set_mate_info(template.r1, template.r2, is_proper_pair=is_proper_pair)

Check warning on line 1302 in fgpyo/sam/__init__.py

View check run for this annotation

Codecov / codecov/patch

fgpyo/sam/__init__.py#L1302

Added line #L1302 was not covered by tests
if template.r1 is not None:
for rec in template.r2_secondaries:
set_mate_info_on_secondary(rec, template.r1, is_proper_pair=is_proper_pair)

Check warning on line 1305 in fgpyo/sam/__init__.py

View check run for this annotation

Codecov / codecov/patch

fgpyo/sam/__init__.py#L1305

Added line #L1305 was not covered by tests
for rec in template.r2_supplementals:
set_mate_info_on_supplementary(rec, template.r1)

Check warning on line 1307 in fgpyo/sam/__init__.py

View check run for this annotation

Codecov / codecov/patch

fgpyo/sam/__init__.py#L1307

Added line #L1307 was not covered by tests
if template.r2 is not None:
for rec in template.r1_secondaries:
set_mate_info_on_secondary(rec, template.r2, is_proper_pair=is_proper_pair)

Check warning on line 1310 in fgpyo/sam/__init__.py

View check run for this annotation

Codecov / codecov/patch

fgpyo/sam/__init__.py#L1310

Added line #L1310 was not covered by tests
for rec in template.r1_supplementals:
set_mate_info_on_supplementary(rec, template.r2)
return template

Check warning on line 1313 in fgpyo/sam/__init__.py

View check run for this annotation

Codecov / codecov/patch

fgpyo/sam/__init__.py#L1312-L1313

Added lines #L1312 - L1313 were not covered by tests


class SamOrder(enum.Enum):
"""
Enumerations of possible sort orders for a SAM file.
Expand Down
Loading
Loading