-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added Alignment
class
#41
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking really good, a number of comments to address and then I'll take one final look!
@@ -0,0 +1,232 @@ | |||
from typing import Callable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd move this to fgpyo/alignment.py
and then move the tests similarly. Any reason to keep it this way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't forget to add module docs. Can you also add an example or two of the padded string method? See examples in other modules:
Lines 16 to 49 in 5ef0f5e
Examples | |
~~~~~~~~ | |
.. code-block:: python | |
>>> from fgpyo.read_structure import ReadStructure | |
>>> rs = ReadStructure.from_string("75T8B75T") | |
>>> [str(segment) for segment in rs] | |
'75T', '8B', '75T' | |
>>> rs[0] | |
ReadSegment(offset=0, length=75, kind=<SegmentType.Template: 'T'>) | |
>>> rs = rs.with_variable_last_segment() | |
>>> [str(segment) for segment in rs] | |
['75T', '8B', '+T'] | |
>>> rs[-1] | |
ReadSegment(offset=83, length=None, kind=<SegmentType.Template: 'T'>) | |
>>> rs = ReadStructure.from_string("1B2M+T") | |
>>> [s.bases for s in rs.extract("A"*6)] | |
['A', 'AA', 'AAA'] | |
>>> [s.bases for s in rs.extract("A"*5)] | |
['A', 'AA', 'AA'] | |
>>> [s.bases for s in rs.extract("A"*4)] | |
['A', 'AA', 'A'] | |
>>> [s.bases for s in rs.extract("A"*3)] | |
['A', 'AA', ''] | |
>>> rs.template_segments() | |
(ReadSegment(offset=3, length=None, kind=<SegmentType.Template: 'T'>),) | |
>>> [str(segment) for segment in rs.template_segments()] | |
['+T'] | |
>>> try: | |
... ReadStructure.from_string("23T2TT23T") | |
... except ValueError as ex: | |
... print(str(ex)) | |
... | |
Read structure missing length information: 23T2T[T]23T |
from typing import Callable | |
""" | |
Classes for representing Alignments | |
----------------------------------- | |
<INSERT an example or TWO HERE> | |
Module Contents | |
~~~~~~~~~~~~~~~ | |
The module contains the following public classes: | |
- :class:`~fgpyo.alignment.Alignment` -- Describes the alignment between two sequences | |
""" | |
from typing import Callable |
target: the target sequence | ||
query_start: the 1-based position in the query sequence where the alignment begins | ||
target_start: the 1-based position in the target sequence where the alignment begins | ||
cigar: a Cigar object describing the alignment of the two sequences |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend not to use "class" or "object" in docstrings since they we know they are classes or objects (since that's what we're document)
cigar: a Cigar object describing the alignment of the two sequences | |
cigar: the Cigar describing the alignment of the two sequences |
|
||
class Alignment: | ||
""" | ||
A general class to describe the alignment between two sequences or partial ranges thereof |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A general class to describe the alignment between two sequences or partial ranges thereof | |
Describes the alignment between two sequences. |
""" | ||
return self.target_start + self.cigar.length_on_target() - 1 | ||
|
||
def padded_string( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the main reason this class is here, so I'd highlight it in the module docs in an example
mismatch_char: str = ".", | ||
gap_char: str = " ", | ||
pad_char: str = "-", | ||
) -> List[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prefer a tuple when we have a fixed # of items returned (or use a namedtuple/dataclass, but a tuple is best here). That allows us to unpack on assignment: q, a, t = alignment.padded_string(...)
) -> List[str]: | |
) -> Tuple[str, str, str]: |
if not ( | ||
self.query_start <= start <= self.query_end | ||
and self.query_start <= end <= self.query_end | ||
): | ||
raise ValueError("start or end is outside of aligned region of target sequence") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's more useful to have the conditions separated so you know if it start or end that's invalid. And why not use assert
?
if not ( | |
self.query_start <= start <= self.query_end | |
and self.query_start <= end <= self.query_end | |
): | |
raise ValueError("start or end is outside of aligned region of target sequence") | |
assert self.query_start <= start <= self.query_end, f"start '{start}' is outside of aligned region of query sequence " | |
assert self.query_start <= end <= self.query_end, f"end '{end}' is outside of aligned region of query sequence " |
Also, we carried over a bug from fgbio (it's query, not target): fulcrumgenomics/fgbio#914
Returns: | ||
A new Alignment with updated coordinates and cigar. | ||
""" | ||
if not ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
elems.append(CigarElement(length, elem.operator)) | ||
curr_start += length | ||
|
||
return Alignment( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you make this an attr class, you can do attr.evolve
:)
start, end, self.target_start, lambda elem: elem.operator.consumes_reference | ||
) | ||
|
||
def sub( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make "private" as the docstring suggest
def sub( | |
def _sub( |
@@ -332,7 +332,7 @@ class CigarOp(enum.Enum): | |||
code (int): The :py:mod:`~pysam` cigar operator code. | |||
character (int): The single character cigar operator. | |||
consumes_query (bool): True if this operator consumes query bases, False otherwise. | |||
consumes_target (bool): True if this operator consumes target bases, False otherwise. | |||
consumes_reference (bool): True if this operator consumes target bases, False otherwise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consumes_reference (bool): True if this operator consumes target bases, False otherwise. | |
consumes_reference (bool): True if this operator consumes reference bases, False otherwise. |
Replicated the
Alignment
class fromfgbio