Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(wip): building within_intron and spliced subsets #134

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

ns-rse
Copy link
Contributor

@ns-rse ns-rse commented Jan 10, 2025

Closes #91
Closes #92

Extracts the duplicated code that builds subsets of reads that are within introns and within spliced_3ui and abstracts
out to functions with tests. Currently the tests fail and I don't think they should so I need to work out why and
perhaps add some additional fixtures to use here.

@ns-rse ns-rse marked this pull request as draft January 10, 2025 17:13
@ns-rse ns-rse added tests Testing issues refactor Refactoring labels Jan 10, 2025
@ns-rse
Copy link
Contributor Author

ns-rse commented Jan 13, 2025

Hi @IanSudberry,

I've not sussed out why the tests fail yet but have been looking closely at some of the code and logic and have some
questions I was wondering you could help with.

In this PR I've noted...

            # Why add an empty list and append a tuple?
            # Should we not be getting the keys of within_intron to check transcript_id is not in?
            if transcript_id not in within_intron:
                within_intron[transcript_id] = []
            within_intron[transcript_id].append((chromosome, start, end, strand))

...and thinking it through I realised that the reason this might be so is if there is more than one transcript_id
coming out of a given pair_feature. If this is the case the resulting value for a given transcript_id is a list of
tuples. The following dummy example demonstrates this

first = (1, 2, 3, 4)
second = ("a", "b", "c", "d")

test = {}
test["first"] = []
test["first"].append(first)
test["first"].append(second)
test["first"]

[(1, 2, 3, 4), ('a', 'b', 'c', 'd')]

Reading through all_introns_counts_and_info.py I looked for where read1_within_intron / read2_within_intron are
subsequently used and there are two places.

Line 239 - all_dicts

This makes a list of four
dictionaries

            all_dicts = [read1_within_intron, read2_within_intron, read1_spliced_3UI, read2_spliced_3UI]

...but all_dicts is never used anywhere (not a problem for now but we can remove it during refactoring).

Lines 246-259 - assign_conversions_to_retained

The only other place that read1_within_intron / read2_within_intron are then used is building a set of
assign_conversions_to_retained on lines
246-259
...

            # Create a set of tuples: (tx_id,(start,end))
            # Retained
            assign_conversions_to_retained = []

            for transcript_id, start_end_list in read1_within_intron.items():
                for start_end in start_end_list:
                    assign_conversions_to_retained.append((transcript_id, start_end))

            for transcript_id, start_end_list in read2_within_intron.items():
                for start_end in start_end_list:
                    assign_conversions_to_retained.append((transcript_id, start_end))

            # Only need each unique element once
            assign_conversions_to_retained = set(assign_conversions_to_retained)

The for ... loop here uses .items() and so transcript_id is the key and start_end_list is a list, commonly of
length 1, but possibly longer and these are then collected into a list before being de-duplicated.

I'm wondering if we can build a set more directly so wanted to check I've understood the intention correctly which is to
get a set of unique transcript_id and associated chromosome,start,end,strand?

Closes #91
Closes #92

Extracts the duplicated code that builds subsets of reads that are within introns (or span introns) and within
spliced_3ui and abstracts out to functions with tests. Currently the tests fail and I don't think they should so I need
to work out why and perhaps add some additional fixtures to use here.
@ns-rse ns-rse force-pushed the ns-rse/91-conditional-extraction branch from 743b339 to ce1c18d Compare January 13, 2025 13:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
refactor Refactoring tests Testing issues
Projects
None yet
2 participants