refactor(wip): building within_intron and spliced subsets #134

ns-rse · 2025-01-10T17:12:55Z

Closes #91
Closes #92

Extracts the duplicated code that builds subsets of reads that are within introns and within spliced_3ui and abstracts
out to functions with tests. Currently the tests fail and I don't think they should so I need to work out why and
perhaps add some additional fixtures to use here.

ns-rse · 2025-01-13T12:56:53Z

Hi @IanSudberry,

I've not sussed out why the tests fail yet but have been looking closely at some of the code and logic and have some
questions I was wondering you could help with.

In this PR I've noted...

            # Why add an empty list and append a tuple?
            # Should we not be getting the keys of within_intron to check transcript_id is not in?
            if transcript_id not in within_intron:
                within_intron[transcript_id] = []
            within_intron[transcript_id].append((chromosome, start, end, strand))

...and thinking it through I realised that the reason this might be so is if there is more than one transcript_id
coming out of a given pair_feature. If this is the case the resulting value for a given transcript_id is a list of
tuples. The following dummy example demonstrates this

first = (1, 2, 3, 4)
second = ("a", "b", "c", "d")

test = {}
test["first"] = []
test["first"].append(first)
test["first"].append(second)
test["first"]

[(1, 2, 3, 4), ('a', 'b', 'c', 'd')]

Reading through all_introns_counts_and_info.py I looked for where read1_within_intron / read2_within_intron are
subsequently used and there are two places.

Line 239 - `all_dicts`

This makes a list of four
dictionaries

            all_dicts = [read1_within_intron, read2_within_intron, read1_spliced_3UI, read2_spliced_3UI]

...but all_dicts is never used anywhere (not a problem for now but we can remove it during refactoring).

Lines 246-259 - `assign_conversions_to_retained`

The only other place that read1_within_intron / read2_within_intron are then used is building a set of
assign_conversions_to_retained on lines
246-259...

            # Create a set of tuples: (tx_id,(start,end))
            # Retained
            assign_conversions_to_retained = []

            for transcript_id, start_end_list in read1_within_intron.items():
                for start_end in start_end_list:
                    assign_conversions_to_retained.append((transcript_id, start_end))

            for transcript_id, start_end_list in read2_within_intron.items():
                for start_end in start_end_list:
                    assign_conversions_to_retained.append((transcript_id, start_end))

            # Only need each unique element once
            assign_conversions_to_retained = set(assign_conversions_to_retained)

The for ... loop here uses .items() and so transcript_id is the key and start_end_list is a list, commonly of
length 1, but possibly longer and these are then collected into a list before being de-duplicated.

I'm wondering if we can build a set more directly so wanted to check I've understood the intention correctly which is to
get a set of unique transcript_id and associated chromosome,start,end,strand?

Closes #91 Closes #92 Extracts the duplicated code that builds subsets of reads that are within introns (or span introns) and within spliced_3ui and abstracts out to functions with tests. Currently the tests fail and I don't think they should so I need to work out why and perhaps add some additional fixtures to use here.

ns-rse marked this pull request as draft January 10, 2025 17:13

ns-rse added tests Testing issues refactor Refactoring labels Jan 10, 2025

ns-rse force-pushed the ns-rse/91-conditional-extraction branch from 743b339 to ce1c18d Compare January 13, 2025 13:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(wip): building within_intron and spliced subsets #134

refactor(wip): building within_intron and spliced subsets #134

ns-rse commented Jan 10, 2025

ns-rse commented Jan 13, 2025

refactor(wip): building within_intron and spliced subsets #134

Are you sure you want to change the base?

refactor(wip): building within_intron and spliced subsets #134

Conversation

ns-rse commented Jan 10, 2025

ns-rse commented Jan 13, 2025

Line 239 - all_dicts

Lines 246-259 - assign_conversions_to_retained

Line 239 - `all_dicts`

Lines 246-259 - `assign_conversions_to_retained`