Fix error when using an external clustering #278

JeanMainguy · 2024-09-03T11:36:08Z

This PR addresses a few issues that users encountered when running the cluster command with an external cluster file.

Fixes made:

Avoid unnecessary gene sequence loading: Previously, gene sequences were being requested and loaded even when using an external cluster file. This was a problem if the pangenome didn’t have sequences.
Ensure correct data types for gene and family IDs: When gene IDs were purely numeric, pandas was reading them as integers, causing mismatches since gene and family IDs should be treated as strings. We now explicitly ensure these columns are read as strings, avoiding any confusion.
Clear error for missing genes: If a gene from the cluster file wasn’t found in the pangenome, the code didn’t raise an error immediately. This led to confusing errors later. Now a clear error is raised.
Clear error for singleton: When a singleton family was created and a family with the same name already existed in the pangenome, an unclear error was raised. The error now states the problem clearly.
Fix inconsistencies in documentation: The column order for external clustering described in the documentation was inconsistent with the implementation in the code, as highlighted in issue #279 (Can't use external clustering: "Exception: Representative gene has not been set"). This is now fixed. The correct order is: family_id gene_id representative id. The families_tsv documentation was also out of date and has been updated.
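The data type and missing gene fixes described above can be illustrated with a small standalone sketch. It is not the code from this PR: the column names, the sample data, and the helper `find_missing_genes` are hypothetical, and only `pandas.read_csv` with `dtype=str` is a real library call.

```python
import io

import pandas as pd

# Hypothetical two-column cluster file (family_id, gene_id) with purely numeric IDs.
TSV = "001\t101\n001\t102\n002\t103\n"
COLUMNS = ["family", "gene"]  # illustrative names, not taken from PPanGGOLiN

# With type inference, numeric-looking IDs become integers and lose leading zeros.
inferred = pd.read_csv(io.StringIO(TSV), sep="\t", header=None, names=COLUMNS)

# With dtype=str, the IDs are kept exactly as written in the file.
as_text = pd.read_csv(io.StringIO(TSV), sep="\t", header=None, names=COLUMNS, dtype=str)


def find_missing_genes(cluster_df, known_gene_ids):
    """Return the gene IDs of the cluster file that are absent from known_gene_ids."""
    return sorted(set(cluster_df["gene"]) - set(known_gene_ids))


def check_genes_exist(cluster_df, known_gene_ids):
    """Fail early, with an explicit message, if a clustered gene is unknown."""
    missing = find_missing_genes(cluster_df, known_gene_ids)
    if missing:
        raise KeyError(f"Genes from the cluster file not found in the pangenome: {missing}")
```

Calling `check_genes_exist(as_text, {"101", "102"})` raises a `KeyError` naming gene `103` right away, instead of letting a later step fail with a less informative message.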

axbazin

The documentation does not cover the case where the file has 3 columns in the form "family_id gene_id fragmentation". Is this intentional, kept only to support the old format, or do we want users to be able to provide external clusterings with the fragmentation information but without the family representative?

docs/user/PangenomeAnalyses/pangenomeStat.md

axbazin · 2024-09-11T13:41:50Z

ppanggolin/cluster/cluster.py

@@ -447,11 +498,19 @@ def read_clustering(pangenome: Pangenome, families_tsv_path: Path, infer_singlet
    :param disable_bar: Allow to disable progress bar
    """
    check_pangenome_former_clustering(pangenome, force)
-    check_pangenome_info(pangenome, need_annotations=True, need_gene_sequences=True, disable_bar=disable_bar)
+
+    if pangenome.status["geneSequences"] == "No":


Since we're in a "read external cluster" case, shouldn't we just skip loading gene sequences even if there are some?

This is used to add protein sequences of representative genes at the end.
They are used when the context and fasta commands are applied to the pangenome, I believe.
Since pangenomes made with internal clustering already contain these sequences, I guess it is better to also have them in pangenomes generated with external clustering when possible, to ensure that both types of pangenomes behave consistently...
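The behaviour discussed in this thread (load gene sequences only when the pangenome has them) can be sketched as follows. This is an illustration under assumptions, not the merged code: `loading_options` is a hypothetical helper, the status dictionary is a plain stand-in for `pangenome.status`, and only the `"No"` value and the `need_annotations` / `need_gene_sequences` keyword names come from the diff above.

```python
from typing import Dict


def loading_options(status: Dict[str, str]) -> Dict[str, bool]:
    """Build keyword arguments for a check_pangenome_info-style call.

    Gene sequences are requested only when the status says they exist,
    so a pangenome without sequences can still read an external clustering.
    """
    # Assumption: any value other than "No" means sequences are available.
    has_sequences = status.get("geneSequences", "No") != "No"
    return {"need_annotations": True, "need_gene_sequences": has_sequences}


# Pangenome without sequences: they are not requested.
without_sequences = loading_options({"geneSequences": "No"})

# Pangenome with sequences ("present" is a placeholder status value).
with_sequences = loading_options({"geneSequences": "present"})
```

With this shape, representative protein sequences can still be added when sequences are present, which keeps pangenomes built from external and internal clustering consistent.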

JeanMainguy · 2024-09-13T08:01:49Z

I updated the documentation to clarify that we also accept cluster files made of 3 columns: family, gene and fragmentation.

JeanMainguy added 2 commits September 3, 2024 13:17

fix type of dataframe and raise when gene not found in pangenome

a6e6cd7

improve error message

143e0d4

JeanMainguy changed the base branch from master to dev September 3, 2024 11:36

JeanMainguy marked this pull request as draft September 3, 2024 11:38

JeanMainguy mentioned this pull request Sep 5, 2024

Can't use external clustering: "Exception: Representative gene has not been set" #279

Closed

JeanMainguy added 9 commits September 5, 2024 15:59

fix doc inconsistencies

0f2cbda

fix families_tsv documentation

5061f36

improve error message

d23aacb

improve error message and better management of singleton

c850e62

make flag more explicit

662fc26

improve error message

92bd359

load seq when necessary and use compress tmp file with dna to aa rep seq

3989258

improve error message

f7e3994

prevent putative error

c371cef

JeanMainguy marked this pull request as ready for review September 10, 2024 13:53

axbazin reviewed Sep 11, 2024

JeanMainguy added 3 commits September 12, 2024 16:55

Use dtype to prevent pandas warning

e422c68

<path>/ppanggolin/cluster/cluster.py:440: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.

rm mmseq format ref as it is not true anymore

dcb30e5

clarify external cluster doc

626d4d0

axbazin merged commit a46a2ef into dev Sep 13, 2024
5 checks passed

axbazin deleted the fix_read_clustering_file branch September 13, 2024 08:51

JeanMainguy mentioned this pull request Sep 13, 2024

Merge dev branch into master to release version 2.1.2 #282

Merged
