Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix error when using an external clustering #278

Merged
merged 14 commits into from
Sep 13, 2024
Merged

Conversation

JeanMainguy
Copy link
Member

@JeanMainguy JeanMainguy commented Sep 3, 2024

This PR addresses a few issues that users encountered when running the cluster command with an external cluster file.

Fixes made:

  • Avoid unnecessary gene sequence loading: Previously, gene sequences were being requested and loaded even when using an external cluster file. This was a problem if the pangenome didn’t have sequences.
  • Ensure correct data types for gene and family IDs: When gene IDs were purely numeric, pandas was reading them as integers, causing mismatches since gene and family IDs should be treated as strings. We now explicitly ensure these columns are read as strings, avoiding any confusion.
  • Clear error for missing genes: If a gene from the cluster file wasn’t found in the pangenome, the code didn’t raise an error immediately. This led to confusing errors later. Now a clear error is raised.
  • Clear error for singleton: When a singleton family is created and that a family with the same name is already contained in the pangenome, an unclear error was raised. Now the error states clearly the problem.
  • Fix inconsistencies in documentation: The column order for external clustering described in the documentation was inconsistent with the implementation in the code, as highlighted in issue Can't use external clustering: "Exception: Representative gene has not been set" #279. This is now fixed. The correct order is: family_id gene_id representative id. The families_tsv documentation was also out of date.

@JeanMainguy JeanMainguy marked this pull request as ready for review September 10, 2024 13:53
Copy link
Member

@axbazin axbazin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the documentation, we do not discuss the case where we provide 3 columns where it is "family_id gene_id fragmentation" in the doc. Is this voluntary and we have it only to support the old format, or do we want users to use this possibility of giving external clusterings with the info of fragmentation, but without the info of family representative?

docs/user/PangenomeAnalyses/pangenomeStat.md Outdated Show resolved Hide resolved
@@ -447,11 +498,19 @@ def read_clustering(pangenome: Pangenome, families_tsv_path: Path, infer_singlet
:param disable_bar: Allow to disable progress bar
"""
check_pangenome_former_clustering(pangenome, force)
check_pangenome_info(pangenome, need_annotations=True, need_gene_sequences=True, disable_bar=disable_bar)

if pangenome.status["geneSequences"] == "No":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we're in a "read external cluster' case, shouldn't we just not load gene sequences even if there are some?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used to add protein sequences of representative genes at the end.
It's used when context and fasta commands are apply to the pangenome I believe.
Since pangenomes made with internal clustering already contain these sequences, I guess it is better to also have them in pangenomes generated with external clustering when it is possible to ensures that both types of pangenomes behave consistently...

<path>/ppanggolin/cluster/cluster.py:440: DtypeWarning:

Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
@JeanMainguy
Copy link
Member Author

I updated the documentation to clarify that we are also taking cluster file made of 3 columns with : famille, gene and fragmentation.

@axbazin axbazin merged commit a46a2ef into dev Sep 13, 2024
5 checks passed
@axbazin axbazin deleted the fix_read_clustering_file branch September 13, 2024 08:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants