Can't use external clustering: "Exception: Representative gene has not been set" #279
Comments
On my machine, the example in the git repository works perfectly like this:
So the problem really does seem to come from my clustering file.
Thanks,
Hi,
Thanks for raising this issue! It seems like the problem might be due to how the genes are named in your clustering table, which doesn’t match the way PPanGGOLiN expects them. PPanGGOLiN uses the gene ID from the
The ID would be
At the end of the annotation step, PPanGGOLiN checks whether all gene IDs are unique. If they aren’t, it uses internal IDs in the format
To check whether PPanGGOLiN used the annotation file's IDs or generated its own, you can run this command:
If you see
In your case, it looks like the genes in your clustering table follow the pattern
I understand that working with external clustering files can be tricky, especially when PPanGGOLiN uses its own internal IDs. A possible workaround is to run the clustering step with PPanGGOLiN and then generate the
This file lists the gene family ID, the gene ID, and the local ID (which corresponds to the ID in the GFF file). The second and third columns let you map the internal IDs to the CDS IDs from the annotation file. To sum up, the commands would be:
The error you got is quite misleading. Thank you as well for pointing out the inconsistencies in the documentation; I'll fix them. (I've also noticed that the documentation for the
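The ID mapping described in the workaround could be sketched roughly as follows. This is only an illustration under stated assumptions, not PPanGGOLiN code: it assumes the mapping file is a headerless tab-separated table with the columns family ID, internal gene ID, local ID (as described above), and that the external clustering table keeps the gene ID in its second column. The function names `load_id_map` and `remap_clusters` and the IDs in the example are hypothetical.

```python
import csv
import io


def load_id_map(handle):
    """Return {local ID: internal gene ID} from a 3-column TSV.

    Assumed column order: family ID, internal gene ID, local ID.
    """
    id_map = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3 or not row[2]:
            continue  # skip rows without a local ID
        id_map[row[2]] = row[1]
    return id_map


def remap_clusters(handle, id_map, gene_col=1):
    """Translate the gene column of a clustering table.

    Returns (remapped rows, gene IDs that had no entry in id_map).
    gene_col is an assumption about the table layout; adjust as needed.
    """
    remapped, missing = [], []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) <= gene_col:
            continue  # skip empty or short lines
        gene = row[gene_col]
        if gene in id_map:
            row[gene_col] = id_map[gene]
            remapped.append(row)
        else:
            missing.append(gene)
    return remapped, missing


# Tiny in-memory example with made-up IDs.
mapping = io.StringIO("famA\tinternal_1\tcds_1\nfamA\tinternal_2\tcds_2\n")
clusters = io.StringIO("c1\tcds_1\nc1\tcds_3\n")
rows, missing = remap_clusters(clusters, load_id_map(mapping))
print(rows, missing)
```

Any gene reported in `missing` is present in the external clustering but unknown to the pangenome, which is worth investigating before rerunning with `--clusters`.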
Hi,
Thanks for your quick reply! I renamed my proteins like this:
Maybe something is wrong with the format of the names? My IDs are recognized as unique by ppanggolin according to the log:
I can try your workaround and I'll let you know if it works. As a last resort I can run it without providing my clustering results, but I'm trying to save time because I have a huge dataset.
Thanks a lot!
I have found the problem. After using the workaround and generating new clusters from the GFF files, I compared
So the problem was the way I generated my GFFs, which omitted some proteins, and not ppanggolin or the protein IDs.
Thanks, and sorry for this mistake. I can add another comment once I succeed with new GFF files.
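For anyone hitting the same error, a quick consistency check between the GFF files and the clustering table might look like the sketch below. It is an assumption-laden illustration: it takes the `ID` attribute of `CDS` features as the gene identifier (which attribute PPanGGOLiN actually reads should be confirmed in its documentation), it assumes the gene ID is in the second column of the clustering table, and the helper names are hypothetical.

```python
def gff_cds_ids(lines):
    """Collect the ID attribute of every CDS feature in GFF3 lines."""
    ids = set()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # embedded sequences follow; no more features
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        for attr in cols[8].split(";"):
            key, _, value = attr.partition("=")
            if key.strip() == "ID":
                ids.add(value.strip())
    return ids


def cluster_gene_ids(lines, gene_col=1):
    """Collect gene IDs from a tab-separated clustering table."""
    ids = set()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) > gene_col:
            ids.add(cols[gene_col])
    return ids


# Tiny in-memory example with made-up records.
gff = [
    "##gff-version 3\n",
    "contig1\tsrc\tgene\t1\t90\t.\t+\t.\tID=gene_1\n",
    "contig1\tsrc\tCDS\t1\t90\t.\t+\t0\tID=cds_1;product=x\n",
]
clusters = ["fam1\tcds_1\n", "fam1\tcds_2\n"]

in_gff = gff_cds_ids(gff)
in_clusters = cluster_gene_ids(clusters)
print("only in clustering:", sorted(in_clusters - in_gff))
print("only in GFF:", sorted(in_gff - in_clusters))
```

Genes that appear only in the clustering table point to proteins missing from the GFF files, which was the cause in this case.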
It ended up working with my new GFF files. Thanks for the workaround, which helped me debug!
Hello!
I'm trying to use my external clustering results with my dataset like this:
ppanggolin workflow --anno list_gff.tsv -c 15 -o real_test --clusters clusters.tsv --infer_singletons
But I get this error:
File "/clusterfs/jgi/groups/science/homes/eolondela/.micromamba/envs/ppanggo/lib/python3.12/site-packages/ppanggolin/geneFamily.py", line 198, in representative raise Exception("Representative gene has not been set")
I'm using GFF3 files as input and I chose to provide the three-column clustering file as described in the documentation, so the representative genes are indicated in clusters.tsv. I get the same error if I try the two-column file and let ppanggolin take the first gene of the cluster as the representative.
PS: The documentation says that the representative should be in the second column, but it is actually in the last column, as I found out from the cluster.py script. When I strictly followed the documentation, the error stated that protein IDs were duplicated.
I can provide more files if needed. The clustering file is attached, and the same command works if I don't input my clustering.
clusters.tsv.zip
The last lines from the log file were:
Thanks in advance,
Eric