Family Threshold #307

Open
ShuyiF opened this issue Dec 19, 2024 · 1 comment


ShuyiF commented Dec 19, 2024

Hi!

Thank you for developing this efficient tool!

I am using PPanGGOLiN to build a pangenome. Is there a parameter for the protein family sequence identity threshold that I can tune, and what is its default value? Compared with other pangenome tools, PPanGGOLiN gives me a higher number of gene families, so I wonder whether this is caused by a higher default family threshold in PPanGGOLiN.

Looking forward to your reply!

Thank you!

@JeanMainguy
Member

Hi,

Thank you for your kind words and for using PPanGGOLiN!

Yes, there are parameters you can adjust related to the protein family clustering threshold. In both the ppanggolin all command and the ppanggolin cluster command (if you're running a step-by-step analysis), you’ll find these options:

  --coverage COVERAGE   Minimal coverage of the alignment for two proteins to be in the same cluster. Default: 0.8  
  --identity IDENTITY   Minimal identity percentage for two proteins to be in the same cluster. Default: 0.8  

By default, both the identity and coverage thresholds are set to 0.8, meaning 80% identity and 80% coverage. These values control the clustering process, which is performed using MMseqs2.
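
To see how the identity threshold affects the number of families, one option is to rerun the analysis at a few different values and compare the results. The snippet below is only a dry-run sketch that prints the commands instead of running them. The input list `genomes.fasta.list`, the `pangenome_id*` output names and the values 0.5 and 0.95 are hypothetical, and the `--fasta` and `--output` options are assumed here, so please check them against `ppanggolin all --help` for your installed version.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print one `ppanggolin all` command per identity threshold.
# File and directory names below are hypothetical placeholders.

GENOMES_LIST="genomes.fasta.list"   # hypothetical list of input genomes
COVERAGE=0.8                        # default coverage quoted above

# Build the command line for the identity threshold given as $1 (0-1 scale).
ppanggolin_cmd() {
    echo "ppanggolin all --fasta ${GENOMES_LIST} --output pangenome_id$1 --identity $1 --coverage ${COVERAGE}"
}

# 0.8 is the default; 0.5 and 0.95 are illustrative values only.
for identity in 0.5 0.8 0.95; do
    ppanggolin_cmd "$identity"
done
```

Once the printed commands look right, they can be run one by one. A lower identity threshold should merge more proteins into the same family, which makes it easier to compare against a tool that clusters at a different threshold.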

On top of that, PPanGGOLiN includes a defragmentation step after the initial clustering. This step helps to reassign fragmented genes to existing families, preventing them from being clustered alone. Check out the documentation for more detail on this step: https://ppanggolin.readthedocs.io/en/latest/user/PangenomeAnalyses/pangenomeAnalyses.html#defragmentation

The higher number of families may simply be due to differences in clustering strategies between the tools. By the way, which tool did you compare PPanGGOLiN with? Perhaps @ggautreau has additional insight on this topic.

Best
