A question on COG annotation - Fewer number of CDSs seems to be assigned to COGs when compared to NCBI CD-Search #350

Closed
ilnamkang opened this issue Dec 5, 2024 · 2 comments
Labels
question Further information is requested

Comments

@ilnamkang

Hi,

Thanks for a great tool!

I've compared COG annotation of Bakta with NCBI CD-Search (https://www.ncbi.nlm.nih.gov/Structure/bwrpsb/bwrpsb.cgi) results for a complete bacterial genome with 1,360 CDSs.

NCBI CD-Search assigned COGs to ~1,200 CDSs, while Bakta assigned COGs to only ~150 CDSs.
(Note that I just used the grep command for tsv, gff3, and json files from Bakta, to check the number of CDSs having COG annotation.)
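For anyone repeating this count, a plain grep can over- or under-count, for example by matching non-CDS features or by counting several matching lines per feature in the JSON output. A minimal Python sketch of a stricter count on the TSV file follows. It assumes a tab-separated layout with the feature type in the second column and COG identifiers of the form `COG` plus four digits somewhere in the row; the function name `count_cds_with_cog`, the column position, and the example rows are illustrative assumptions, not taken from Bakta's documentation.

```python
import re

# Matches COG cluster identifiers such as COG0593
# (assumed ID shape: "COG" followed by four digits).
COG_ID = re.compile(r"\bCOG\d{4}\b")


def count_cds_with_cog(lines, type_col=1):
    """Return (cds_with_cog, total_cds) for tab-separated annotation lines.

    Assumes the feature type sits in column `type_col` and that COG IDs
    appear somewhere in the row; both are assumptions about the TSV layout.
    """
    total = with_cog = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= type_col or fields[type_col].lower() != "cds":
            continue  # only count coding sequences
        total += 1
        if any(COG_ID.search(field) for field in fields):
            with_cog += 1
    return with_cog, total


# Made-up rows for illustration only; real files would be read with open(...).
example = [
    "#Sequence Id\tType\tStart\tStop\tStrand\tLocus Tag\tGene\tProduct\tDbXrefs",
    "contig_1\tcds\t1\t900\t+\tTAG_0001\tdnaA\tDnaA protein\tCOG:COG0593, COG:L",
    "contig_1\tcds\t950\t1500\t-\tTAG_0002\t\thypothetical protein\t",
    "contig_1\ttRNA\t1600\t1675\t+\tTAG_0003\t\ttRNA-Ala\t",
]
print(count_cds_with_cog(example))
```

Counting per feature row rather than per matching line keeps the number directly comparable to the total CDS count of the genome.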

Is this difference usual and/or expected?

Thanks.

@oschwengers oschwengers added the question Further information is requested label Dec 5, 2024
@oschwengers
Owner

Hi and thanks a lot for asking.

Yes, this is to some extent expected, for two reasons:

  1. NCBI's CD-Search service searches your query protein sequences against PSSMs (position-specific scoring matrices). In general, this is a more sensitive search than Bakta's default (DIAMOND fast).
  2. Thanks to your question, I realized that there is a COG 2024 update adding ~140 COG clusters compared to the 2020 update that Bakta uses.

So, for the next Bakta database update, we will definitely switch to the 2024 COG update, which should add a few good annotations. We will also think about how to improve sensitivity during the pre-annotation of Bakta's database.

Until then, you could try Bakta's new feature that accepts user-provided HMMs (if you have some).

@ilnamkang
Author

Thank you for a detailed explanation.

In my humble opinion, using the COG annotation provided by the eggNOG pipeline might be one option for improving COG annotation quality, if you incorporate eggNOG into Bakta in the future (#325).

Thanks.
