Yes, this is to some extent expected, for two reasons:
1. NCBI's CD-Search service searches your query protein sequences against PSSMs (position-specific scoring matrices). In general, this is a more sensitive search than Bakta's default (DIAMOND fast).
2. Thanks to your question, I realized that there is a 2024 COG update that adds ~140 COG clusters compared to the 2020 release that Bakta uses.
So, for the next Bakta database update we will definitely use the new 2024 COG release, which should add a few more good annotations. We will also think about how to improve sensitivity during the pre-annotation of Bakta's database.
Until then, you could try Bakta's new feature that accepts user-provided HMMs (if you have some).
In my humble opinion, using the COG annotations provided by the eggNOG pipeline might be one way to improve COG annotation quality, if you incorporate eggNOG into Bakta in the future (#325).
Hi,
Thanks for a great tool!
I compared Bakta's COG annotations with NCBI CD-Search (https://www.ncbi.nlm.nih.gov/Structure/bwrpsb/bwrpsb.cgi) results for a complete bacterial genome with 1,360 CDSs.
NCBI CD-Search assigned COGs to ~1,200 CDSs, while Bakta assigned COGs to only ~150 CDSs.
(Note that I simply ran grep on Bakta's tsv, gff3, and json output files to count the CDSs that have a COG annotation.)
Is this difference usual and/or expected?
Thanks.
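A plain grep counts matching lines, so it can also pick up non-CDS features or headers. As a cross-check, here is a minimal Python sketch that counts each CDS row once. It rests on assumptions that are not confirmed in this thread: that the feature type is in the second column of Bakta's TSV output, and that COG identifiers appear as cross-references of the form `COG:COG0593`. The function name `count_cds_with_cog` and the sample rows are hypothetical.

```python
import re

# Assumed cross-reference form for COG identifiers, e.g. "COG:COG0593".
# Category-only entries such as "COG:L" are deliberately not counted.
COG_ID = re.compile(r"\bCOG:(COG\d+)\b")


def count_cds_with_cog(tsv_text):
    """Return (total CDS rows, CDS rows with at least one COG ID)."""
    cds_total = 0
    cds_with_cog = 0
    for line in tsv_text.splitlines():
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Assumption: the feature type is the second column.
        if len(fields) < 2 or fields[1].lower() != "cds":
            continue
        cds_total += 1
        if any(COG_ID.search(field) for field in fields):
            cds_with_cog += 1
    return cds_total, cds_with_cog


# Hypothetical sample rows, for illustration only.
sample = (
    "#Sequence Id\tType\tStart\tStop\tStrand\tLocus Tag\tGene\tProduct\tDbXrefs\n"
    "contig_1\tcds\t1\t900\t+\tTAG_1\tdnaA\tDnaA\tCOG:COG0593, COG:L\n"
    "contig_1\tcds\t1000\t1500\t-\tTAG_2\t\thypothetical protein\t\n"
    "contig_1\ttRNA\t1600\t1675\t+\tTAG_3\t\ttRNA-Ala\t\n"
)

total, with_cog = count_cds_with_cog(sample)
print(f"{with_cog} of {total} CDSs carry a COG ID")
```

For a real run, read the TSV file into a string and pass it to the function. If the actual column layout differs, the type check would need adjusting.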