The tool works great using Docker, but the naming is not very nice when starting from a Flye assembly, where contigs are named arbitrarily.
I found --replicons great in that regard, but it requires creating a TSV file upfront, which is not easy when looping through many assemblies in an integrated pipeline (dozens of assemblies in a row).
Would it be possible to create the --replicons input file internally from the Flye assembly_info.txt file, which contains the columns #seq_name length cov. circ., taking the largest contig as the chromosome and the others as plasmids?
That would create GenBank files that are closer to submission quality.
My current workaround is to re-run Bakta after all genomes are assembled, once I have built the --replicons input file by hand.
thanks
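The workaround described above could be scripted roughly as follows. This is only a sketch under stated assumptions: that assembly_info.txt is tab-separated with the name, length, and circularity flag (Y/N) in the first, second, and fourth columns, and that the --replicons table takes five tab-separated columns (original ID, new ID, type, topology, name). The function names, the new IDs (chrom, p1, p2, ...), and the "-" placeholder are hypothetical, so check them against the Bakta documentation before relying on them.

```python
import csv
import io


def replicons_from_flye(info_text):
    """Turn the text of a Flye assembly_info.txt into replicon table rows.

    Heuristic from the request above: the largest contig is treated as the
    chromosome, every other contig as a plasmid.
    """
    contigs = []
    for line in info_text.splitlines():
        # Skip blank lines and the '#seq_name ...' header line.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Assumed columns: 0 = seq_name, 1 = length, 3 = circ. (Y/N).
        contigs.append((fields[0], int(fields[1]), fields[3] == "Y"))

    # Largest contig first.
    contigs.sort(key=lambda c: c[1], reverse=True)

    rows = []
    for i, (name, _length, circular) in enumerate(contigs):
        if i == 0:
            new_id, replicon_type = "chrom", "chromosome"
        else:
            new_id, replicon_type = f"p{i}", "plasmid"
        topology = "circular" if circular else "linear"
        # Assumed column order: original ID, new ID, type, topology, name.
        rows.append([name, new_id, replicon_type, topology, "-"])
    return rows


def rows_to_tsv(rows):
    """Serialize the rows as tab-separated text."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

In a pipeline loop, the text returned by rows_to_tsv would be written to one file per assembly and passed to --replicons. Note that the largest-contig heuristic will also label unplaced linear fragments as plasmids, so the result is worth a quick look before submission.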
Hi @splaisan ,
thanks for reaching out and asking. Indeed, it would be very nice if Bakta could use the circularity information from Flye directly. This is already possible for Unicycler assemblies, for which Bakta extracts circularity information from the Fasta headers. So, whenever an assembled sequence has a circular=true tag in its Fasta header description, Bakta will use that information in the annotation process and output files.
I totally see your point here and I'd very much like to address this. However, I'm a bit reluctant to do so with Flye-specific parameters, as there are other assemblers too, and handling each of them this way would soon clutter Bakta's usage. I guess the better approach would be to ask the Flye developers to put the required information into the Fasta header, so that Bakta can use the approach that is already implemented. In addition, this would have the nice bonus that circularity information on sequences produced by Flye would be stored along with the sequences themselves, instead of in additional txt files without a standardized format. To this end, I've opened an issue in the Flye repo: mikolmogorov/Flye#647
Maybe you would like to endorse this?
Can you please give an example of a Fasta header that would work?
It is really easy to add a script in between to adapt the Flye headers and make them compatible. Once I have this done, I will share it (bash / bioawk most likely) on the issue page.
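One way such an in-between script might look, as a rough Python sketch rather than the bash / bioawk version mentioned above. It appends the circular=true tag described earlier in this thread to the header of every contig that assembly_info.txt marks as circular. The assumptions are that assembly_info.txt is tab-separated with the name in the first column and Y/N circularity in the fourth, and that the tag is recognized when it follows the sequence ID after a space. Both function names are hypothetical.

```python
def circular_contigs(info_text):
    """Return the names of contigs flagged circular in a Flye assembly_info.txt."""
    names = set()
    for line in info_text.splitlines():
        # Skip blank lines and the '#seq_name ...' header line.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Assumed columns: 0 = seq_name, 3 = circ. (Y/N).
        if fields[3] == "Y":
            names.add(fields[0])
    return names


def tag_fasta_headers(fasta_text, circular):
    """Append 'circular=true' to the headers of the given contigs."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            # Tag only circular contigs, and never tag a header twice.
            if seq_id in circular and "circular=true" not in line:
                line = f"{line} circular=true"
        out.append(line)
    return "\n".join(out) + "\n"
```

With this, a header such as >contig_1 would become >contig_1 circular=true, while linear contigs are left untouched. Sequence lines pass through unchanged, so the script can sit between the assembler and the annotation step in a loop over many assemblies.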
Thanks a lot for your info