produce the --replicons input content based on the flye assembly_info.txt #256

Open
splaisan opened this issue Oct 31, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@splaisan

The tool works great using docker but the naming is not very nice when starting from a flye assembly where contigs are named randomly.

I found --replicons very useful in that regard, but it requires creating a TSV file up front, which is not easy when looping through many assemblies in an integrated pipeline (dozens of assemblies in a row).

Would it be possible to internally create the --replicons input file based on the content of the flye assembly_info.txt file which contains columns #seq_name length cov. circ. and taking the largest contig as chromosome and the others as plasmids?

That would create genbank files which are closer to submission quality

My current workaround is to re-run Bakta after all genomes are assembled, once I have built the --replicons input file by hand.

thanks
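The request above (largest contig as chromosome, all others as plasmids) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `replicons_from_assembly_info` is a hypothetical helper, the `circ.` column is assumed to hold `Y` or `N`, and the five-column output layout (original id, new id, type, topology, name) with no header row is an assumption that should be checked against Bakta's documentation for `--replicons`.

```python
import csv
import io


def replicons_from_assembly_info(info_text: str) -> str:
    """Turn the text of a Flye assembly_info.txt into a replicon table (TSV text).

    Assumed input columns (tab-separated): #seq_name, length, cov., circ., ...
    Assumed output columns: original id, new id, type, topology, name.
    """
    contigs = []
    for line in info_text.splitlines():
        # Skip blank lines and the "#seq_name ..." header line.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        contigs.append((fields[0], int(fields[1]), fields[3]))

    # Largest contig first: it is treated as the chromosome (a heuristic).
    contigs.sort(key=lambda c: c[1], reverse=True)

    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for index, (name, _length, circ) in enumerate(contigs):
        topology = "circular" if circ == "Y" else "linear"
        if index == 0:
            writer.writerow([name, "chromosome", "chromosome", topology, "-"])
        else:
            writer.writerow([name, f"plasmid_{index}", "plasmid", topology, "-"])
    return out.getvalue()
```

In a pipeline this could be called once per assembly directory, writing the returned text to a file that is then passed to --replicons. Note that "largest contig is the chromosome" will mislabel fragmented assemblies and multi-chromosome genomes.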

@splaisan splaisan added the enhancement New feature or request label Oct 31, 2023
@oschwengers
Owner

Hi @splaisan,
thanks for reaching out and asking. Indeed, it would be very nice if Bakta were able to instantly use circularity information from Flye. Actually, this is already possible for Unicycler assemblies from which Bakta extracts circularity information from the Fasta headers. So, whenever an assembled sequence has a circular=true tag in its Fasta header description, Bakta will use that information in the annotation process and output files.

I totally see your point here and I'd very much like to address this. However, I'm a bit reluctant to do so with Flye-specific parameters, since adding such parameters for every other assembler would soon clutter Bakta's usage. I guess the better approach would be to ask the Flye developers to put the required information into the Fasta header, so that Bakta can use the approach that is already implemented. In addition, this would have the nice bonus that circularity information on sequences produced by Flye would be stored along with the sequences themselves, instead of in additional txt files without a standardized format. To this end, I've opened an issue in the Flye repo: mikolmogorov/Flye#647
Maybe you would like to endorse this?

@splaisan
Author

Can you please give an example of a fasta header that would work?
It is really easy to add a script in between to adapt the flye headers and make them compatible. When I have this done, I will share it (bash / bioawk most likely) on the issue page.
Thanks a lot for your info

@oschwengers
Owner

Sure. This is a recent example from a Unicycler assembly:
>1 length=4635742 depth=1.00x circular=true

In this case, Bakta is able to extract this information and mark this sequence as complete and circular.
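As a stopgap until an assembler writes such headers itself, the in-between script mentioned earlier in the thread could look roughly like the Python sketch below. It is not part of Bakta or Flye: `parse_assembly_info` and `add_circular_tags` are hypothetical helpers, and the code assumes that assembly_info.txt is tab-separated with `Y`/`N` in the `circ.` column. Only the `circular=true` tag is taken from the example above; the tag is simply omitted for linear sequences.

```python
def parse_assembly_info(info_text: str) -> dict:
    """Map each sequence name to (length, is_circular) from assembly_info.txt text."""
    info = {}
    for line in info_text.splitlines():
        # Skip blank lines and the "#seq_name ..." header line.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Assumption: column 4 ("circ.") is "Y" for circular sequences.
        info[fields[0]] = (int(fields[1]), fields[3] == "Y")
    return info


def add_circular_tags(fasta_text: str, info: dict) -> str:
    """Rewrite FASTA headers as '>id length=N [circular=true]' for known sequences."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split()
            if parts and parts[0] in info:
                length, circular = info[parts[0]]
                line = f">{parts[0]} length={length}"
                if circular:
                    line += " circular=true"
        out.append(line)
    return "\n".join(out) + "\n"
```

Headers of sequences that are not listed in assembly_info.txt are left unchanged, so the script can be run on any FASTA file without dropping records.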
