Prepare GTDB-Tk data
- Get all genomes used to generate Archaeal and Bacterial tree:
cat raw/gtdb_bac_taxonomy.tsv |awk '{print $1}' > raw_genomes.lst
cat raw/gtdb_arc_taxonomy.tsv |awk '{print $1}' >> raw_genomes.lst
- Pull the fna files of those genomes in a folder:
gtdb genomes pull --batchfile raw_genomes.lst --genomic --output fastani/
- Replace GTDB user genome IDs with an NCBI genome accession where available, and a UBA ID otherwise:
python GtdbTK/scripts/rename_UBAs/renameUBAs.py convert_fastani_genomes --input preprocessed/fastani/ --output postprocessed/
- Compress all genomes in the postprocessed/fastani folder:
pgzip *.fna
Copy the create_genome_paths.sh script from the scripts folder to the fastani folder (above database/) and run it.
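The genome list built in the first step can be made robust against repeated IDs before anything is pulled. The following is a minimal sketch, assuming the taxonomy files are tab-separated with the genome ID in the first column; the function name build_genome_list is hypothetical and not part of the GTDB tooling.

```shell
# Hypothetical helper: collect genome IDs (column 1) from one or more
# taxonomy files into a sorted, de-duplicated list.
# Assumes tab-separated input with the genome ID in the first column.
# Usage (paths taken from the steps above):
#   build_genome_list raw_genomes.lst raw/gtdb_bac_taxonomy.tsv raw/gtdb_arc_taxonomy.tsv
build_genome_list() {
    _out=$1
    shift
    # Skip empty lines, print the ID, then sort and drop duplicates.
    awk -F'\t' 'NF > 0 { print $1 }' "$@" | sort -u > "$_out"
}
```

Unlike the cat/awk pair above, the result is sorted, so the order of genomes in the list differs from the order in the taxonomy files.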
GTDB-Tk also needs the untrimmed version of each MSA:
gtdb tree create --no_trim --no_tree --genome_batchfile raw_bacterial.lst --guaranteed_batchfile raw_bacterial.lst --output . --marker_set_ids 1
gtdb tree create --no_trim --no_tree --genome_batchfile raw_archaeal.lst --guaranteed_batchfile raw_archaeal.lst --output . --marker_set_ids 2
- Replace GTDB user genome IDs with an NCBI genome accession where available, and a UBA ID otherwise:
gtdb_release_tk msa_files bac120/gtdb_concatenated.faa ar122/gtdb_concatenated.faa gtdb_r89_metadata_20190612.tsv user_gid_table.tsv 89 postprocessed/
- Copy the new MSA files to the GTDB-Tk package:
cp postprocessed/bac120_msa_r89.faa gtdbtk_package/msa/gtdb_r89_bac120.faa
cp postprocessed/ar122_msa_r89.faa gtdbtk_package/msa/gtdb_r89_ar122.faa
- Get the two dictionaries produced by the phylorank outliers command and paste them into the metadata.txt file
- Edit version variable
The pplacer packages are created using the official tree and the official trimmed MSA.
Optional: remove dummy node using gtdb_validation_tk.
gtdb_validation_tk remove_dummy gtdb_<release>_ar_curated.tree gtdb_<release>_ar_no_dummy.tree
- First step is to strip the taxonomy from the decorated tree:
genometreetk strip bac120_r89.tree bac120_r89_strippped.tree
genometreetk strip gtdb_<release>_ar_no_dummy.tree ar122_r89_strippped.tree
- Use FastTree to generate a fitting log:
FastTreeMP -wag -nome -mllen -intree bac120_r89_strippped.tree -log fitting_stats.log < bac120_msa_r89.faa > bac120_r89_fitted.tree
FastTreeMP -wag -nome -mllen -intree ar122_r89_strippped.tree -log fitting_stats.log < ar122_msa_r89.faa > ar122_r89_fitted.tree
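Both FastTreeMP calls above write to fitting_stats.log, so if they are run in the same directory the second log overwrites the first. A minimal sketch of one way around this, assuming a per-domain log name is acceptable: the helper fasttree_cmd and the <prefix>_<release>_fitting_stats.log naming are hypothetical, and the helper only prints the command rather than running it. The flags and the other file names are the ones used above.

```shell
# Hypothetical helper: print (not run) the FastTreeMP command for one domain,
# using a per-domain log file so the two runs do not overwrite each other.
fasttree_cmd() {
    _prefix=$1   # e.g. bac120 or ar122
    _rel=$2      # e.g. r89
    printf 'FastTreeMP -wag -nome -mllen -intree %s -log %s < %s > %s\n' \
        "${_prefix}_${_rel}_strippped.tree" \
        "${_prefix}_${_rel}_fitting_stats.log" \
        "${_prefix}_msa_${_rel}.faa" \
        "${_prefix}_${_rel}_fitted.tree"
}

# Print the commands for both domains.
fasttree_cmd bac120 r89
fasttree_cmd ar122 r89
```

Whichever log name is used, the same file has to be passed later as the tree statistics file when the reference package is built.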
- Redecorate the tree
phylorank decorate bac120_r89_fitted.tree bac120_taxonomy_r89.tsv bac120_r89_pplacer.tree
phylorank decorate ar122_r89_fitted.tree ar122_taxonomy_r89.tsv ar122_r89_pplacer.tree
- Calculate the dictionaries of RED values used by GTDB-Tk:
phylorank outliers bac120_r89_pplacer.tree bac120_taxonomy_r89.tsv bac120_outliers_children
phylorank outliers ar122_r89_pplacer.tree ar122_taxonomy_r89.tsv ar122_outliers_children
Example dictionary from the outliers output: {"phylum":0.305405949512,"class":0.459651489695,"order":0.63568945839,"family":0.777457279798,"genus":0.9300000}
- Unroot the tree
python GtdbTK/scripts/rename_UBAs/prepare_gtdbtk_package.py unroot_tree bac120_r89_pplacer.tree pplacer/bac120_r89_unroot.pplacer.tree
python GtdbTK/scripts/rename_UBAs/prepare_gtdbtk_package.py unroot_tree ar122_r89_pplacer.tree pplacer/ar122_r89_unroot.pplacer.tree
- Remove spaces from unroot.pplacer.tree
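No command is given for this step. A minimal sketch, assuming that literally every space character in the tree file should be deleted (including spaces inside node labels); the function name strip_tree_spaces is hypothetical.

```shell
# Hypothetical helper: delete every space character from a Newick tree file.
# The result is written to a scratch file first, so the original is only
# replaced if tr succeeds.
strip_tree_spaces() {
    _tree=$1
    _scratch="$_tree.tmp.$$"
    tr -d ' ' < "$_tree" > "$_scratch" && mv "$_scratch" "$_tree"
}

# Usage (file names taken from the unroot step above):
#   strip_tree_spaces pplacer/bac120_r89_unroot.pplacer.tree
#   strip_tree_spaces pplacer/ar122_r89_unroot.pplacer.tree
```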
- Generate pkg folder:
taxit create -l gtdbtk.refpkg -P gtdbtk.refpkg --aln-fasta <msa_file> --tree-stats <fasttree_log_file> --tree-file <tree_file>
- Copy the pplacer package into the GTDB-Tk data folder
cat sp_clusters.tsv | awk 'BEGIN {FS="\t"}; {printf ("%s\t%s\t%s\n", $2, $1, $4)}' > gtdb_radii.tsv
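To make the awk reordering above easier to check, here is a small sketch that wraps the same column selection in a function and runs it on a made-up row. The function name make_radii and the field values colA to colD are placeholders, not the real sp_clusters.tsv layout; only the positional reordering (column 2, column 1, column 4) comes from the command above.

```shell
# Hypothetical wrapper: emit columns 2, 1 and 4 of tab-separated input,
# in that order, separated by tabs. Reads the named files, or stdin if
# no file is given.
make_radii() {
    awk 'BEGIN { FS = "\t"; OFS = "\t" } { print $2, $1, $4 }' "$@"
}

# Made-up row; prints: colB<TAB>colA<TAB>colD
printf 'colA\tcolB\tcolC\tcolD\n' | make_radii

# Usage on the real file:
#   make_radii sp_clusters.tsv > gtdb_radii.tsv
```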
- Rename versions:
find . -type l -name 'ar*' -exec rename 's/86/86.1/' {} \;
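The rename step above relies on the Perl-style rename utility, which is not installed everywhere. A minimal alternative sketch in plain shell, assuming the goal is to replace the first 86 in each matching symlink name with 86.1; the function name bump_link_versions is hypothetical. Unlike the rename call, it edits only the file name, never a directory component of the path, and it does not handle names containing newlines.

```shell
# Hypothetical helper: rename symlinks named ar* under a directory,
# replacing the first "86" in the link name with "86.1".
bump_link_versions() {
    _dir=$1
    # Collect the matches first so renamed links are not visited twice.
    _links=$(find "$_dir" -type l -name 'ar*')
    [ -n "$_links" ] || return 0
    printf '%s\n' "$_links" | while IFS= read -r _path; do
        _name=$(basename "$_path")
        _new=$(printf '%s\n' "$_name" | sed 's/86/86.1/')
        # Only move links whose name actually changes.
        if [ "$_new" != "$_name" ]; then
            mv -- "$_path" "$(dirname "$_path")/$_new"
        fi
    done
}

# Usage, from the folder that holds the links:
#   bump_link_versions .
```

As with the original command, running it twice would turn 86.1 into 86.1.1, so it should be run once per release bump.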