Prepare GTDB-Tk data

Pierre Chaumeil edited this page Feb 17, 2022 · 14 revisions

Prepare FastANI database

  • Get all genomes used to generate the archaeal and bacterial trees:

cat raw/gtdb_bac_taxonomy.tsv |awk '{print $1}' > raw_genomes.lst
cat raw/gtdb_arc_taxonomy.tsv |awk '{print $1}' >> raw_genomes.lst
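The two commands above can also be wrapped in one small helper that removes duplicate IDs. This is only a sketch: `build_genome_list` is a hypothetical name, and it assumes the taxonomy files are tab-separated with the genome ID in the first column.

```shell
# Build a de-duplicated genome list from one or more taxonomy files.
# Assumption: tab-separated input, genome ID in the first column.
build_genome_list() {
    # Usage: build_genome_list OUTPUT TAXONOMY_FILE...
    out=$1
    shift
    # cut keeps the first tab-separated column; sort -u drops repeated IDs.
    cut -f1 "$@" | sort -u > "$out"
}

# Example call (paths taken from the commands above):
# build_genome_list raw_genomes.lst raw/gtdb_bac_taxonomy.tsv raw/gtdb_arc_taxonomy.tsv
```

Unlike the `>>` version, rerunning this does not append to an existing raw_genomes.lst; the file is rewritten each time.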

  • Pull the fna files of those genomes in a folder:

gtdb genomes pull --batchfile raw_genomes.lst --genomic --output fastani/

  • Replace GTDB user genome IDs with an NCBI genome accession where available, and a UBA ID otherwise:

python GtdbTK/scripts/rename_UBAs/renameUBAs.py convert_fastani_genomes --input preprocessed/fastani/ --output postprocessed/

  • Compress all genomes in the postprocessed/fastani folder:

pgzip *.fna

Prepare genome_paths.tsv

Copy the create_genome_paths.sh script from the scripts folder to the fastani folder (the one above database/) and run it.

Prepare untrimmed MSA

GTDB-Tk also needs the untrimmed version of each MSA:

gtdb tree create --no_trim --no_tree --genome_batchfile raw_bacterial.lst --guaranteed_batchfile raw_bacterial.lst --output . --marker_set_ids 1
gtdb tree create --no_trim --no_tree --genome_batchfile raw_archaeal.lst --guaranteed_batchfile raw_archaeal.lst --output . --marker_set_ids 2

  • Replace GTDB user genome IDs with an NCBI genome accession where available, and a UBA ID otherwise:

gtdb_release_tk msa_files bac120/gtdb_concatenated.faa ar122/gtdb_concatenated.faa gtdb_r89_metadata_20190612.tsv user_gid_table.tsv 89 postprocessed/

  • Copy the new MSA files to the GTDB-Tk package:

cp postprocessed/bac120_msa_r89.faa gtdbtk_package/msa/gtdb_r89_bac120.faa
cp postprocessed/ar122_msa_r89.faa gtdbtk_package/msa/gtdb_r89_ar122.faa
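Before shipping the copied files, it can be worth checking that each MSA is still a proper alignment, i.e. every sequence has the same length. The sketch below is not part of the original procedure; `msa_lengths` and `check_msa` are hypothetical helper names, and the input is assumed to be plain (uncompressed) FASTA.

```shell
# Print the distinct sequence lengths found in a FASTA file, one per line.
msa_lengths() {
    awk '
        /^>/ { if (seen) print len; len = 0; seen = 1; next }   # new record: emit previous length
        { gsub(/[ \t\r]/, ""); len += length($0) }               # sequence line: add its length
        END { if (seen) print len }                              # emit the last record
    ' "$1" | sort -n -u
}

# Succeed only if all sequences in the file share one length.
check_msa() {
    n=$(msa_lengths "$1" | wc -l | tr -d ' ')
    [ "$n" -eq 1 ]
}

# Example call (file name taken from the cp commands above):
# check_msa gtdbtk_package/msa/gtdb_r89_bac120.faa || echo "bac120 MSA has mixed lengths"
```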

Create Metadata document

  • Get the two dictionaries produced by the phylorank outliers command and paste them into the metadata.txt file
  • Edit version variable

Create pplacer Package:

The pplacer packages are created from the official tree and the official trimmed MSA.

Optional: remove dummy node using gtdb_validation_tk.
gtdb_validation_tk remove_dummy gtdb_<release>_ar_curated.tree gtdb_<release>_ar_no_dummy.tree

  • The first step is to strip the taxonomy from the decorated tree:

genometreetk strip bac120_r89.tree bac120_r89_strippped.tree
genometreetk strip gtdb_<release>_ar_no_dummy.tree ar122_r89_strippped.tree

  • Use FastTree to generate a fitting log:

FastTreeMP -wag -nome -mllen -intree bac120_r89_strippped.tree -log fitting_stats.log < bac120_msa_r89.faa > bac120_r89_fitted.tree
FastTreeMP -wag -nome -mllen -intree ar122_r89_strippped.tree -log fitting_stats.log < ar122_msa_r89.faa > ar122_r89_fitted.tree

  • Redecorate the tree

phylorank decorate bac120_r89_fitted.tree bac120_taxonomy_r89.tsv bac120_r89_pplacer.tree
phylorank decorate ar122_r89_fitted.tree ar122_taxonomy_r89.tsv ar122_r89_pplacer.tree

  • Calculate the dictionaries of RED values used by GTDB-Tk:

phylorank outliers bac120_r89_pplacer.tree bac120_taxonomy_r89.tsv bac120_outliers_children
phylorank outliers ar122_r89_pplacer.tree ar122_taxonomy_r89.tsv ar122_outliers_children

Example of a resulting dictionary:

{"phylum":0.305405949512,"class":0.459651489695,"order":0.63568945839,"family":0.777457279798,"genus":0.9300000}

  • Unroot the tree

python GtdbTK/scripts/rename_UBAs/prepare_gtdbtk_package.py unroot_tree bac120_r89_pplacer.tree pplacer/bac120_r89_unroot.pplacer.tree
python GtdbTK/scripts/rename_UBAs/prepare_gtdbtk_package.py unroot_tree ar122_r89_pplacer.tree pplacer/ar122_r89_unroot.pplacer.tree

  • Remove spaces from unroot.pplacer.tree
  • Generate pkg folder:

taxit create -l gtdbtk.refpkg -P gtdbtk.refpkg --aln-fasta <msa_file> --tree-stats <fasttree_log_file> --tree-file <tree_file>
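The "Remove spaces from unroot.pplacer.tree" step above has no command listed. One minimal way to do it, assuming every space character in the file should go (including any inside node labels), is a plain `tr` filter. `strip_spaces` is a hypothetical helper name, not a script from the repository.

```shell
# Delete every space character from a tree file.
# Assumption: no space in the file needs to be preserved.
strip_spaces() {
    # Usage: strip_spaces INPUT_TREE OUTPUT_TREE
    # OUTPUT_TREE must differ from INPUT_TREE, since the shell
    # truncates the output file before tr reads the input.
    tr -d ' ' < "$1" > "$2"
}

# Example call (the output name is a placeholder):
# strip_spaces pplacer/bac120_r89_unroot.pplacer.tree pplacer/bac120_r89_unroot.nospace.tree
```

Writing to a second file keeps the original tree available for comparison before it is passed to taxit.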

  • Copy the pplacer package in GTDB-Tk data folder

Prepare gtdb_radii file

cat sp_clusters.tsv | awk 'BEGIN {FS="\t"}; {printf ("%s\t%s\t%s\n", $2, $1, $4)}' > gtdb_radii.tsv
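To make the column shuffle explicit: the awk program writes input columns 2, 1 and 4, in that order, separated by tabs. The sketch below wraps the same program in a hypothetical `reorder_radii` function so it can be tried on a small test file first; the source does not say what each column of sp_clusters.tsv holds, so the example row uses placeholder values only.

```shell
# Same awk program as above, reading the file directly instead of via cat.
reorder_radii() {
    # Usage: reorder_radii SP_CLUSTERS_TSV > gtdb_radii.tsv
    awk 'BEGIN {FS="\t"}; {printf ("%s\t%s\t%s\n", $2, $1, $4)}' "$1"
}

# Placeholder row to illustrate the reordering:
# printf 'c1\tc2\tc3\tc4\n' > demo.tsv
# reorder_radii demo.tsv    # prints: c2<TAB>c1<TAB>c4
```

Every input line is processed the same way, so if sp_clusters.tsv starts with a header row, that row is reordered and written to gtdb_radii.tsv as well.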

Misc commands

  • Rename versions:

find . -type l -name 'ar*' -exec rename 's/86/86.1/' {} \;