Home
== PhyloPhlAn: microbial Tree of Life using 400 universal proteins ==
PhyloPhlAn is a computational pipeline for reconstructing highly accurate and resolved phylogenetic trees based on whole-genome sequence information. The pipeline is scalable to thousands of genomes and uses the most conserved 400 proteins for extracting the phylogenetic signal. PhyloPhlAn also implements taxonomic curation, estimation, and insertion operations.
The main features of PhyloPhlAn are:
- completely automatic, as the user needs only to provide the (unannotated) protein sequences of the input genomes (as multifasta files of peptides - not nucleotides)
- very high topological accuracy and resolution because of the use of up to 400 previously identified most conserved proteins
- the possibility of integrating new genomes in the already reconstructed most comprehensive tree of life (3,171 microbial genomes)
- taxonomy estimation for the newly inserted genomes
- taxonomic curation for the produced phylogenetic trees
== new PhyloPhlAn implementation [alpha version] ==
We are developing a new version of PhyloPhlAn; the new PhyloPhlAn wiki page is at https://bitbucket.org/nsegata/phylophlan/wiki/phylophlan2.
Please note that it is still an alpha release available in the {{{dev}}} branch of the repository.
== Obtaining PhyloPhlAn ==
PhyloPhlAn can be downloaded as a tarball (https://bitbucket.org/nsegata/phylophlan/get/default.tar.gz) or accessed from our Bitbucket repository (https://bitbucket.org/nsegata/phylophlan).
PhyloPhlAn can also be obtained using Mercurial (http://mercurial.selenic.com/) as follows:
{{{
$ hg clone https://bitbucket.org/nsegata/phylophlan
}}}
The package can also be downloaded as a compressed file in zip (https://bitbucket.org/nsegata/phylophlan/get/default.zip) and tar.bz2 (https://bitbucket.org/nsegata/phylophlan/get/default.tar.bz2) formats.
PhyloPhlAn has been developed and tested on Unix-based systems. On Windows or Mac systems, PhyloPhlAn may require some tweaking.
== Citing PhyloPhlAn ==
If you find the software or methodology useful, please cite the accompanying manuscript:

PhyloPhlAn is a new method for improved phylogenetic and taxonomic placement of microbes (http://www.ncbi.nlm.nih.gov/pubmed/23942190) \\
//Nicola Segata, Daniela Börnigen, Xochitl C. Morgan, and Curtis Huttenhower.// \\
Nature Communications 4, 2013
You can download PhyloPhlAn's tree in Newick format (https://bitbucket.org/nsegata/phylophlan/wiki/bs_tree.reroot.nwk, with bootstrapping support), in which the genome labels are encoded with the IMG (http://img.jgi.doe.gov/cgi-bin/w/main.cgi) taxon ID (prefixed with 't'). The same tree is also available with leaf nodes annotated with labels for:
- species: https://bitbucket.org/nsegata/phylophlan/wiki/bs_tree.reroot.spe_labels.nwk
- genus: https://bitbucket.org/nsegata/phylophlan/wiki/bs_tree.reroot.gen_labels.nwk
- family: https://bitbucket.org/nsegata/phylophlan/wiki/bs_tree.reroot.fam_labels.nwk
- phylum: https://bitbucket.org/nsegata/phylophlan/wiki/bs_tree.reroot.phy_labels.nwk
In addition, we provide the alignment archives {{{ppa.aln.tar.bz2}}} (https://bitbucket.org/nsegata/phylophlan/src/ee2e2ed911c8/data/ppaalns/ppa.aln.tar.bz2) and {{{ppafull.aln.faa.tar.bz2}}} (https://bitbucket.org/nsegata/phylophlan/wiki/ppafull.aln.faa.tar.bz2).
The image below reports the comprehensive, automated, and high-resolution microbial tree of life with taxonomic annotations obtained with PhyloPhlAn. It contains a total of 3,737 microbial genomes.
{{https://bitbucket.org/nsegata/phylophlan/wiki/phylophlan.png|PhyloPhlAn}}
A high-resolution version of this image can be downloaded.
== Updates and mailing list ==
Software updates will be posted on the Bitbucket repository page (https://bitbucket.org/nsegata/phylophlan/). You are more than welcome to use the issue tracker (https://bitbucket.org/nsegata/phylophlan/issues) on Bitbucket (or email us) to provide feedback, report bugs, and suggest or request new features.
If you have questions or comments, or you would like to be notified about new versions, new features, or any other news related to PhyloPhlAn, please join our mailing list:
https://groups.google.com/d/forum/phylophlan-users
== Common commands and examples ==
==== //"De novo" phylogenetic tree building with any set of genomes// ====
If you would like to build a phylogenetic tree using any set of private or public genomes, all you need to do is create a folder inside the {{{input}}} folder and copy into it one multifasta file (with extension ".faa") for each genome, containing its peptide sequences. If you call this folder "my_genomes", here is the command you need to call:
{{{
#!bash
$ ./phylophlan.py -u my_genomes
}}}
When finished, the resulting tree will appear in the {{{output/my_genomes}}} folder.
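The staging step described above can be scripted. The following is a minimal sketch, not part of PhyloPhlAn: the helper names {{{stage_genomes}}} and {{{build_command}}} are hypothetical, and it only assumes the layout stated above (one ".faa" file per genome inside {{{input/<project>}}}).

```python
import shutil
from pathlib import Path


def stage_genomes(faa_dir, phylophlan_dir, project):
    """Copy every .faa file from faa_dir into <phylophlan_dir>/input/<project>.

    Hypothetical helper. Returns the sorted list of genome names
    (file names without the .faa suffix).
    """
    project_dir = Path(phylophlan_dir) / "input" / project
    project_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    for faa in sorted(Path(faa_dir).glob("*.faa")):
        shutil.copy(faa, project_dir / faa.name)
        staged.append(faa.stem)
    return staged


def build_command(project, nproc=1):
    """Return the de novo tree building call as an argument list.

    Only the -u and --nproc options documented on this page are used.
    """
    return ["./phylophlan.py", "-u", project, "--nproc", str(nproc)]
```

The list returned by {{{build_command}}} can be passed to {{{subprocess.run}}} with the PhyloPhlAn folder as working directory.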
==== //Example 1: Corynebacterium "de novo" phylogenetic tree building// ====
You can try out this operation ({{{-u}}}) using an example included in the PhyloPhlAn package you downloaded, called {{{example_corynebacteria}}} and stored in the {{{input}}} folder. It contains a protein multifasta file for each of the 30 genomes available for the genus Corynebacterium (http://wikipedia.org/wiki/Corynebacterium) as of February 2012, plus two Streptomyces (http://wikipedia.org/wiki/Streptomyces) genomes as a meaningful outgroup. As mentioned above, the command for obtaining the phylogenetic tree is:
{{{
#!bash
$ ./phylophlan.py -u example_corynebacteria --nproc 4
}}}
Using 4 threads (specified with {{{--nproc 4}}}), this operation should take no more than 4-5 minutes, and even with a single processor (the default) you should get the results in 10 minutes or so.
In the {{{output/example_corynebacteria/}}} folder you'll find a Newick (http://en.wikipedia.org/wiki/Newick_format) file of the resulting tree as provided by FastTree (http://www.microbesonline.org/fasttree/), and a PhyloXML (http://en.wikipedia.org/wiki/PhyloXML) file containing the same tree rerooted with a procedure that tries to maximize the distance from the root to any leaf. The two files are also available for download, and can be inspected with tree visualization software (http://en.wikipedia.org/wiki/List_of_phylogenetic_tree_visualization_software) and drawn with GraPhlAn (https://bitbucket.org/nsegata/graphlan). Figure 3B in the manuscript (http://www.ncbi.nlm.nih.gov/pubmed/xxxxxx) reports and discusses this example.
The full tree of life reported above was also originally generated in this way. Notice that the concatenated alignment used to generate the tree with FastTree is stored in {{{data/example_corynebacteria/aln.fna}}} and can be used as input for other phylogenetic reconstruction software such as RAxML (https://github.com/stamatak/standard-RAxML) or MEGA (http://www.megasoftware.net/), among others (http://en.wikipedia.org/wiki/List_of_phylogenetics_software).
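Before handing the concatenated alignment to another tool, it can be useful to check that it is well formed. The sketch below is an illustration only: it assumes the alignment file is in plain FASTA format with one record per genome, and the function names are hypothetical.

```python
def read_fasta(path):
    """Map the first word of each FASTA header to its full sequence."""
    records = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                parts = line[1:].split()
                name = parts[0] if parts else ""
                records[name] = []
            elif name is not None:
                # Sequence lines may be wrapped; collect the chunks.
                records[name].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


def alignment_summary(records):
    """Return (number of sequences, alignment length).

    Raises ValueError if the sequences do not all have the same length,
    which would mean the file is not a valid alignment.
    """
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) > 1:
        raise ValueError("sequences have different lengths: %s" % sorted(lengths))
    return len(records), (lengths.pop() if lengths else 0)
```

For example, {{{alignment_summary(read_fasta("data/example_corynebacteria/aln.fna"))}}} would report how many genomes and how many aligned positions the file contains.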
==== //Inserting new genomes to the tree of life// ====
PhyloPhlAn lets you insert a genome (or a set of genomes) into the already built microbial tree of life (containing >3,000 genomes, see figure and tree files above). In this case too, you need to create a dedicated folder (e.g. {{{my_genomes_to_insert}}}) in the {{{input}}} folder to store the protein multifasta files of interest. The command is:
{{{
#!bash
$ ./phylophlan.py -i my_genomes_to_insert --nproc 16
}}}
We recommend using as many threads as possible ({{{--nproc}}}), because this operation is quite computationally demanding: the alignments with the other 3,000 genomes must be updated and the full tree of life rebuilt.
The resulting tree file {{{output/my_genomes_to_insert/my_genomes_to_insert.tree.int.nwk}}} can be inspected with tree visualization software to check where the new genomes are rooted and their relations with already well characterized strains.
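As a quick sanity check before opening the tree in a viewer, one can verify that the inserted genomes actually appear among its leaves. The following is a rough sketch rather than a full Newick parser: it assumes unquoted leaf labels, and the function names are hypothetical.

```python
import re

# A leaf label directly follows "(" or "," and runs up to the next
# structural character. Internal node labels (for example bootstrap
# supports) follow ")" and are therefore skipped.
# Assumption: labels are not quoted and contain no whitespace.
LEAF = re.compile(r"[(,]\s*([^\s(),:;]+)")


def leaf_labels(newick):
    """Return the set of leaf labels found in a Newick string."""
    return set(LEAF.findall(newick))


def missing_genomes(newick, expected):
    """Return the expected genome names that are not leaves of the tree."""
    return sorted(set(expected) - leaf_labels(newick))
```

Calling {{{missing_genomes}}} on the content of the {{{.tree.int.nwk}}} file with the names of the inserted genomes should return an empty list if every genome was placed.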
==== //Example 2: inserting Lactobacillus and Sulfolobus genomes into the tree of life// ====
As an example of insertion, the {{{input}}} folder of the PhyloPhlAn package includes three recently sequenced genomes that are not yet part of the PhyloPhlAn tree and repository. These are two Lactobacillus (http://wikipedia.org/wiki/Lactobacillus) genomes and one Sulfolobus (http://wikipedia.org/wiki/Sulfolobus) genome available in IMG (accessions http://img.jgi.doe.gov/cgi-bin/w/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2511231185, http://img.jgi.doe.gov/cgi-bin/w/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2519899592, and http://img.jgi.doe.gov/cgi-bin/w/main.cgi?section=TaxonDetail&page=taxonDetail&taxon_oid=2524023197 respectively).
{{{
#!bash
$ ./phylophlan.py -i example_insertion --nproc 16
}}}
The resulting file {{{example_insertion.tree.int.nwk}}} now contains the thousands of genomes in the PhyloPhlAn repository as well as the three "new" genomes.
==== //Imputing taxonomic labels for newly integrated genomes// ====
You can also ask PhyloPhlAn to try to automatically assign taxonomic labels to the genomes integrated into the tree of life ({{{-i}}} option introduced above). This is done by simply adding the {{{-t}}} flag (for taxonomic analysis) to the same command line:
{{{
#!bash
$ ./phylophlan.py -i -t my_genomes_to_insert --nproc 16
}}}
In addition to the {{{output/my_genomes_to_insert/my_genomes_to_insert.tree.int.nwk}}} file, you will obtain tab-separated text files with the most confident taxonomic predictions for your genomes in the {{{output/my_genomes_to_insert/}}} folder.
==== //Example 3: predicting the taxonomic labels of three "new" genomes// ====
Suppose you don't know the taxonomic labels of the Lactobacillus and Sulfolobus genomes used as examples above, possibly because of insufficient phenotypic characterization or because you obtained them with metagenomic assembly. You can call the PhyloPhlAn taxonomic imputation pipeline as:
{{{
#!bash
$ ./phylophlan.py -i -t example_insertion --nproc 16
}}}
And check the predictions in the file that we report below:
{{{
Sulfolobus_acidocaldarius_N8 d__Archaea.p__Crenarchaeota.c__Thermoprotei.o__Sulfolobales.f__Sulfolobaceae.g__Sulfolobus.s__?.t__?
Lactobacillus_rhamnosus_K_ATCC_8530 d__Bacteria.p__Firmicutes.c__Bacilli.o__Lactobacillales.f__Lactobacillaceae.g__Lactobacillus.s__rhamnosus.t__?
Lactobacillus_rhamnosus_LRHMDP3 d__Bacteria.p__Firmicutes.c__Bacilli.o__Lactobacillales.f__Lactobacillaceae.g__Lactobacillus.s__rhamnosus.t__?
}}}
As expected, all three genomes are assigned to the right genera. The two lactobacilli could also be assigned to the right species ({{{s__rhamnosus}}}), whereas PhyloPhlAn does not find enough support to assign the Sulfolobus genome to the "acidocaldarius" species.
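Prediction lines like the ones above are easy to post-process. The sketch below is an illustration, not part of PhyloPhlAn: it assumes each line has a genome name, whitespace, and a dot-separated lineage of {{{x__name}}} fields, with {{{?}}} marking an unassigned level. The mapping of one-letter prefixes to rank names (in particular {{{t}}} as strain) is an assumption, and taxon names containing a dot would need extra handling.

```python
# Assumed meaning of the one-letter prefixes used in the lineage string.
RANKS = {"d": "domain", "p": "phylum", "c": "class", "o": "order",
         "f": "family", "g": "genus", "s": "species", "t": "strain"}


def parse_prediction(line):
    """Split one prediction line into (genome, [(rank, name), ...])."""
    genome, lineage = line.split(None, 1)
    taxa = []
    for field in lineage.strip().split("."):
        prefix, _, name = field.partition("__")
        taxa.append((RANKS.get(prefix, prefix), name))
    return genome, taxa


def deepest_resolved(taxa):
    """Return the most specific (rank, name) pair that is not '?', or None."""
    resolved = [(rank, name) for rank, name in taxa if name != "?"]
    return resolved[-1] if resolved else None
```

Applied to the three lines above, {{{deepest_resolved}}} would stop at the genus for the Sulfolobus genome and at the species for the two Lactobacillus genomes.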
== All command line options and parameters ==
{{{
$ ./phylophlan.py -h
usage: phylophlan.py [-h] [-i] [-u] [-t] [--tax_test TAX_TEST] [-c]
                     [--cleanall] [--nproc N] [-v]
                     [PROJECT NAME]

NAME AND VERSION:
PhyloPhlAn version 0.99 (8 May 2013)

AUTHORS:
Nicola Segata ([email protected]) and Curtis Huttenhower ([email protected])

DESCRIPTION
PhyloPhlAn is a computational pipeline for reconstructing highly accurate
and resolved phylogenetic trees based on whole-genome sequence information.
The pipeline is scalable to thousands of genomes and uses the most conserved
400 proteins for extracting the phylogenetic signal. PhyloPhlAn also
implements taxonomic curation, estimation, and insertion operations.

positional arguments:
  PROJECT NAME          The basename of the project corresponding to the name
                        of the input data folder inside input/. The input data
                        consist of a collection of multifasta files (extension
                        .faa) containing the proteins in each genome. If the
                        project already exists, the already executed steps are
                        not re-ran. The results will be stored in a folder
                        with the project basename in output/ Multiple project
                        can be generated and they safetely coexists.

optional arguments:
  -h, --help            show this help message and exit
  -i, --integrate       Integrate user genomes into the PhyloPhlAn tree
  -u, --user_tree       Build a phylogenetic tree using user genomes only
  -t, --taxonomic_analysis
                        Check taxonomic inconsistencies and refine/correct
                        taxonomic labels
  --tax_test TAX_TEST   nerrors:type:taxl:tmin:tex:name (alpha version,
                        experimental!)
  -c, --clean           Clean the final and partial data produced for the
                        specified project. (use --cleanall for removing
                        general installation and database files)
  --cleanall            Remove all instalation and database file leaving
                        untouched the initial compressed data that is
                        automatically extracted and formatted at the first
                        pipeline run. Projects are not remove (specify a
                        project and use -c for removing projects).
  --nproc N             The number of CPUs to use for parallelizing the
                        blasting [default 1, i.e. no parallelism]
  -v, --version         Prints the current PhyloPhlAn version and exit
}}}
== External Software Dependencies ==
- MUSCLE (http://www.drive5.com/muscle/) version v3.8.31 or higher must be present in the system path and called "muscle"
- USEARCH (http://www.drive5.com/usearch/) version v5.2.32 (notice that version 6 is currently NOT supported) must be present in the system path and called "usearch"
- FastTree (http://www.microbesonline.org/fasttree/) version 2.1 or higher must be present in the system path and called "FastTree"
- Biopython (http://biopython.org/wiki/Download): strictly speaking a PyPhlAn dependency, but used inside PhyloPhlAn
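Whether these dependencies are reachable can be checked before the first run. This is a minimal sketch with hypothetical function names. It only tests that executables with the names listed above are on the system path and that the {{{Bio}}} package (Biopython) is importable; it does not check versions.

```python
import importlib.util
import shutil

# Executable names exactly as PhyloPhlAn expects them on the system path.
REQUIRED = ("muscle", "usearch", "FastTree")


def missing_executables(names=REQUIRED, which=shutil.which):
    """Return the executables from names that are not found on the path."""
    return [name for name in names if which(name) is None]


def biopython_available():
    """Return True if the Biopython package (module 'Bio') can be imported."""
    return importlib.util.find_spec("Bio") is not None
```

An empty list from {{{missing_executables()}}} together with {{{True}}} from {{{biopython_available()}}} means all four dependencies were found. Versions still have to be checked by hand, in particular for usearch.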
== Acknowledgements ==
The authors of PhyloPhlAn would like to thank Ashlee Earl and the Human Microbiome Project Strains Working Group for insightful suggestions, Morgan Price for his helpful comments on applying FastTree, and Levi Waldron, Joshua Reyes and Timothy Tickle for their suggestions on methodology and tree visualization.
== Change log ==
Changes in version 0.99 (8 May 2013)
{{{
Updates:
- Pyphlan dependency removal
- command line arguments simplified
}}}
Changes in version 0.98 (28 July 2012)
{{{
Bug fixes:
- missing data file added
}}}
Changes in version 0.97 (24 July 2012)
{{{
First public release
}}}