Merge branch 'dev' into devForPanorama

labgem · Mar 25, 2024 · 86ed54c
2 parents 501db9a + 67f03e4
Showing 14 changed files with 191 additions and 97 deletions.
diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
@@ -177,7 +177,7 @@ jobs:
         # Default separator is a pipe but a pipe is found in a value of metadata db1. That is why we use another separator here. 
         ppanggolin write_genomes -p mybasicpangenome/pangenome.h5 --output mybasicpangenome/genomes_outputs \
                                 --genomes genome_names.fasta.head.list \
-                                  -f --gff --add_metadata --table --metadata_sep § 
+                                  -f --gff --add_metadata --table --metadata_sep § --proksee
 
         # A pipe separator is found in metadata source db1. If we don't require this source, then writing with the pipe separator works fine.
         ppanggolin write_genomes -p mybasicpangenome/pangenome.h5 --output mybasicpangenome/genomes_outputs_with_metadata -f --gff --proksee --table --add_metadata  --metadata_sources db2 db3 db4 

diff --git a/VERSION b/VERSION
@@ -1 +1 @@
-2.0.4
+2.0.5
diff --git a/docs/index.md b/docs/index.md
@@ -73,6 +73,7 @@ user/PangenomeAnalyses/pangenomeAnalyses
 user/RGP/rgpAnalyses
 user/Modules/moduleAnalyses
 user/writeGenomes
+user/writeFasta
 user/align
 user/projection
 user/genomicContext

diff --git a/docs/user/PangenomeAnalyses/pangenomeAnnotation.md b/docs/user/PangenomeAnalyses/pangenomeAnnotation.md
@@ -8,7 +8,7 @@ If you do so, the provided genomes will be annotated using the following tools:
 - [ARAGORN](http://www.ansikte.se/ARAGORN/) to annotate tRNAs
 - [Infernal](http://eddylab.org/infernal/) coupled with HMM of the bacterial and archaeal rRNAs downloaded from [RFAM](https://rfam.xfam.org/) to annotate rRNAs.
 
-To proceed with this stage of the pipeline, you need to create an **organisms.fasta.list** file. 
+To proceed with this stage of the pipeline, you need to create a **genomes.fasta.list** file. 
 This file should be tab-separated with each line depicting an individual genome and
 its pertinent information with the following organization (only the first two columns are mandatory):
 
@@ -17,12 +17,12 @@ its pertinent information with the following organization (only the first two co
 - The following columns contain Contig identifiers present in the associated FASTA file that should be analyzed as being circular.
 For the 'circular contig identifiers,' if you do not have access to this information, you can safely ignore this part as it does not have a big impact on the resulting pangenome.
 
-You can check [this example input file](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/organisms.fasta.list).
+You can check [this example input file](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/genomes.fasta.list).
 
 To run the annotation part, you can use this minimal command:
 
 ```
-ppanggolin annotate --fasta organisms.fasta.list
+ppanggolin annotate --fasta genomes.fasta.list
 ```
 
 #### Use a different genetic code in my annotation step
@@ -48,7 +48,7 @@ to specify Infernal's RNA annotation model.
 
 ### Use annotation files for your pangenome
 
-You can provide annotation files in either gff3 files or .gbk/.gbff files, or a mix of them. They should be provided through as a list in a tab-separated file that follows the same format as described for the fasta files. You can check [this example input file](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/organisms.gbff.list).
+You can provide annotation files as gff3 files, .gbk/.gbff files, or a mix of them. They should be listed in a tab-separated file that follows the same format as described for the fasta files. You can check [this example input file](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/genomes.gbff.list).
 
 ```{note}
 Using your own annotations for your genomes is highly recommended, particularly if you already
@@ -58,7 +58,7 @@ have functional annotations, as they can be added to the pangenome.
 You can provide them using the following command: 
 
 ```
-ppanggolin annotate --anno organisms.gbff.list
+ppanggolin annotate --anno genomes.gbff.list
 ```
 
 #### How to deal with annotation files without sequences
@@ -67,7 +67,7 @@ If your annotation files do not contain the genome sequence,
 you can use both options simultaneously to obtain the gene annotations and gene sequences, as follows: 
 
 ```
-ppanggolin annotate --anno organisms.gbff.list --fasta organisms.fasta.list
+ppanggolin annotate --anno genomes.gbff.list --fasta genomes.fasta.list
 ```
 
 #### Take the pseudogenes into account for pangenome analyses

diff --git a/docs/user/PangenomeAnalyses/pangenomeGraphOut.md b/docs/user/PangenomeAnalyses/pangenomeGraphOut.md
@@ -12,7 +12,7 @@ Using Gephi, the layout can be tuned as illustrated below:
 We advise the Gephi "Force Atlas 2" algorithm to compute the graph layout with "Stronger Gravity: on" and "scaling: 4000" but don't hesitate to tinker with the layout parameters.
 
 In the _light.gexf file : 
-The nodes will contain the number of genes belonging to the gene family, the most common gene name (if you provided annotations), the most common product name (if you provided annotations in your GFF or GBFF input files), the partitions it belongs to, its average and median size in nucleotides, and the number of organisms that have this gene family. If spots or modules are computed, it also indicates if a node belongs to them. Finally, this file also outputs the imported metadata regarding each gene family.
+The nodes will contain the number of genes belonging to the gene family, the most common gene name (if you provided annotations), the most common product name (if you provided annotations in your GFF or GBFF input files), the partitions it belongs to, its average and median size in nucleotides, and the number of genomes that have this gene family. If spots or modules are computed, it also indicates if a node belongs to them. Finally, this file also outputs the imported metadata regarding each gene family.
 
 The edges contain the number of times they are present in the pangenome.
 

diff --git a/docs/user/PangenomeAnalyses/pangenomeWorkflow.md b/docs/user/PangenomeAnalyses/pangenomeWorkflow.md
@@ -45,20 +45,20 @@ To use this command, you need to provide a tab-separated list of either annotati
 
 You can use the workflow with annotation files as such: 
 ```
-ppanggolin workflow --anno organism.gbff.list
+ppanggolin workflow --anno genomes.gbff.list
 ```
 
 For fasta files, you have to change for: 
 ```
-ppanggolin workflow --fasta organism.fasta.list
+ppanggolin workflow --fasta genomes.fasta.list
 ```
 
 Moreover, as detailed [in the section about providing your gene families](./pangenomeAnalyses.md#read-clustering), 
 if you wish to use different gene clustering methods than those provided by PPanGGOLiN,
 it is also possible to provide your own clustering results with the workflow command as such:
 
 ```
-ppanggolin workflow --anno organism.gbff.list --clusters clusters.tsv
+ppanggolin workflow --anno genomes.gbff.list --clusters clusters.tsv
 ```
 
 All the workflow parameters are obtained from the commands explained below, except for the `--no_flat_files` option, which solely pertains to it. This option prevents the automatic generation of the output files listed and described [in the pangenome output section](./pangenomeAnalyses.md#pangenome-outputs).

diff --git a/docs/user/QuickUsage/quickWorkflow.md b/docs/user/QuickUsage/quickWorkflow.md
@@ -63,24 +63,24 @@ The minimal subcommand only need your own annotations files (using `.gff` or `.g
 as long as they include the genomic dna sequences, such as the ones provided by Prokka or Bakta.
 
 ```bash
-ppanggolin all --anno organism.gbff.list
+ppanggolin all --anno genomes.gbff.list
 ```
 
 It uses parameters that we found to be generally the best when working with species pangenomes.
 
-The file **organism.gbff.list** is a tab-separated file with the following organisation :
+The file **genomes.gbff.list** is a tab-separated file with the following organisation :
 
 1. The first column contains a unique genome name
 2. The second column the path to the associated annotation file
 3. Each line represents a genome
 
-An example with 50 _Chlamydia trachomatis_ genomes can be found in the [testingDataset](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/organisms.gbff.list) directory.
+An example with 50 _Chlamydia trachomatis_ genomes can be found in the [testingDataset](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/genomes.gbff.list) directory.
 
 [//]: # (### PPanGGOLiN: Pangenome analyses from list of fasta files)
 You can also give PPanGGOLiN `.fasta` files, such as:
 
 ```
-ppanggolin all --fasta organism.fasta.list
+ppanggolin all --fasta genomes.fasta.list
 ```
 
 Again you must use a tab-separated file but this time with the following organisation:
@@ -90,7 +90,7 @@ Again you must use a tab-separated file but this time with the following organis
 3. Circular contig identifiers are indicated in the following columns
 4. Each line represents a genome
 
-Same, an example can be found in the [testingDataset](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/organisms.fasta.list) directory.
+Similarly, an example can be found in the [testingDataset](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/genomes.fasta.list) directory.
 
 ```{tip}
 Downloading genomes from NCBI refseq or genbank for a species of interest can be easily accomplished using CLI tools like [ncbi-genome-download](https://github.com/kblin/ncbi-genome-download) or the [genome updater](https://github.com/pirovc/genome_updater) script.
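
The list-file layout described in the hunk above (a unique genome name, the path to its file, then any circular contig identifiers, all tab-separated with one genome per line) can be produced programmatically. The sketch below is only an illustration under those stated rules: `write_genome_list`, the genome names, the file paths, and the contig identifier are hypothetical and are not part of PPanGGOLiN.

```python
import csv
from pathlib import Path


def write_genome_list(rows, out_path):
    """Write one tab-separated line per genome: name, file path, circular contig IDs."""
    names = [name for name, _, _ in rows]
    # The first column must hold a unique genome name.
    if len(names) != len(set(names)):
        raise ValueError("genome names must be unique")
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for name, fasta_path, circular_contigs in rows:
            # Circular contig identifiers are optional trailing columns.
            writer.writerow([name, fasta_path, *circular_contigs])


# Hypothetical genomes and paths, for illustration only.
rows = [
    ("genomeA", "fasta/genomeA.fasta", ["contig_1"]),
    ("genomeB", "fasta/genomeB.fasta", []),
]
write_genome_list(rows, "genomes.fasta.list")
print(Path("genomes.fasta.list").read_text())
```

The resulting file could then be passed to the `--fasta` option shown in the commands above.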

diff --git a/docs/user/RGP/rgpPrediction.md b/docs/user/RGP/rgpPrediction.md
@@ -59,12 +59,12 @@ graph LR
 
 You can use the `panrgp` with annotation (gff3 or gbff) files with `--anno` option, as such: 
 ```bash
-ppanggolin panrgp --anno organism.gbff.list
+ppanggolin panrgp --anno genomes.gbff.list
 ```
 
 For fasta files, you need to use the alternative `--fasta` option, as such:
 ```bash
-ppanggolin panrgp --fasta organism.fasta.list
+ppanggolin panrgp --fasta genomes.fasta.list
 ```
 
 Just like [workflow](../PangenomeAnalyses/pangenomeAnalyses.md#workflow), this command will deal with the [annotation](../PangenomeAnalyses/pangenomeAnalyses.md#annotation), [clustering](../PangenomeAnalyses/pangenomeAnalyses.md#compute-pangenome-gene-families), [graph](../PangenomeAnalyses/pangenomeAnalyses.md#graph) and [partition](../PangenomeAnalyses/pangenomeAnalyses.md#partition) commands by itself.

diff --git a/docs/user/practicalInformation.md b/docs/user/practicalInformation.md
@@ -85,12 +85,12 @@ ppanggolin utils --default_config panrgp
 
 ```yaml
 input_parameters:
-    # A tab-separated file listing the organism names, and the fasta filepath of its
-    # genomic sequence(s) (the fastas can be compressed with gzip). One line per organism.
+    # A tab-separated file listing the genome names, and the fasta filepath of its
+    # genomic sequence(s) (the fastas can be compressed with gzip). One line per genome.
   # fasta: <fasta file>
-    # A tab-separated file listing the organism names, and the gff/gbff filepath of
+    # A tab-separated file listing the genome names, and the gff/gbff filepath of
     # its annotations (the files can be compressed with gzip). One line
-    # per organism. If this is provided, those annotations will be used.
+    # per genome. If this is provided, those annotations will be used.
   # anno: <anno file>
 
 general_parameters:

diff --git a/docs/user/writeFasta.md b/docs/user/writeFasta.md
@@ -0,0 +1,76 @@
+
+# Write pangenome sequences
+
+The `fasta` command can be used to write sequences of the pangenome or specific parts of the pangenome in FASTA format. 
+
+Most options require a partition.
+
+Available partitions are:
+* `all` for the entire pangenome.
+* `Persistent` for persistent genes or families
+* `Shell` for shell genes or families
+* `Cloud` for cloud genes or families
+* `rgp` for genes or families found in RGPs
+* `core` for core genes or families
+* `softcore` for softcore genes or families
+
+When using the `softcore` filter, the `--soft_core` option can be used to modify the threshold used to determine what is part of the softcore. It is set to 0.95 by default.
+
+## Genes
+
+This option writes the nucleotide CDS sequences. For example, to write all of the genes of the pangenome:
+
+```bash
+ppanggolin fasta -p pangenome.h5 --output MY_GENES --genes all
+```
+
+Or to write only the persistent genes:
+
+```bash
+ppanggolin fasta -p pangenome.h5 --output MY_GENES --genes persistent
+```
+
+
+## Protein families
+
+This option can be used to write the protein sequences of the representative sequences for each family. It can be used as such for all families:
+
+```bash
+ppanggolin fasta -p pangenome.h5 --output MY_PROT --prot_families all
+```
+
+or for all of the shell families for example:
+
+```bash
+ppanggolin fasta -p pangenome.h5 --output MY_PROT --prot_families shell
+```
+
+
+## Gene families
+
+This option can be used to write the gene sequences of the representative sequences for each family. It can be used as such:
+
+```bash
+ppanggolin fasta -p pangenome.h5 --output MY_GENES_FAMILIES --gene_families all
+```
+
+or for the cloud families for example:
+
+```bash
+ppanggolin fasta -p pangenome.h5 --output MY_GENES_FAMILIES --gene_families cloud
+```
+
+## Regions
+
+This option can be used to write the nucleotide sequences of the detected RGPs.
+It requires the fasta sequences that were originally provided when you computed your pangenome.
+
+This command has only two filters:
+* all, for all regions
+* complete, for only the 'complete' regions which are not on a contig border
+
+It can be used as such:
+
+```bash
+ppanggolin fasta -p pangenome.h5 --output MY_REGIONS --regions all --fasta genomes.fasta.list
+```
diff --git a/docs/user/writeGenomes.md b/docs/user/writeGenomes.md
@@ -2,7 +2,7 @@
 
 The `write_genomes` command creates 'flat' files representing genomes with their pangenome annotations.
 
-To generate output for specific genomes, use the `--organisms` argument. This argument accepts a list of organism names, either directly entered in the command line (comma-separated) or referenced from a file where each line contains a single organism name.
+To generate output for specific genomes, use the `--genomes` argument. This argument accepts a list of genome names, either directly entered in the command line (comma-separated) or referenced from a file where each line contains a single genome name.
 
 
 ### Genes table with pangenome annotations
@@ -20,7 +20,7 @@ The following table outlines the columns present in the generated files:
 | stop                 | Stop position of the gene                                                  |
 | strand               | Gene location strand                                      |
 | family               | ID of the gene's associated family in the pangenome             |
-| nb_copy_in_org       | Number of copies of a family present in the organism; 1 indicates no close paralogs |
+| nb_copy_in_org       | Number of copies of a family present in the genome; 1 indicates no close paralogs |
 | partition            | Gene family partition in the pangenome                  |
 | persistent_neighbors | Number of neighbors classified as 'persistent' in the pangenome graph        |
 | shell_neighbors      | Number of neighbors classified as 'shell' in the pangenome graph             |
@@ -137,9 +137,9 @@ PPanGGOLiN allows the incorporation of fasta sequences into GFF files and prokse
 
 Since PPanGGOLiN does not retain genomic sequences, it is necessary to provide the original genomic files used to construct the pangenome through either the `--anno` or `--fasta` argument. These arguments mirror those used in workflow commands (`workflow`, `all`, `panrgp`, `panmodule`) and the `annotate` command.
 
-- `--anno`: This option requires a tab-separated file containing organism names and the corresponding GFF/GBFF file paths of their annotations. If `--anno` is utilized, GFF files should include fasta sequences.
+- `--anno`: This option requires a tab-separated file containing genome names and the corresponding GFF/GBFF file paths of their annotations. If `--anno` is utilized, GFF files should include fasta sequences.
 
-- `--fasta`: Use this option with a tab-separated file that lists organism names alongside the filepaths of their genomic sequences in fasta format.
+- `--fasta`: Use this option with a tab-separated file that lists genome names alongside the filepaths of their genomic sequences in fasta format.
 
 
 ### Incorporating Metadata into Tables, GFF, and Proksee Files

diff --git a/ppanggolin/formats/writeFlatGenomes.py b/ppanggolin/formats/writeFlatGenomes.py
@@ -442,7 +442,7 @@ def mp_write_genomes_file(organism: Organism, output: Path, organisms_file: Path
 
         # Write ProkSee data for the organism
         write_proksee_organism(organism, output_file, features=['all'], genome_sequences=genome_sequences,
-                               **{arg: kwargs[arg] for arg in kwargs.keys() & {'module_to_colors', 'compress'}})
+                               **{arg: kwargs[arg] for arg in kwargs.keys() & {'module_to_colors', 'compress', 'metadata_sep'}})
 
     if gff:
         gff_outdir = output / "gff"
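
The hunk above adds `metadata_sep` to the set of keyword names that are forwarded to `write_proksee_organism`. The forwarding idiom it relies on, intersecting `kwargs.keys()` with a set of allowed names, can be seen in isolation in the minimal sketch below. `write_proksee_stub` and `forward_selected` are hypothetical stand-ins written for this illustration, not PPanGGOLiN functions, and the default values are assumptions.

```python
def write_proksee_stub(name, compress=False, metadata_sep="|", module_to_colors=None):
    """Hypothetical stand-in for a writer that accepts optional keyword arguments."""
    return {"name": name, "compress": compress,
            "metadata_sep": metadata_sep, "module_to_colors": module_to_colors}


def forward_selected(name, **kwargs):
    """Forward only the keyword arguments that the writer understands."""
    allowed = {"module_to_colors", "compress", "metadata_sep"}
    # dict.keys() is a set-like view, so `&` keeps only the keys present in both.
    selected = {arg: kwargs[arg] for arg in kwargs.keys() & allowed}
    return write_proksee_stub(name, **selected)


# `table` is not in the allowed set, so it is silently dropped.
result = forward_selected("genomeA", metadata_sep="§", compress=True, table=True)
print(result)
```

With this idiom, a key that is missing from the allowed set never reaches the callee, so the callee falls back to its own default. That is consistent with the change above, where `metadata_sep` had to be added to the set before a custom separator could reach the Proksee writer.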
@@ -519,7 +519,9 @@ def write_flat_genome_files(pangenome: Pangenome, output: Path, table: bool = Fa
     organism2args = defaultdict(lambda: {"output": output, "table": table, "gff": gff,
                                          "proksee": proksee, "compress": compress})
     for organism in organisms_list:
-        organism_args = {"genome_file": org_dict[organism.name]['path'] if org_dict else None}
+        organism_args = {"genome_file": org_dict[organism.name]['path'] if org_dict else None,
+                         "metadata_sep":  metadata_sep}
+
         if proksee:
             organism_args["module_to_colors"] = {module: module_to_colors[module] for module in organism.modules}
 
@@ -531,7 +533,7 @@ def write_flat_genome_files(pangenome: Pangenome, output: Path, table: bool = Fa
                                                        "CDS": "external"}
             else:
                 organism_args["annotation_sources"] = {}
-            organism_args["metadata_sep"] = metadata_sep
+
         if table:
             organism_args.update({"need_regions": need_dict['need_rgp'],
                                   "need_modules": need_dict['need_modules'],