Commit

Merge pull request #201 from labgem/master
synchornise dev with master
JeanMainguy authored Mar 22, 2024
2 parents 4a48b39 + f3ba6a1 commit aba6caa
Showing 7 changed files with 25 additions and 25 deletions.
12 changes: 6 additions & 6 deletions docs/user/PangenomeAnalyses/pangenomeAnnotation.md
@@ -8,7 +8,7 @@ If you do so, the provided genomes will be annotated using the following tools:
- [ARAGORN](http://www.ansikte.se/ARAGORN/) to annotate tRNAs
- [Infernal](http://eddylab.org/infernal/) coupled with HMM of the bacterial and archaeal rRNAs downloaded from [RFAM](https://rfam.xfam.org/) to annotate rRNAs.

To proceed with this stage of the pipeline, you need to create an **organisms.fasta.list** file.
To proceed with this stage of the pipeline, you need to create an **genomes.fasta.list** file.
This file should be tab-separated with each line depicting an individual genome and
its pertinent information with the following organization (only the first two columns are mandatory):

@@ -17,12 +17,12 @@ its pertinent information with the following organization (only the first two co
- The following columns contain Contig identifiers present in the associated FASTA file that should be analyzed as being circular.
For the 'circular contig identifiers,' if you do not have access to this information, you can safely ignore this part as it does not have a big impact on the resulting pangenome.

You can check [this example input file](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/organisms.fasta.list).
You can check [this example input file](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/genomes.fasta.list).

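The list file described above can be generated rather than typed by hand. The following is a minimal sketch, not part of PPanGGOLiN: the function name `build_genome_list`, the `*.fasta` glob pattern, and the use of the file stem as the genome name are all assumptions made for illustration. It writes one tab-separated line per genome: name, FASTA path, then any circular contig identifiers.

```python
from pathlib import Path


def build_genome_list(fasta_dir, list_path, circular=None):
    """Write a tab-separated genome list: name, FASTA path, circular contig IDs.

    `circular` is an optional dict mapping a genome name to its circular
    contig identifiers (hypothetical helper input, not a PPanGGOLiN option).
    """
    circular = circular or {}
    fasta_dir = Path(fasta_dir)
    lines = []
    for fasta in sorted(fasta_dir.glob("*.fasta")):
        name = fasta.stem  # assumption: the file stem is a unique genome name
        fields = [name, str(fasta)] + list(circular.get(name, []))
        lines.append("\t".join(fields))
    # One genome per line, each line terminated by a newline.
    Path(list_path).write_text("".join(line + "\n" for line in lines))
    return lines
```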
To run the annotation part, you can use this minimal command:

```
ppanggolin annotate --fasta organisms.fasta.list
ppanggolin annotate --fasta genomes.fasta.list
```

#### Use a different genetic code in my annotation step
@@ -48,7 +48,7 @@ to specify Infernal's RNA annotation model.

### Use annotation files for your pangenome

You can provide annotation files in either gff3 files or .gbk/.gbff files, or a mix of them. They should be provided through as a list in a tab-separated file that follows the same format as described for the fasta files. You can check [this example input file](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/organisms.gbff.list).
You can provide annotation files in either gff3 files or .gbk/.gbff files, or a mix of them. They should be provided through as a list in a tab-separated file that follows the same format as described for the fasta files. You can check [this example input file](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/genomes.gbff.list).

```{note}
Use your own annotation for your genome is highly recommended, particularly if you already
@@ -58,7 +58,7 @@ have functional annotations, as they can be added to the pangenome.
You can provide them using the following command:

```
ppanggolin annotate --anno organisms.gbff.list
ppanggolin annotate --anno genomes.gbff.list
```

#### How to deal with annotation files without sequences
@@ -67,7 +67,7 @@ If your annotation files do not contain the genome sequence,
you can use both options simultaneously to obtain the gene annotations and gene sequences, as follows:

```
ppanggolin annotate --anno organisms.gbff.list --fasta organisms.fasta.list
ppanggolin annotate --anno genomes.gbff.list --fasta genomes.fasta.list
```

#### Take the pseudogenes into account for pangenome analyses
2 changes: 1 addition & 1 deletion docs/user/PangenomeAnalyses/pangenomeGraphOut.md
@@ -12,7 +12,7 @@ Using Gephi, the layout can be tuned as illustrated below:
We advise the Gephi "Force Atlas 2" algorithm to compute the graph layout with "Stronger Gravity: on" and "scaling: 4000" but don't hesitate to tinker with the layout parameters.

In the _light.gexf file :
The nodes will contain the number of genes belonging to the gene family, the most common gene name (if you provided annotations), the most common product name (if you provided annotations in your GFF or GBFF input files), the partitions it belongs to, its average and median size in nucleotides, and the number of organisms that have this gene family. If spots or modules are computed, it also indicates if a node belongs to them. Finally, this file also outputs the imported metadata regarding each gene family.
The nodes will contain the number of genes belonging to the gene family, the most common gene name (if you provided annotations), the most common product name (if you provided annotations in your GFF or GBFF input files), the partitions it belongs to, its average and median size in nucleotides, and the number of genomes that have this gene family. If spots or modules are computed, it also indicates if a node belongs to them. Finally, this file also outputs the imported metadata regarding each gene family.

The edges contain the number of times they are present in the pangenome.

6 changes: 3 additions & 3 deletions docs/user/PangenomeAnalyses/pangenomeWorkflow.md
@@ -45,20 +45,20 @@ To use this command, you need to provide a tab-separated list of either annotati

You can use the workflow with annotation files as such:
```
ppanggolin workflow --anno organism.gbff.list
ppanggolin workflow --anno genomes.gbff.list
```

For fasta files, you have to change for:
```
ppanggolin workflow --fasta organism.fasta.list
ppanggolin workflow --fasta genomes.fasta.list
```

Moreover, as detailed [in the section about providing your gene families](./pangenomeAnalyses.md#read-clustering),
if you wish to use different gene clustering methods than those provided by PPanGGOLiN,
it is also possible to provide your own clustering results with the workflow command as such:

```
ppanggolin workflow --anno organism.gbff.list --clusters clusters.tsv
ppanggolin workflow --anno genomes.gbff.list --clusters clusters.tsv
```

All the workflow parameters are obtained from the commands explained below, except for the `--no_flat_files` option, which solely pertains to it. This option prevents the automatic generation of the output files listed and described [in the pangenome output section](./pangenomeAnalyses.md#pangenome-outputs).
10 changes: 5 additions & 5 deletions docs/user/QuickUsage/quickWorkflow.md
@@ -63,24 +63,24 @@ The minimal subcommand only need your own annotations files (using `.gff` or `.g
as long as they include the genomic dna sequences, such as the ones provided by Prokka or Bakta.

```bash
ppanggolin all --anno organism.gbff.list
ppanggolin all --anno genomes.gbff.list
```

It uses parameters that we found to be generally the best when working with species pangenomes.

The file **organism.gbff.list** is a tab-separated file with the following organisation :
The file **genomes.gbff.list** is a tab-separated file with the following organisation :

1. The first column contains a unique genome name
2. The second column the path to the associated annotation file
3. Each line represents a genome

An example with 50 _Chlamydia trachomatis_ genomes can be found in the [testingDataset](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/organisms.gbff.list) directory.
An example with 50 _Chlamydia trachomatis_ genomes can be found in the [testingDataset](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/genomes.gbff.list) directory.

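Before launching a run, it can help to check that a list file follows the layout listed above. The sketch below is an illustration only, not PPanGGOLiN's own parser: `read_genome_list` is a hypothetical helper that checks each non-blank line has at least a genome name and a file path, and that genome names are unique. Any further columns are kept as-is, which also covers list files that carry circular contig identifiers.

```python
def read_genome_list(list_path):
    """Parse a tab-separated genome list into {name: (path, extra_columns)}.

    Raises ValueError on lines with fewer than two columns or on a
    duplicated genome name. Blank lines are skipped.
    """
    genomes = {}
    with open(list_path) as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) < 2:
                raise ValueError(f"line {line_no}: expected a name and a path")
            name, path, *extra = fields
            if name in genomes:
                raise ValueError(f"line {line_no}: duplicate genome name {name!r}")
            genomes[name] = (path, extra)
    return genomes
```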
[//]: # (### PPanGGOLiN: Pangenome analyses from list of fasta files)
You can also give PPanGGOLiN `.fasta` files, such as:

```
ppanggolin all --fasta organism.fasta.list
ppanggolin all --fasta genomes.fasta.list
```

Again you must use a tab-separated file but this time with the following organisation:
@@ -90,7 +90,7 @@ Again you must use a tab-separated file but this time with the following organis
3. Circular contig identifiers are indicated in the following columns
4. Each line represents a genome

Same, an example can be found in the [testingDataset](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/organisms.fasta.list) directory.
Same, an example can be found in the [testingDataset](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/genomes.fasta.list) directory.

```{tip}
Downloading genomes from NCBI refseq or genbank for a species of interest can be easily accomplished using CLI tools like [ncbi-genome-download](https://github.com/kblin/ncbi-genome-download) or the [genome updater](https://github.com/pirovc/genome_updater) script.
4 changes: 2 additions & 2 deletions docs/user/RGP/rgpPrediction.md
@@ -59,12 +59,12 @@ graph LR

You can use the `panrgp` with annotation (gff3 or gbff) files with `--anno` option, as such:
```bash
ppanggolin panrgp --anno organism.gbff.list
ppanggolin panrgp --anno genomes.gbff.list
```

For fasta files, you need to use the alternative `--fasta` option, as such:
```bash
ppanggolin panrgp --fasta organism.fasta.list
ppanggolin panrgp --fasta genomes.fasta.list
```

Just like [workflow](../PangenomeAnalyses/pangenomeAnalyses.md#workflow), this command will deal with the [annotation](../PangenomeAnalyses/pangenomeAnalyses.md#annotation), [clustering](../PangenomeAnalyses/pangenomeAnalyses.md#compute-pangenome-gene-families), [graph](../PangenomeAnalyses/pangenomeAnalyses.md#graph) and [partition](../PangenomeAnalyses/pangenomeAnalyses.md#partition) commands by itself.
8 changes: 4 additions & 4 deletions docs/user/practicalInformation.md
@@ -85,12 +85,12 @@ ppanggolin utils --default_config panrgp

```yaml
input_parameters:
# A tab-separated file listing the organism names, and the fasta filepath of its
# genomic sequence(s) (the fastas can be compressed with gzip). One line per organism.
# A tab-separated file listing the genome names, and the fasta filepath of its
# genomic sequence(s) (the fastas can be compressed with gzip). One line per genome.
# fasta: <fasta file>
# A tab-separated file listing the organism names, and the gff/gbff filepath of
# A tab-separated file listing the genome names, and the gff/gbff filepath of
# its annotations (the files can be compressed with gzip). One line
# per organism. If this is provided, those annotations will be used.
# per genome. If this is provided, those annotations will be used.
# anno: <anno file>

general_parameters:
8 changes: 4 additions & 4 deletions docs/user/writeGenomes.md
@@ -2,7 +2,7 @@

The `write_genomes` command creates 'flat' files representing genomes with their pangenome annotations.

To generate output for specific genomes, use the `--organisms` argument. This argument accepts a list of organism names, either directly entered in the command line (comma-separated) or referenced from a file where each line contains a single organism name.
To generate output for specific genomes, use the `--genomes` argument. This argument accepts a list of genome names, either directly entered in the command line (comma-separated) or referenced from a file where each line contains a single genome name.

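Both input forms that `--genomes` accepts (a comma-separated list, or a file with one genome name per line) can be derived from the first column of an existing genome list file. This is a minimal sketch under that assumption. The helper names `genome_names` and `write_names_file` are hypothetical and are not part of PPanGGOLiN.

```python
def genome_names(list_path):
    """Return the genome names (first tab-separated column) of a list file."""
    names = []
    with open(list_path) as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                names.append(line.split("\t")[0].strip())
    return names


def write_names_file(names, out_path):
    """Write one genome name per line, the file form described above."""
    with open(out_path, "w") as out:
        out.writelines(f"{name}\n" for name in names)


def comma_separated(names):
    """Build the comma-separated form for direct use on the command line."""
    return ",".join(names)
```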

### Genes table with pangenome annotations
@@ -20,7 +20,7 @@ The following table outlines the columns present in the generated files:
| stop | Stop position of the gene |
| strand | Gene location strand |
| family | ID of the gene's associated family in the pangenome |
| nb_copy_in_org | Number of copies of a family present in the organism; 1 indicates no close paralogs |
| nb_copy_in_org | Number of copies of a family present in the genome; 1 indicates no close paralogs |
| partition | Gene family partition in the pangenome |
| persistent_neighbors | Number of neighbors classified as 'persistent' in the pangenome graph |
| shell_neighbors | Number of neighbors classified as 'shell' in the pangenome graph |
@@ -137,9 +137,9 @@ PPanGGOLiN allows the incorporation of fasta sequences into GFF files and prokse

Since PPanGGOLiN does not retain genomic sequences, it is necessary to provide the original genomic files used to construct the pangenome through either the `--anno` or `--fasta` argument. These arguments mirror those used in workflow commands (`workflow`, `all`, `panrgp`, `panmodule`) and the `annotate` command.

- `--anno`: This option requires a tab-separated file containing organism names and the corresponding GFF/GBFF file paths of their annotations. If `--anno` is utilized, GFF files should include fasta sequences.
- `--anno`: This option requires a tab-separated file containing genome names and the corresponding GFF/GBFF file paths of their annotations. If `--anno` is utilized, GFF files should include fasta sequences.

- `--fasta`: Use this option with a tab-separated file that lists organism names alongside the filepaths of their genomic sequences in fasta format.
- `--fasta`: Use this option with a tab-separated file that lists genome names alongside the filepaths of their genomic sequences in fasta format.


### Incorporating Metadata into Tables, GFF, and Proksee Files
