Merge branch 'dev' into devForPanorama
jpjarnoux committed Mar 25, 2024
2 parents 501db9a + 67f03e4 commit 86ed54c
Showing 14 changed files with 191 additions and 97 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/main.yml
@@ -177,7 +177,7 @@ jobs:
# Default separator is a pipe but a pipe is found in a value of metadata db1. That is why we use another separator here.
ppanggolin write_genomes -p mybasicpangenome/pangenome.h5 --output mybasicpangenome/genomes_outputs \
--genomes genome_names.fasta.head.list \
-f --gff --add_metadata --table --metadata_sep §
-f --gff --add_metadata --table --metadata_sep § --proksee
# The pipe separator is found in metadata source db1. If this source is not requested, writing with the pipe separator works fine.
ppanggolin write_genomes -p mybasicpangenome/pangenome.h5 --output mybasicpangenome/genomes_outputs_with_metadata -f --gff --proksee --table --add_metadata --metadata_sources db2 db3 db4
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
2.0.4
2.0.5
1 change: 1 addition & 0 deletions docs/index.md
@@ -73,6 +73,7 @@ user/PangenomeAnalyses/pangenomeAnalyses
user/RGP/rgpAnalyses
user/Modules/moduleAnalyses
user/writeGenomes
user/writeFasta
user/align
user/projection
user/genomicContext
12 changes: 6 additions & 6 deletions docs/user/PangenomeAnalyses/pangenomeAnnotation.md
@@ -8,7 +8,7 @@ If you do so, the provided genomes will be annotated using the following tools:
- [ARAGORN](http://www.ansikte.se/ARAGORN/) to annotate tRNAs
- [Infernal](http://eddylab.org/infernal/) coupled with HMM of the bacterial and archaeal rRNAs downloaded from [RFAM](https://rfam.xfam.org/) to annotate rRNAs.

To proceed with this stage of the pipeline, you need to create an **organisms.fasta.list** file.
To proceed with this stage of the pipeline, you need to create a **genomes.fasta.list** file.
This file should be tab-separated with each line depicting an individual genome and
its pertinent information with the following organization (only the first two columns are mandatory):

@@ -17,12 +17,12 @@ its pertinent information with the following organization (only the first two co
- The following columns contain Contig identifiers present in the associated FASTA file that should be analyzed as being circular.
For the 'circular contig identifiers,' if you do not have access to this information, you can safely ignore this part as it does not have a big impact on the resulting pangenome.
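
The list file described above can be generated rather than written by hand. The following is a minimal sketch, not part of PPanGGOLiN: the helper name `write_genome_list`, the accepted suffixes, and the choice of the file stem as genome name are all assumptions made for illustration. It writes only the two mandatory columns (genome name and FASTA path) and leaves out the optional circular contig columns.

```python
from pathlib import Path


def write_genome_list(fasta_dir, list_path, suffixes=(".fasta", ".fna", ".fa")):
    """Write one 'name<TAB>path' line per FASTA file found in fasta_dir.

    Hypothetical helper: the genome name is taken from the file name
    without its extension, which is an assumption of this sketch.
    """
    rows = []
    for fasta in sorted(Path(fasta_dir).iterdir()):
        if fasta.suffix in suffixes:
            # Column 1: unique genome name, column 2: path to the FASTA file.
            rows.append(f"{fasta.stem}\t{fasta.resolve()}")
    Path(list_path).write_text("\n".join(rows) + "\n")
    return rows
```

Calling `write_genome_list("my_fastas", "genomes.fasta.list")` would then give a file that can be passed to `--fasta`. Gzip-compressed files are not matched by the default suffixes in this sketch.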

You can check [this example input file](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/organisms.fasta.list).
You can check [this example input file](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/genomes.fasta.list).

To run the annotation part, you can use this minimal command:

```
ppanggolin annotate --fasta organisms.fasta.list
ppanggolin annotate --fasta genomes.fasta.list
```

#### Use a different genetic code in my annotation step
@@ -48,7 +48,7 @@ to specify Infernal's RNA annotation model.

### Use annotation files for your pangenome

You can provide annotation files in either gff3 files or .gbk/.gbff files, or a mix of them. They should be provided through as a list in a tab-separated file that follows the same format as described for the fasta files. You can check [this example input file](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/organisms.gbff.list).
You can provide annotation files as gff3 files, .gbk/.gbff files, or a mix of both. They should be listed in a tab-separated file that follows the same format as described for the fasta files. You can check [this example input file](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/genomes.gbff.list).

```{note}
Using your own annotation for your genome is highly recommended, particularly if you already
@@ -58,7 +58,7 @@ have functional annotations, as they can be added to the pangenome.
You can provide them using the following command:

```
ppanggolin annotate --anno organisms.gbff.list
ppanggolin annotate --anno genomes.gbff.list
```

#### How to deal with annotation files without sequences
@@ -67,7 +67,7 @@ If your annotation files do not contain the genome sequence,
you can use both options simultaneously to obtain the gene annotations and gene sequences, as follows:

```
ppanggolin annotate --anno organisms.gbff.list --fasta organisms.fasta.list
ppanggolin annotate --anno genomes.gbff.list --fasta genomes.fasta.list
```

#### Take the pseudogenes into account for pangenome analyses
2 changes: 1 addition & 1 deletion docs/user/PangenomeAnalyses/pangenomeGraphOut.md
@@ -12,7 +12,7 @@ Using Gephi, the layout can be tuned as illustrated below:
We advise the Gephi "Force Atlas 2" algorithm to compute the graph layout with "Stronger Gravity: on" and "scaling: 4000" but don't hesitate to tinker with the layout parameters.

In the _light.gexf file :
The nodes will contain the number of genes belonging to the gene family, the most common gene name (if you provided annotations), the most common product name (if you provided annotations in your GFF or GBFF input files), the partitions it belongs to, its average and median size in nucleotides, and the number of organisms that have this gene family. If spots or modules are computed, it also indicates if a node belongs to them. Finally, this file also outputs the imported metadata regarding each gene family.
The nodes will contain the number of genes belonging to the gene family, the most common gene name (if you provided annotations), the most common product name (if you provided annotations in your GFF or GBFF input files), the partitions it belongs to, its average and median size in nucleotides, and the number of genomes that have this gene family. If spots or modules are computed, it also indicates if a node belongs to them. Finally, this file also outputs the imported metadata regarding each gene family.

The edges contain the number of times they are present in the pangenome.

6 changes: 3 additions & 3 deletions docs/user/PangenomeAnalyses/pangenomeWorkflow.md
@@ -45,20 +45,20 @@ To use this command, you need to provide a tab-separated list of either annotati

You can use the workflow with annotation files as such:
```
ppanggolin workflow --anno organism.gbff.list
ppanggolin workflow --anno genomes.gbff.list
```

For fasta files, you have to change for:
```
ppanggolin workflow --fasta organism.fasta.list
ppanggolin workflow --fasta genomes.fasta.list
```

Moreover, as detailed [in the section about providing your gene families](./pangenomeAnalyses.md#read-clustering),
if you wish to use different gene clustering methods than those provided by PPanGGOLiN,
it is also possible to provide your own clustering results with the workflow command as such:

```
ppanggolin workflow --anno organism.gbff.list --clusters clusters.tsv
ppanggolin workflow --anno genomes.gbff.list --clusters clusters.tsv
```

All the workflow parameters are obtained from the commands explained below, except for the `--no_flat_files` option, which solely pertains to it. This option prevents the automatic generation of the output files listed and described [in the pangenome output section](./pangenomeAnalyses.md#pangenome-outputs).
10 changes: 5 additions & 5 deletions docs/user/QuickUsage/quickWorkflow.md
@@ -63,24 +63,24 @@ The minimal subcommand only need your own annotations files (using `.gff` or `.g
as long as they include the genomic dna sequences, such as the ones provided by Prokka or Bakta.

```bash
ppanggolin all --anno organism.gbff.list
ppanggolin all --anno genomes.gbff.list
```

It uses parameters that we found to be generally the best when working with species pangenomes.

The file **organism.gbff.list** is a tab-separated file with the following organisation :
The file **genomes.gbff.list** is a tab-separated file with the following organisation :

1. The first column contains a unique genome name
2. The second column the path to the associated annotation file
3. Each line represents a genome

An example with 50 _Chlamydia trachomatis_ genomes can be found in the [testingDataset](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/organisms.gbff.list) directory.
An example with 50 _Chlamydia trachomatis_ genomes can be found in the [testingDataset](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/genomes.gbff.list) directory.

[//]: # (### PPanGGOLiN: Pangenome analyses from list of fasta files)
You can also give PPanGGOLiN `.fasta` files, such as:

```
ppanggolin all --fasta organism.fasta.list
ppanggolin all --fasta genomes.fasta.list
```

Again you must use a tab-separated file but this time with the following organisation:
@@ -90,7 +90,7 @@ Again you must use a tab-separated file but this time with the following organis
3. Circular contig identifiers are indicated in the following columns
4. Each line represents a genome
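
Before launching a run, the list file can be sanity-checked against the layout above. This is a hedged sketch only: `check_genome_list` is a hypothetical helper, not a PPanGGOLiN function, and it checks just what the text states (tab-separated lines, a genome name and a path in the first two columns, unique genome names). Any further columns are accepted as circular contig identifiers without inspection.

```python
def check_genome_list(lines):
    """Return a list of problems found in 'name<TAB>path[<TAB>contig ...]' lines.

    Hypothetical validator: it does not open the files or verify contig names.
    """
    problems = []
    seen = set()
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are ignored in this sketch
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0] or not fields[1]:
            problems.append(f"line {number}: expected at least a name and a path")
            continue
        name = fields[0]
        if name in seen:
            problems.append(f"line {number}: duplicate genome name {name!r}")
        seen.add(name)
    return problems
```

For instance, `check_genome_list(open("genomes.fasta.list"))` returns an empty list when every line has both mandatory columns and no genome name is repeated.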

Same, an example can be found in the [testingDataset](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/organisms.fasta.list) directory.
Same, an example can be found in the [testingDataset](https://github.com/labgem/PPanGGOLiN/blob/master/testingDataset/genomes.fasta.list) directory.

```{tip}
Downloading genomes from NCBI refseq or genbank for a species of interest can be easily accomplished using CLI tools like [ncbi-genome-download](https://github.com/kblin/ncbi-genome-download) or the [genome updater](https://github.com/pirovc/genome_updater) script.
4 changes: 2 additions & 2 deletions docs/user/RGP/rgpPrediction.md
@@ -59,12 +59,12 @@ graph LR

You can use the `panrgp` with annotation (gff3 or gbff) files with `--anno` option, as such:
```bash
ppanggolin panrgp --anno organism.gbff.list
ppanggolin panrgp --anno genomes.gbff.list
```

For fasta files, you need to use the alternative `--fasta` option, as such:
```bash
ppanggolin panrgp --fasta organism.fasta.list
ppanggolin panrgp --fasta genomes.fasta.list
```

Just like [workflow](../PangenomeAnalyses/pangenomeAnalyses.md#workflow), this command will deal with the [annotation](../PangenomeAnalyses/pangenomeAnalyses.md#annotation), [clustering](../PangenomeAnalyses/pangenomeAnalyses.md#compute-pangenome-gene-families), [graph](../PangenomeAnalyses/pangenomeAnalyses.md#graph) and [partition](../PangenomeAnalyses/pangenomeAnalyses.md#partition) commands by itself.
8 changes: 4 additions & 4 deletions docs/user/practicalInformation.md
@@ -85,12 +85,12 @@ ppanggolin utils --default_config panrgp

```yaml
input_parameters:
# A tab-separated file listing the organism names, and the fasta filepath of its
# genomic sequence(s) (the fastas can be compressed with gzip). One line per organism.
# A tab-separated file listing the genome names, and the fasta filepath of its
# genomic sequence(s) (the fastas can be compressed with gzip). One line per genome.
# fasta: <fasta file>
# A tab-separated file listing the organism names, and the gff/gbff filepath of
# A tab-separated file listing the genome names, and the gff/gbff filepath of
# its annotations (the files can be compressed with gzip). One line
# per organism. If this is provided, those annotations will be used.
# per genome. If this is provided, those annotations will be used.
# anno: <anno file>

general_parameters:
76 changes: 76 additions & 0 deletions docs/user/writeFasta.md
@@ -0,0 +1,76 @@

# Write pangenome sequences

The `fasta` command can be used to write sequences of the pangenome or specific parts of the pangenome in FASTA format.

Most options require a partition.

Available partitions are:
* `all` for the entire pangenome
* `persistent` for persistent genes or families
* `shell` for shell genes or families
* `cloud` for cloud genes or families
* `rgp` for genes or families found in RGPs
* `core` for core genes or families
* `softcore` for softcore genes or families

When using the `softcore` filter, the `--soft_core` option can be used to modify the threshold used to determine what is part of the softcore. It is set to 0.95 by default.
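
To make the effect of the threshold concrete, here is a small sketch. It assumes that a family counts as softcore when it is present in at least `soft_core` × (number of genomes) genomes; the text above only gives the default value of 0.95, so this rule, the function name `softcore_families`, and the input structure are assumptions for illustration and not PPanGGOLiN's implementation.

```python
def softcore_families(family_to_genomes, n_genomes, soft_core=0.95):
    """Return the sorted names of families present in enough genomes.

    Assumed rule (not confirmed by the source): a family is softcore when
    the number of distinct genomes carrying it is >= n_genomes * soft_core.
    """
    return sorted(
        family
        for family, genomes in family_to_genomes.items()
        if len(set(genomes)) >= n_genomes * soft_core
    )
```

Under this assumption, lowering the threshold can only enlarge the softcore set, since every family that passes a stricter threshold also passes a looser one.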

## Genes

This option writes the nucleotide CDS sequences. For example, to write all the genes of the pangenome:

```bash
ppanggolin fasta -p pangenome.h5 --output MY_GENES --genes all
```

Or to write only the persistent genes:

```bash
ppanggolin fasta -p pangenome.h5 --output MY_GENES --genes persistent
```


## Protein families

This option writes the protein sequence of the representative of each family. To write them for all families:

```bash
ppanggolin fasta -p pangenome.h5 --output MY_PROT --prot_families all
```

Or only for the shell families, for example:

```bash
ppanggolin fasta -p pangenome.h5 --output MY_PROT --prot_families shell
```


## Gene families

This option writes the nucleotide sequence of the representative of each family. To write them for all families:

```bash
ppanggolin fasta -p pangenome.h5 --output MY_GENES_FAMILIES --gene_families all
```

Or only for the cloud families, for example:

```bash
ppanggolin fasta -p pangenome.h5 --output MY_GENES_FAMILIES --gene_families cloud
```

## Regions

This option writes the nucleotide sequences of the detected RGPs.
It requires the fasta sequences that were originally provided to compute the pangenome.

This option has only two filters:
* `all` for all regions
* `complete` for only the 'complete' regions, which are not on a contig border

It can be used as such:

```bash
ppanggolin fasta -p pangenome.h5 --output MY_REGIONS --regions all --fasta genomes.fasta.list
```
8 changes: 4 additions & 4 deletions docs/user/writeGenomes.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

The `write_genomes` command creates 'flat' files representing genomes with their pangenome annotations.

To generate output for specific genomes, use the `--organisms` argument. This argument accepts a list of organism names, either directly entered in the command line (comma-separated) or referenced from a file where each line contains a single organism name.
To generate output for specific genomes, use the `--genomes` argument. This argument accepts a list of genome names, either directly entered in the command line (comma-separated) or referenced from a file where each line contains a single genome name.


### Genes table with pangenome annotations
@@ -20,7 +20,7 @@ The following table outlines the columns present in the generated files:
| stop | Stop position of the gene |
| strand | Gene location strand |
| family | ID of the gene's associated family in the pangenome |
| nb_copy_in_org | Number of copies of a family present in the organism; 1 indicates no close paralogs |
| nb_copy_in_org | Number of copies of a family present in the genome; 1 indicates no close paralogs |
| partition | Gene family partition in the pangenome |
| persistent_neighbors | Number of neighbors classified as 'persistent' in the pangenome graph |
| shell_neighbors | Number of neighbors classified as 'shell' in the pangenome graph |
@@ -137,9 +137,9 @@ PPanGGOLiN allows the incorporation of fasta sequences into GFF files and prokse

Since PPanGGOLiN does not retain genomic sequences, it is necessary to provide the original genomic files used to construct the pangenome through either the `--anno` or `--fasta` argument. These arguments mirror those used in workflow commands (`workflow`, `all`, `panrgp`, `panmodule`) and the `annotate` command.

- `--anno`: This option requires a tab-separated file containing organism names and the corresponding GFF/GBFF file paths of their annotations. If `--anno` is utilized, GFF files should include fasta sequences.
- `--anno`: This option requires a tab-separated file containing genome names and the corresponding GFF/GBFF file paths of their annotations. If `--anno` is utilized, GFF files should include fasta sequences.

- `--fasta`: Use this option with a tab-separated file that lists organism names alongside the filepaths of their genomic sequences in fasta format.
- `--fasta`: Use this option with a tab-separated file that lists genome names alongside the filepaths of their genomic sequences in fasta format.


### Incorporating Metadata into Tables, GFF, and Proksee Files
8 changes: 5 additions & 3 deletions ppanggolin/formats/writeFlatGenomes.py
@@ -442,7 +442,7 @@ def mp_write_genomes_file(organism: Organism, output: Path, organisms_file: Path

# Write ProkSee data for the organism
write_proksee_organism(organism, output_file, features=['all'], genome_sequences=genome_sequences,
**{arg: kwargs[arg] for arg in kwargs.keys() & {'module_to_colors', 'compress'}})
**{arg: kwargs[arg] for arg in kwargs.keys() & {'module_to_colors', 'compress', 'metadata_sep'}})

if gff:
gff_outdir = output / "gff"
@@ -519,7 +519,9 @@ def write_flat_genome_files(pangenome: Pangenome, output: Path, table: bool = Fa
organism2args = defaultdict(lambda: {"output": output, "table": table, "gff": gff,
"proksee": proksee, "compress": compress})
for organism in organisms_list:
organism_args = {"genome_file": org_dict[organism.name]['path'] if org_dict else None}
organism_args = {"genome_file": org_dict[organism.name]['path'] if org_dict else None,
"metadata_sep": metadata_sep}

if proksee:
organism_args["module_to_colors"] = {module: module_to_colors[module] for module in organism.modules}

@@ -531,7 +533,7 @@ def write_flat_genome_files(pangenome: Pangenome, output: Path, table: bool = Fa
"CDS": "external"}
else:
organism_args["annotation_sources"] = {}
organism_args["metadata_sep"] = metadata_sep

if table:
organism_args.update({"need_regions": need_dict['need_rgp'],
"need_modules": need_dict['need_modules'],
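
The change to the Proksee call in `mp_write_genomes_file` relies on filtering `kwargs` through an allow-list built with a keys-view intersection. The sketch below isolates that idiom to show why a key missing from the allow-list is silently dropped and the callee falls back to its default. `forward_allowed` and `render` are hypothetical stand-ins written for this illustration, not functions from PPanGGOLiN; the pipe default mirrors the workflow comment stating that the default separator is a pipe.

```python
def forward_allowed(kwargs, allowed):
    """Keep only the keyword arguments whose names are in the allow-list."""
    # dict.keys() returns a set-like view, so `&` yields the common names.
    return {arg: kwargs[arg] for arg in kwargs.keys() & allowed}


def render(module_to_colors=None, compress=False, metadata_sep="|"):
    """Hypothetical writer: returns the separator it would actually use."""
    return metadata_sep


kwargs = {"metadata_sep": "§", "compress": True, "gff": True}

# Allow-list without 'metadata_sep': the user's separator never reaches render().
before = render(**forward_allowed(kwargs, {"module_to_colors", "compress"}))

# Allow-list with 'metadata_sep': the separator is forwarded as requested.
after = render(**forward_allowed(kwargs, {"module_to_colors", "compress", "metadata_sep"}))
```

Here `before` is the default pipe while `after` is `§`. Because no error is raised when a key is filtered out, every new option has to be added to the allow-list explicitly.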