Skip to content

Commit

Permalink
Merge branch 'TranslateGene' into devForPanorama
Browse files Browse the repository at this point in the history
  • Loading branch information
jpjarnoux committed Mar 28, 2024
2 parents 70fc067 + 12186a5 commit 16d933e
Show file tree
Hide file tree
Showing 4 changed files with 221 additions and 92 deletions.
1 change: 1 addition & 0 deletions .github/workflows/main.yml
Original file line number Diff line number Diff line change
Expand Up @@ -77,6 +77,7 @@ jobs:
ppanggolin fasta -p stepbystep/pangenome.h5 --output stepbystep -f --prot_families module_0
ppanggolin fasta -p stepbystep/pangenome.h5 --output stepbystep -f --prot_families core
ppanggolin fasta -p stepbystep/pangenome.h5 --output stepbystep -f --gene_families module_0 --genes module_0
ppanggolin fasta -p stepbystep/pangenome.h5 --output stepbystep -f --genes_prot -c 1 --keep_tmp
ppanggolin draw -p stepbystep/pangenome.h5 --draw_spots --spots all -o stepbystep -f
ppanggolin metrics -p stepbystep/pangenome.h5 --genome_fluidity --no_print_info --recompute_metrics --log metrics.log
Expand Down
28 changes: 23 additions & 5 deletions docs/user/writeFasta.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,9 @@ When using the `softcore` filter, the `--soft_core` option can be used to modify

## Genes

This option can be used to write the nucleotide CDS sequences. It can be used as such, to write all of the genes of the pangenome for example:
### Nucleotide sequences

This option can be used to write the nucleotide CDS sequences. It can be used as such, to write all the genes of the pangenome for example:

```bash
ppanggolin fasta -p pangenome.h5 --output MY_GENES --genes all
Expand All @@ -30,23 +32,39 @@ Or to write only the persistent genes:
ppanggolin fasta -p pangenome.h5 --output MY_GENES --genes persistent
```

### Protein sequences

This option can be used to write the amino acid CDS sequences. It can be used as such, to write all the genes of the pangenome for example:

```bash
ppanggolin fasta -p pangenome.h5 --output MY_GENES --genes_prot all
```

Or to write only the cloud genes:

```bash
ppanggolin fasta -p pangenome.h5 --output MY_GENES --genes_prot cloud
```

To translate the genes sequences, PPanGGOLiN use [MMSeqs2](https://github.com/soedinglab/MMseqs2) `translatenucs` command. So for this option you can give multiple threads with `--threads`. You can also specify the translation table to use with `--translate_table`. Finally, you can keep the [MMSeqs2](https://github.com/soedinglab/MMseqs2) that are generated in the temporary directory (that you can also be specified with `--tmpdir`) by indicate the option `--keep_tmp`.

## Protein families
## Gene families

### Protein sequences

This option can be used to write the protein sequences of the representative sequences for each family. It can be used as such for all families:

```bash
ppanggolin fasta -p pangenome.h5 --output MY_PROT --prot_families all
```

or for all of the shell families for example:
or for all the shell families for example:

```bash
ppanggolin fasta -p pangenome.h5 --output MY_PROT --prot_families shell
```


## Gene families
### Nucleotide sequences

This option can be used to write the gene sequences of the representative sequences for each family. It can be used as such:

Expand Down
14 changes: 2 additions & 12 deletions ppanggolin/cluster/cluster.py
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@
from ppanggolin.utils import read_compressed_or_not, restricted_float
from ppanggolin.formats.writeBinaries import write_pangenome, erase_pangenome
from ppanggolin.formats.readBinaries import check_pangenome_info, write_gene_sequences_from_pangenome_file
from ppanggolin.formats.writeSequences import write_gene_sequences_from_annotations
from ppanggolin.formats.writeSequences import write_gene_sequences_from_annotations, translate_genes
from ppanggolin.utils import mk_outdir


Expand Down Expand Up @@ -70,7 +70,6 @@ def check_pangenome_for_clustering(pangenome: Pangenome, tmp_file: TextIO, force
"or provide a way to access the gene sequence during the annotation step "
"(having the fasta in the gff files, or providing the fasta files through the --fasta option)")


def first_clustering(sequences: TextIO, tmpdir: Path, cpu: int = 1, code: int = 11, coverage: float = 0.8,
identity: float = 0.8, mode: int = 1) -> Tuple[Path, Path]:
"""
Expand All @@ -86,16 +85,7 @@ def first_clustering(sequences: TextIO, tmpdir: Path, cpu: int = 1, code: int =
:return: path to representative sequence file and path to tsv clustering result
"""
seq_nucdb = tmpdir / 'nucleotid_sequences_db'
cmd = list(map(str, ["mmseqs", "createdb", sequences.name, seq_nucdb]))
logging.getLogger("PPanGGOLiN").debug(" ".join(cmd))
logging.getLogger("PPanGGOLiN").info("Creating sequence database...")
subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
logging.getLogger("PPanGGOLiN").debug("Translate sequence ...")
seqdb = tmpdir / 'aa_db'
cmd = list(map(str, ["mmseqs", "translatenucs", seq_nucdb, seqdb, "--threads", cpu, "--translation-table", code]))
logging.getLogger("PPanGGOLiN").debug(" ".join(cmd))
subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
seqdb = translate_genes(sequences, tmpdir, cpu, code)
logging.getLogger("PPanGGOLiN").info("Clustering sequences...")
cludb = tmpdir / 'cluster_db'
cmd = list(map(str, ["mmseqs", "cluster", seqdb, cludb, tmpdir, "--cluster-mode", mode, "--min-seq-id",
Expand Down
Loading

0 comments on commit 16d933e

Please sign in to comment.