fix duplicated section and header size
JeanMainguy committed Jan 9, 2024
1 parent ff80fb1 commit 540174e
Showing 1 changed file with 5 additions and 20 deletions.
25 changes: 5 additions & 20 deletions docs/user/PangenomeAnalyses/pangenomeStat.md
@@ -81,23 +81,8 @@ The fragmentation value denotes the proportion of families containing fragmented



#### Mean Persistent Duplication

The `mean_persistent_duplication.tsv` is a tab-separated file that lists the gene families and their duplication ratio, their mean presence in the pangenome and whether it is considered a 'single copy marker'. A gene family is considered duplicated it is found in single copy in less than 5% of the genomes by default. This threshold can be adjusted with the parameter `--dup_margin`. And the value that has been used to generated this file is specififed as a comment line strating with a '#'. This notion of single copy marker is used to compute a contamination value for each genome in the [genome statistics table](#genome-statistics-table) described previously, where the contamination is the proportion of single copy marker found in multicopy in a specified genome.




To generate this file, use the following `write_pangenome` subcommand:

```bash
ppanggolin write_pangenome -p pangenome.h5 --stats
```

Executing this command will also create the `genomes_statistics.tsv` file, detailed in the section labeled [here](#mean-persistent-duplication).


### Mean Persistent Duplication
#### Mean Persistent Duplication

The `mean_persistent_duplication.tsv` file lists the gene families along with their duplication ratios, average presence in the pangenome, and classification as 'single copy markers.' In this context, a gene family is not considered in single copy if it appears in single copy in less than 5% of the genomes by default. This default threshold can be adjusted using the `--dup_margin` parameter. The chosen threshold value for generating this file is indicated within a comment line starting with a '#'.

@@ -134,7 +119,7 @@ The flag `--stats` will also generate the `genomes_statistics.tsv` file desdcrib


(gene-presence-absence)=
-### Gene Presence-Absence Matrix
+#### Gene Presence-Absence Matrix

The `gene_presence_absence.Rtab` file represents a presence-absence matrix wherein columns are the genomes used to construct the pangenome, and rows correspond to gene families. Each gene family is identified by the identifier of their representative gene.

@@ -147,7 +132,7 @@ ppanggolin write_pangenome -p pangenome.h5 --Rtab
```


-### Matrix File
+#### Matrix File
The `matrix.csv` file, formatted as a .csv file, follows a structure similar to the `gene_presence_absence.csv` file generated by [Roary](https://sanger-pathogens.github.io/Roary/). This file format is compatible with [Scoary](https://github.com/AdmiralenOla/Scoary) for performing pangenome-wide association studies.

To generate this file, use the `write_pangenome` subcommand with the `--csv` flag:
@@ -158,7 +143,7 @@ ppanggolin write_pangenome -p pangenome.h5 --csv



-### Partitions Files
+#### Partitions Files

The 'Partitions' files are stored within the `partitions` directory and are named after the specific partition they represent (e.g., 'persistent.txt' for the persistent partition). Each file contains a list of gene family identifiers corresponding to the gene families belonging to that particular partition. The format consists of one family identifier per line, facilitating their usage in downstream analysis workflows.

@@ -167,7 +152,7 @@ To generate these files, use the `write_pangenome` subcommand with the `--partit
`ppanggolin write_pangenome -p pangenome.h5 --partitions`


-### Gene Families to Genes Associations
+#### Gene Families to Genes Associations

The `gene_families.tsv` file mirrors the format provided by [MMseqs2](https://github.com/soedinglab/MMseqs2) through its `createtsv` subcommand. This file structure comprises three columns: the gene family name in the first column, the gene names in the second, and a third column that remains empty or contains an "F" to denote potential gene fragments instead of complete genes. This indication appears only if the [defragmentation](./pangenomeCluster.md#defragmentation) pipeline has been used.
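The three-column layout described above can be read with a few lines of Python. This is only a sketch based on that description: the function name `read_gene_families` and the sample rows are hypothetical, and the code assumes a tab-separated file with no header or comment lines.

```python
import csv
import io
from collections import defaultdict


def read_gene_families(handle):
    """Group genes by family from a three-column, tab-separated stream.

    Assumed columns (from the description above): family name, gene name,
    and an optional third column that is empty or contains "F".
    """
    families = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        family, gene = row[0], row[1]
        # "F" in the third column marks a potential gene fragment.
        is_fragment = len(row) > 2 and row[2] == "F"
        families[family].append((gene, is_fragment))
    return dict(families)


# Hypothetical rows, for illustration only (not real PPanGGOLiN output).
example = "famA\tgene1\t\nfamA\tgene2\tF\nfamB\tgene3\t\n"
families = read_gene_families(io.StringIO(example))
```

For a real file, the same function can be called on an open handle, for example `with open("gene_families.tsv", newline="") as fh: read_gene_families(fh)`.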
