Improve tile plot figure #277

Merged
merged 14 commits into from
Sep 6, 2024
Binary file removed docs/_static/tile_plot.png
Binary file not shown.
Binary file added docs/_static/tile_plot_dendro.png
Binary file added docs/_static/tile_plot_no_dendro.png
23 changes: 17 additions & 6 deletions docs/user/PangenomeAnalyses/pangenomeFigures.md
@@ -17,26 +17,37 @@ ppanggolin draw -p pangenome.h5 --ucurve

#### Tile plot

A tile plot is similar to a heatmap representing the gene families (y-axis) in the genomes (x-axis) making up your pangenome. The tiles on the graph will be colored if the gene family is present in a genome (the color depends on the number of gene copies) and uncolored if absent. The gene families are ordered by partition and then by a hierarchical clustering, and the genomes are ordered by a hierarchical clustering based on their shared gene families (basically two genomes that are close together in terms of gene family composition will be close together on the figure).
A tile plot is a kind of heatmap representing the gene families (y-axis) in the genomes (x-axis) making up your pangenome. A tile is colored if the gene family is present in the genome (blue, or red if the gene family has multiple gene copies) and left uncolored if it is absent. The gene families are ordered by partition and then by their number of presences (increasing order), and the genomes are ordered by a hierarchical clustering based on their shared gene families via a Jaccard distance (in other words, two genomes with similar gene family composition will be close together on the figure).
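
The ordering described above can be illustrated with a minimal sketch. It is not taken from the PPanGGOLiN code: the toy genomes, the `partition_of` mapping and the partition rank (persistent, then shell, then cloud) are assumptions made for illustration. It computes the pairwise Jaccard distances that a hierarchical clustering of genomes would consume, and sorts gene families by partition and then by increasing number of presences.

```python
from itertools import combinations


def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance between two sets of gene family names."""
    union = a | b
    if not union:
        return 0.0  # two empty genomes are treated as identical
    return 1.0 - len(a & b) / len(union)


# Hypothetical toy data: genome name -> gene families present in that genome.
genomes = {
    "g1": {"famA", "famB", "famC"},
    "g2": {"famA", "famB"},
    "g3": {"famA", "famD"},
}

# Pairwise distances between genomes; a hierarchical clustering routine
# would take these as input to order the x-axis.
distances = {
    (x, y): jaccard_distance(genomes[x], genomes[y])
    for x, y in combinations(sorted(genomes), 2)
}

# Hypothetical partition assignment and partition rank (assumed order).
partition_of = {"famA": "persistent", "famB": "shell", "famC": "cloud", "famD": "cloud"}
partition_rank = {"persistent": 0, "shell": 1, "cloud": 2}

# Number of genomes in which each family is present.
presence = {
    fam: sum(fam in fams for fams in genomes.values()) for fam in partition_of
}

# Order families by partition, then by increasing presence, then by name.
ordered_families = sorted(
    partition_of,
    key=lambda fam: (partition_rank[partition_of[fam]], presence[fam], fam),
)
```

In this toy example `g1` and `g2` share two of their three families, so they are the closest pair and would sit next to each other on the x-axis.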

This plot is quite helpful to observe potential structures in your pangenome, and can help you identify eventual outliers. You can interact with it, and mousing over a tile in the plot will indicate to you the gene identifier(s), the gene family and the genome corresponding to the tile.
This plot is quite helpful for observing potential structures in your pangenome, and can help you identify potential outliers. The plot is interactive: hovering over a tile shows the gene identifier(s), the gene family and the genome corresponding to the tile. As detailed below, additional metadata can be displayed.

If you build your pangenome using a workflow subcommands (`all`, `workflow`, `panrgp`, `panmodule`) and you have more than 500 genomes, only the 'shell' and the 'persistent' partitions will be drawn, leaving out the 'cloud' as the figure tends to be too heavy for a browser to open it otherwise.
If you build your pangenome using a workflow subcommand (`all`, `workflow`, `panrgp`, `panmodule`) and you have more than 60k gene families, the plot will not be drawn; if you have more than 32 767 gene families, only the 'shell' and 'persistent' partitions will be drawn, leaving out the 'cloud', as the figure otherwise tends to be too heavy for a browser to open. Outside the workflow subcommands, you can generate the plot with any number of gene families or genomes. However, no browser currently supports visualizing a tile plot containing more than 65 535 gene families or more than 65 535 genomes (for more information, refer to [this Stack Overflow discussion](https://stackoverflow.com/questions/78431835/plotly-heatmap-has-limit-on-data-size)).
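
The workflow thresholds in the paragraph above can be summarized as a small decision function. This is only a sketch: `workflow_tile_plot_mode` and the constant names are hypothetical and do not come from the PPanGGOLiN code base, and "60k" is assumed to mean 60 000.

```python
# Thresholds quoted from the documentation paragraph; names are hypothetical.
MAX_FAMILIES_WORKFLOW = 60_000    # assumed reading of "60k"
MAX_FAMILIES_WITH_CLOUD = 32_767


def workflow_tile_plot_mode(n_gene_families: int) -> str:
    """Return how a workflow subcommand would treat the tile plot.

    "skip"     -> the plot is not drawn at all
    "no_cloud" -> only 'shell' and 'persistent' partitions are drawn
    "full"     -> all partitions are drawn
    """
    if n_gene_families > MAX_FAMILIES_WORKFLOW:
        return "skip"
    if n_gene_families > MAX_FAMILIES_WITH_CLOUD:
        return "no_cloud"
    return "full"
```

Both limits are strict ("more than"), so a pangenome with exactly 32 767 gene families would still be drawn in full under this reading.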

It can be generated using the 'draw' subcommand as such :
To generate a tile plot, use the 'draw' subcommand as follows:

```bash
ppanggolin draw -p pangenome.h5 --tile_plot
```

![tile plot figure](../../_static/tile_plot.png)
![Tile plot figure](../../_static/tile_plot_no_dendro.png)

and if you do not want the 'cloud' gene families as it is a lot of data and can be hard to open with a browser sometimes, you can use the following option :
If you prefer not to include 'cloud' gene families, which can make the plot difficult to render in a browser, you can use the `--nocloud` option:

```bash
ppanggolin draw -p pangenome.h5 --tile_plot --nocloud
```

To include a dendrogram on top of the tile plot, use the `--add_dendrogram` option:

```bash
ppanggolin draw -p pangenome.h5 --tile_plot --add_dendrogram
```

![Tile plot with dendrogram](../../_static/tile_plot_dendro.png)

If you have added metadata to the gene elements of your pangenome (see [metadata documentation](../metadata.md) for details), you can display this metadata in the hover text by using the `--add_metadata` argument.
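
To show how the tile plot flags combine, here is a standalone sketch that mirrors the argument definitions added in this PR with plain `argparse`. It is not the real `ppanggolin draw` parser: the program name `draw-sketch` and the source names `source_a` and `source_b` are hypothetical placeholders.

```python
import argparse

# Minimal stand-in parser reproducing only the tile plot flags from the diff.
parser = argparse.ArgumentParser(prog="draw-sketch")
parser.add_argument("--tile_plot", default=False, action="store_true")
parser.add_argument("--nocloud", default=False, action="store_true")
parser.add_argument("--add_dendrogram", default=False, action="store_true")
parser.add_argument("--add_metadata", default=False, action="store_true")
# nargs="+" collects one or more space-separated source names into a list.
parser.add_argument("--metadata_sources", default=None, nargs="+")

# Hypothetical command line: tile plot with metadata from two named sources.
args = parser.parse_args(
    ["--tile_plot", "--add_metadata", "--metadata_sources", "source_a", "source_b"]
)
```

When `--metadata_sources` is omitted, the value stays `None`, which according to the help text in the diff means all metadata sources are included.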

#### Rarefaction curve
This figure is not drawn by default in the 'workflow' subcommand as it requires a lot of computations. It represents the evolution of the number of gene families for each partition as you add more genomes to the pangenome. It has been used a lot in the literature as an indicator of the diversity that you are missing with your dataset on your taxonomic group (Tettelin et al., 2005). The idea is that if at some point when you keep adding genomes to your pangenome you do not add any more gene families, you might have access to your entire taxonomic group's diversity. On the contrary, if you are still adding a lot of genes you may be still missing a lot of gene families.

32 changes: 29 additions & 3 deletions ppanggolin/figures/drawing.py
@@ -43,7 +43,9 @@ def launch(args: argparse.Namespace):
pangenome = Pangenome()
pangenome.add_file(args.pangenome)
if args.tile_plot:
draw_tile_plot(pangenome, args.output, args.nocloud, disable_bar=args.disable_prog_bar)
draw_tile_plot(pangenome, args.output, args.nocloud, draw_dendrogram=args.add_dendrogram, disable_bar=args.disable_prog_bar,
add_metadata=args.add_metadata,
metadata_sources=args.metadata_sources)
if args.ucurve:
draw_ucurve(pangenome, args.output, soft_core=args.soft_core, disable_bar=args.disable_prog_bar)
if args.draw_spots:
@@ -81,8 +83,7 @@ def parser_draw(parser: argparse.ArgumentParser):
help="Output directory")
optional.add_argument("--tile_plot", required=False, default=False, action="store_true",
help="draw the tile plot of the pangenome")
optional.add_argument("--nocloud", required=False, default=False, action="store_true",
help="Do not draw the cloud in the tile plot")

optional.add_argument("--soft_core", required=False, default=0.95, type=restricted_float,
help="Soft core threshold to use")
optional.add_argument("--ucurve", required=False, default=False, action="store_true",
@@ -91,6 +92,31 @@ def parser_draw(parser: argparse.ArgumentParser):
help="draw plots for spots of the pangenome")
optional.add_argument("--spots", required=False, default='all', nargs='+',
help="a comma-separated list of spots to draw (or 'all' to draw all spots, or 'synteny' to draw spots with different RGP syntenies).")

optional.add_argument("--nocloud", required=False, default=False, action="store_true",
help="Do not draw the cloud genes in the tile plot")
optional.add_argument(
"--add_dendrogram",
required=False,
default=False,
action="store_true",
help="Include a dendrogram for genomes in the tile plot based on the presence/absence of gene families."
)

optional.add_argument(
"--add_metadata",
required=False,
default=False,
action="store_true",
help="Display gene metadata as hover text for each cell in the tile plot."
)

optional.add_argument("--metadata_sources",
default=None,
nargs="+",
help="Which source of metadata should be written in the tile plot. "
"By default all metadata sources are included.")



if __name__ == '__main__':