Add pathogen-cluster-mutations command
Adds a new top-level pathogen-cluster-mutations command that takes a reference sequence, an alignment, and a table of cluster assignments and returns a table of the mutations found in each cluster at or above a minimum count and frequency. This is a direct port of the cluster_mutation.py script [1] from the cartography project. It exists as a separate command so users (including us) can find mutations for any predefined genetic groups, including clusters, Nextstrain clades, MCCs, Pango lineages, etc.

Closes #20

[1] https://github.com/blab/cartography/blob/717deb1142e1a14f6b74c5a6e794e902dda4aa1c/notebooks/scripts/cluster_mutation.py
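The idea described in the commit message can be sketched in a few lines of Python. This is a minimal illustration and not the ported script: the function name `find_cluster_mutations`, the in-memory dictionaries, and the `A123G`-style mutation labels are all assumptions made for the example, and it ignores details such as gaps or ambiguous bases.

```python
from collections import Counter, defaultdict


def find_cluster_mutations(reference, sequences, clusters,
                           min_allele_count=10, min_allele_frequency=0.5):
    """Map each mutation to the sorted list of clusters in which it is common.

    Hypothetical helper: `reference` is a string, `sequences` maps strain
    name to an aligned sequence of the same length, and `clusters` maps
    strain name to a cluster label.
    """
    # Group strains by their cluster label, skipping strains without a sequence.
    strains_by_cluster = defaultdict(list)
    for strain, cluster in clusters.items():
        if strain in sequences:
            strains_by_cluster[cluster].append(strain)

    clusters_by_mutation = defaultdict(set)
    for cluster, strains in strains_by_cluster.items():
        for position, ref_allele in enumerate(reference):
            # Count the alleles observed at this position within the cluster.
            counts = Counter(sequences[strain][position] for strain in strains)
            for allele, count in counts.items():
                if allele == ref_allele:
                    continue
                frequency = count / len(strains)
                if count >= min_allele_count and frequency >= min_allele_frequency:
                    # Assumed label format: reference allele, 1-based position, new allele.
                    mutation = f"{ref_allele}{position + 1}{allele}"
                    clusters_by_mutation[mutation].add(cluster)

    return {mutation: sorted(found) for mutation, found in clusters_by_mutation.items()}
```

Called with a toy alignment and lower thresholds, for example `find_cluster_mutations("ACGT", {"s1": "ACGA", "s2": "ACGA", "s3": "ACGT"}, {"s1": 0, "s2": 0, "s3": 0}, min_allele_count=2)`, the sketch reports a single mutation at position 4 for cluster 0, because two of the three strains carry the non-reference allele.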
Showing 6 changed files with 219 additions and 3 deletions.
@@ -7,7 +7,7 @@
 
 setup(
     name='pathogen-embed',
-    version='3.0.0',
+    version='3.1.0',
     description='Reduced dimension embeddings for pathogen sequences',
     url='https://github.com/blab/pathogen-embed/',
     author='Sravani Nanduri <[email protected]> , John Huddleston <[email protected]>',
@@ -54,7 +54,8 @@
         "console_scripts": [
             "pathogen-embed = pathogen_embed.__main__:run_embed",
             "pathogen-distance = pathogen_embed.__main__:run_distance",
-            "pathogen-cluster = pathogen_embed.__main__:run_cluster"
+            "pathogen-cluster = pathogen_embed.__main__:run_cluster",
+            "pathogen-cluster-mutations = pathogen_embed.__main__:run_cluster_mutations",
         ]
     }
 )
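The new `console_scripts` entry points the `pathogen-cluster-mutations` executable at a `run_cluster_mutations` function. A rough sketch of what such an entry point could look like with `argparse` follows. The option names are taken from the command line used in this commit's functional test; which options are required, their defaults, and the help text are assumptions, and the function here is only a stand-in for the real one in `pathogen_embed.__main__`.

```python
import argparse


def build_parser():
    """Build a parser with the options used in the functional test."""
    parser = argparse.ArgumentParser(
        prog="pathogen-cluster-mutations",
        description="Find mutations per cluster (illustrative sketch).",
    )
    parser.add_argument("--reference-sequence", required=True,
                        help="FASTA file with the reference sequence")
    parser.add_argument("--alignment", required=True,
                        help="FASTA file with aligned sequences")
    parser.add_argument("--clusters", required=True,
                        help="CSV table with cluster assignments")
    parser.add_argument("--cluster-column", required=True,
                        help="name of the column holding cluster labels")
    # The defaults below are assumptions chosen to match the test invocation.
    parser.add_argument("--min-allele-count", type=int, default=10)
    parser.add_argument("--min-allele-frequency", type=float, default=0.5)
    parser.add_argument("--output", required=True,
                        help="CSV file to write the mutation table to")
    return parser


def run_cluster_mutations(argv=None):
    """Stand-in entry point: parse arguments and return them."""
    return build_parser().parse_args(argv)
```

Because `console_scripts` calls the function with no arguments, `argv=None` lets `argparse` fall back to `sys.argv`, while tests can pass an explicit list.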
@@ -0,0 +1,26 @@
+>U26830.1 Influenza A virus (A/Beijing/32/1992(H3N2)) hemagglutinin gene, complete cds
+ATGAAGACTATCATTGCTTTGAGCTACATTTTATGTCTGGTTTTCGCTCAAAAACTTCCCGGAAATGACA
+ACAGCACAGCAACGCTGTGCCTGGGACATCATGCAGTGCCAAACGGAACGCTAGTGAAAACAATCACGAA
+TGATCAAATTGAAGTGACTAATGCTACTGAGCTGGTTCAGAGTTCCTCAACAGGTAGAATATGCGACAGT
+CCTCACCGAATCCTTGATGGAAAAAACTGCACACTGATAGATGCTCTATTGGGAGACCCTCATTGTGATG
+GCTTCCAAAATAAGGAATGGGACCTTTTTGTTGAACGCAGCAAAGCTTACAGCAACTGTTACCCTTATGA
+TGTACCGGATTATGCCTCCCTTAGGTCACTAGTTGCCTCATCAGGCACCCTGGAGTTTATCAATGAAGAC
+TTCAATTGGACTGGAGTCGCTCAGGATGGGGGAAGCTATGCTTGCAAAAGGGGATCTGTTAACAGTTTCT
+TTAGTAGATTGAATTGGTTGCACAAATCAGAATACAAATATCCAGCGCTGAACGTGACTATGCCAAACAA
+TGGCAAATTTGACAAATTGTACATTTGGGGGGTTCACCACCCGAGCACGGACAGAGACCAAACCAGCCTA
+TATGTTCGAGCATCAGGGAGAGTCACAGTCTCTACCAAAAGAAGCCAACAAACTGTAACCCCGAATATCG
+GGTCTAGACCCTGGGTAAGGGGTCAGTCCAGTAGAATAAGCATCTATTGGACAATAGTAAAACCGGGAGA
+CATACTTTTGATTAATAGCACAGGGAATCTAATTGCTCCTCGGGGTTACTTCAAAATACGAAATGGGAAA
+AGCTCAATAATGAGGTCAGATGCACCCATTGGCACCTGCAGTTCTGAATGCATCACTCCAAATGGAAGCA
+TTCCCAATGACAAACCTTTTCAAAATGTAAACAGGATCACATATGGGGCCTGCCCCAGATATGTTAAGCA
+AAACACTCTGAAATTGGCAACAGGGATGCGGAATGTACCAGAGAAACAAACTAGAGGCATATTCGGCGCA
+ATCGCAGGTTTCATAGAAAATGGTTGGGAGGGAATGGTAGACGGTTGGTACGGTTTCAGGCATCAAAATT
+CTGAGGGCACAGGACAAGCAGCAGATCTTAAAAGCACTCAAGCAGCAATCGACCAAATCAACGGGAAACT
+GAATAGGTTAATCGAGAAAACGAACGAGAAATTCCATCAAATCGAAAAAGAATTCTCAGAAGTAGAAGGG
+AGAATTCAGGACCTCGAGAAATATGTTGAAGACACTAAAATAGATCTCTGGTCTTACAACGCGGAGCTTC
+TTGTTGCCCTGGAGAACCAACATACAATTGATCTTACTGACTCAGAAATGAACAAACTGTTTGAAAAAAC
+AAGGAAGCAACTGAGGGAAAATGCTGAGGACATGGGCAATGGTTGCTTCAAAATATACCACAAATGTGAC
+AATGCCTGCATAGGGTCAATCAGAAATGGAACTTATGACCATGATGTATACAGAGACGAAGCATTAAACA
+ACCGGTTCCAGATCAAAGGTGTTGAGCTGAAGTCAGGATACAAAGATTGGATCCTGTGGATTTCCTTTGC
+CATATCATGCTTTTTGCTTTGTGTTGTTTTGCTGGGGTTCATCATGTGGGCCTGCCAAAAAGGCAACATT
+AGGTGTAACATTTGCATTTGA
@@ -0,0 +1,33 @@
+Run pathogen-embed with PCA on a H3N2 HA alignment.
+
+  $ pathogen-embed \
+  > --alignment $TESTDIR/data/h3n2_ha_alignment.sorted.fasta \
+  > --output-dataframe embed.csv \
+  > pca \
+  > --components 2
+
+Find clusters from the embedding.
+
+  $ pathogen-cluster \
+  > --embedding embed.csv \
+  > --label-attribute cluster_label \
+  > --distance-threshold 0.5 \
+  > --output-dataframe cluster_embed.csv
+
+Find mutations per cluster.
+
+  $ pathogen-cluster-mutations \
+  > --reference-sequence $TESTDIR/data/h3n2_ha_reference.fasta \
+  > --alignment $TESTDIR/data/h3n2_ha_alignment.sorted.fasta \
+  > --clusters cluster_embed.csv \
+  > --cluster-column cluster_label \
+  > --min-allele-count 10 \
+  > --min-allele-frequency 0.5 \
+  > --output mutations_cluster_embed.csv
+
+Confirm that the mutation table output has the correct structure and at least one row.
+
+  $ head -n 1 mutations_cluster_embed.csv
+  mutation,cluster_count,distinct_clusters,cluster_column
+
+  $ [[ $(sed 1d mutations_cluster_embed.csv | wc -l | sed 's/ //g') > 0 ]]
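The two checks at the end of the test (expected header, at least one data row) can also be expressed in Python, which may be handy when validating the output outside of the shell test. This is only a sketch: the helper name `check_mutation_table` is hypothetical, and the example row contains made-up placeholder values, not real output from the command. Note that `>` inside `[[ ]]` compares strings in bash, whereas the sketch below compares the row count numerically.

```python
import csv
import io

# Header line taken from the expected test output above.
EXPECTED_HEADER = ["mutation", "cluster_count", "distinct_clusters", "cluster_column"]


def check_mutation_table(handle):
    """Return True when the CSV has the expected header and at least one data row."""
    reader = csv.reader(handle)
    header = next(reader, None)
    # Skip blank lines so that a trailing newline does not count as a row.
    data_rows = [row for row in reader if row]
    return header == EXPECTED_HEADER and len(data_rows) > 0


# Placeholder content for illustration only; these values are made up.
example = io.StringIO(
    "mutation,cluster_count,distinct_clusters,cluster_column\n"
    "A1T,1,0,cluster_label\n"
)
print(check_mutation_table(example))
```

For a real run, the same function could be given an open file handle, for example `open("mutations_cluster_embed.csv", newline="")`.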