diff --git a/docs/Dockerfile b/docs/Dockerfile index e31a9643c..7bf5c1a7f 100644 --- a/docs/Dockerfile +++ b/docs/Dockerfile @@ -56,6 +56,7 @@ WORKDIR /workdir COPY environment.yml /workdir RUN set -euxo pipefail >/dev/null \ + && conda config --add channels conda-forge \ && conda env create -n "docs.clades.nextstrain.org" USER ${USER} diff --git a/docs/user/algorithm/03-mutation-calling.md b/docs/user/algorithm/03-mutation-calling.md deleted file mode 100644 index 6c809e5eb..000000000 --- a/docs/user/algorithm/03-mutation-calling.md +++ /dev/null @@ -1,19 +0,0 @@ -# 3. Mutation calling - -In order to detect nucleotide mutations, aligned nucleotide sequences are compared with the reference nucleotide sequence, one nucleotide at a time. Mismatches between the query and reference sequences are then noted and reported differently, depending on their nature: - -- Nucleotide substitutions: a change from one character to another. For example a change from `A` in the reference sequence to `G` in the query sequence. They are shown in sequence views in [Nextclade Web](../nextclade-web) as colored markers, where color signifies the resulting character (in query sequence). - -- Nucleotide deletions ("gaps"): nucleotide was present in the reference sequence, but is not present in the query sequence. These are indicated by the "`-`" character in the alignment sequence. They are shown in sequence views in [Nextclade Web](../nextclade-web) as dark-grey markers. In the output files deletions are represented as numeric ranges, signifying the start and end of the deleted fragment (for example: `21765-21770`) - -- Nucleotide insertions: additional nucleotides in the query sequence that were not present in the reference sequence. They are stripped from the alignment and reported separately, showing the position in the reference after which the insertion occurred and the fragment that was inserted. 
`22030:ACT` would indicate that the query sequence has the three bases `ACT` inserted between position `22030` and `22031` in the reference sequence (the indices are 1-based). - -Nextclade also gathers and reports other useful statistics, such as the number of contiguous ranges of `N` (missing) and non-ACGTN (ambiguous) nucleotides, as well as the total counts of substituted, deleted, missing and ambiguous nucleotides. You can find this information in the results table of [Nextclade Web](../nextclade-web) and in the output files of [Nextclade CLI](../nextclade-cli). - -Similarly, aminoacid mutations and statistics are gathered from the aligned peptides obtained after [translation](./02-translation). This step only runs if a [genome annotation](../input-files/03-genome-annotation) is provided. - -### Results - -The nucleotide mutations can be viewed in "Sequence view" column of the results table in [Nextclade Web](../nextclade-web). Switching "Sequence view" to a particular gene will show mutations in the corresponding peptide. - -The mutation calling step results in a set of mutations and various practical metrics for each sequence. They are produced as a part of the analysis results [JSON](../output-files/05-results-json), [CSV and TSV files](../output-files/04-results-tsv) in [Nextclade CLI](../nextclade-cli) and in the "Download" dialog of [Nextclade Web](../nextclade-web). diff --git a/docs/user/algorithm/05-phylogenetic-placement.md b/docs/user/algorithm/03-phylogenetic-placement.md similarity index 99% rename from docs/user/algorithm/05-phylogenetic-placement.md rename to docs/user/algorithm/03-phylogenetic-placement.md index f9735095e..24d7427a9 100644 --- a/docs/user/algorithm/05-phylogenetic-placement.md +++ b/docs/user/algorithm/03-phylogenetic-placement.md @@ -1,4 +1,4 @@ -# 5. Phylogenetic placement +# 3. Phylogenetic placement After reference alignment and mutation calling, Nextclade places each query sequence on the reference phylogenetic tree. 
diff --git a/docs/user/algorithm/06-clade-assignment.md b/docs/user/algorithm/04-clade-assignment.md similarity index 99% rename from docs/user/algorithm/06-clade-assignment.md rename to docs/user/algorithm/04-clade-assignment.md index a9ef7d5c5..ba244e7a2 100644 --- a/docs/user/algorithm/06-clade-assignment.md +++ b/docs/user/algorithm/04-clade-assignment.md @@ -1,4 +1,4 @@ -# 6. Clade assignment +# 4. Clade assignment To simplify discussion of co-circulating virus variants, viral diversity of is often broken down into [Clades](../terminology.html#clade) or lineages which are defined by specific combinations of signature mutations. Clades are groups of related sequences that share a common ancestor. For SARS-CoV-2, Nextclade can assign both broad clades defined by the Nextstrain team as well as more fine-grained lineages defined by the PANGO consortium. diff --git a/docs/user/algorithm/05-mutation-calling.md b/docs/user/algorithm/05-mutation-calling.md new file mode 100644 index 000000000..e799cd65f --- /dev/null +++ b/docs/user/algorithm/05-mutation-calling.md @@ -0,0 +1,83 @@ +# 5. Mutation calling + +Nextclade calls nucleotide and aminoacid mutations relative to multiple targets. + +### Mutations relative to reference sequence + +In order to detect nucleotide mutations, aligned nucleotide sequences are compared with the reference nucleotide sequence, one nucleotide at a time. Mismatches between the query and reference sequences are then noted and reported differently, depending on their nature: + +- Nucleotide substitutions: a change from one character to another. For example a change from `A` in the reference sequence to `G` in the query sequence. They are shown in sequence views in [Nextclade Web](../nextclade-web) as colored markers, where color signifies the resulting character (in query sequence). + +- Nucleotide deletions ("gaps"): nucleotide was present in the reference sequence, but is not present in the query sequence. 
These are indicated by the "`-`" character in the alignment sequence. They are shown in sequence views in [Nextclade Web](../nextclade-web) as dark-grey markers. In the output files deletions are represented as numeric ranges, signifying the start and end of the deleted fragment (for example: `21765-21770`) + +- Nucleotide insertions: additional nucleotides in the query sequence that were not present in the reference sequence. They are stripped from the alignment and reported separately, showing the position in the reference after which the insertion occurred and the fragment that was inserted. `22030:ACT` would indicate that the query sequence has the three bases `ACT` inserted between position `22030` and `22031` in the reference sequence (the indices are 1-based). + +Nextclade also gathers and reports other useful statistics, such as the number of contiguous ranges of `N` (missing) and non-ACGTN (ambiguous) nucleotides, as well as the total counts of substituted, deleted, missing and ambiguous nucleotides. You can find this information in the results table of [Nextclade Web](../nextclade-web) and in the output files of [Nextclade CLI](../nextclade-cli). + +Similarly, aminoacid mutations and statistics are gathered from the aligned peptides obtained after [translation](./02-translation). This step only runs if a [genome annotation](../input-files/03-genome-annotation) is provided. + +### Private mutations + +Following the [tree placement](03-phylogenetic-placement.md), Nextclade identifies "private mutations" - the mutations between the query sequence and the sequence corresponding to the nearest neighbor (parent) on the tree. + +In the figure, the query sequence (dashed) is compared to all sequences (including internal nodes) of the reference tree to identify the nearest neighbor. The yellow and dark green mutations are private mutations, as they occur in addition to the 3 mutations of the attachment node. 
+ +![Identification of private mutations](../assets/algo_private-muts.png) + +Many sequence quality problems are identifiable by the presence of private mutations. Sequences with unusually many private mutations are unlikely to be biological and are thus [flagged as bad](06-quality-control.md#private-mutations-p). + +Nextclade classifies private mutations further into 3 categories to be more sensitive to potential contamination, co-infection and recombination: + +1. Reversions: Private mutations that go back to the reference sequence, i.e. a mutation with respect to reference is present on the attachment node but not on the query sequence. +2. Labeled mutations: Private mutations to a genotype that is known to be common in a clade. +3. Unlabeled mutations: Private mutations that are neither reversions nor labeled. + +For an illustration of these 3 types, see the figure below. + +![Classification of private mutations](../assets/algo_private-muts-classification.png) + +Reversions are common artefacts in some bioinformatic pipelines when there is amplicon dropout and missing sequence is "filled in" with the reference. +They are also a sign of contamination, co-infection or recombination. Labeled mutations are also a common sign of contamination, co-infection or recombination and deserve special attention. + +For some datasets, reversions and labeled mutations are therefore weighted several times higher than unlabeled mutations due to their higher sensitivity and specificity for quality problems (and recombination). +In February 2022, the SARS-CoV-2 dataset weighted every reversion 6 times (`weightReversionSubstitutions`) while every labeled mutation was weighted 4 times (`weightLabeledSubstitutions`). Unlabeled mutations get weight 1 (`weightUnlabeledSubstitutions`). + +From the weighted sum, 8 (`typical`) is subtracted. The score is then a linear interpolation between 0 and 100 (and above), where 100 corresponds to 24 (`cutoff`). 
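The scoring arithmetic described above can be sketched as follows. This is a minimal illustration, not Nextclade's actual implementation: the function and class names are hypothetical, the default values are the February 2022 SARS-CoV-2 numbers quoted in the text, and two points are assumptions on our part: that the score is clamped at 0 when the weighted sum is below `typical`, and that "100 corresponds to 24" refers to the excess remaining after `typical` has been subtracted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrivateMutationsConfig:
    """Hypothetical container for the QC parameters named in the text.

    Defaults are the February 2022 SARS-CoV-2 values quoted above.
    """

    weight_reversion_substitutions: float = 6.0
    weight_labeled_substitutions: float = 4.0
    weight_unlabeled_substitutions: float = 1.0
    typical: float = 8.0
    cutoff: float = 24.0


def private_mutations_score(
    n_reversions: int,
    n_labeled: int,
    n_unlabeled: int,
    config: PrivateMutationsConfig = PrivateMutationsConfig(),
) -> float:
    """Return the private-mutations QC score for one query sequence."""
    # Weighted sum over the three categories of private mutations.
    weighted_total = (
        config.weight_reversion_substitutions * n_reversions
        + config.weight_labeled_substitutions * n_labeled
        + config.weight_unlabeled_substitutions * n_unlabeled
    )
    # Assumption: a weighted sum below `typical` yields a score of 0.
    excess = max(0.0, weighted_total - config.typical)
    # Assumption: linear scale on which an excess equal to `cutoff` scores 100.
    return excess * 100.0 / config.cutoff


if __name__ == "__main__":
    # 2 reversions, 2 labeled and 12 unlabeled mutations: 12 + 8 + 12 = 32.
    print(private_mutations_score(2, 2, 12))
```

Under these assumptions, a sequence with only 8 unlabeled private mutations scores 0, while one whose weighted sum exceeds `typical` by 24 reaches the "bad" threshold of 100.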
+ +Private deletion ranges (including reversion) are currently counted as a single unlabeled substitution, but this could change in the future. + +### Clade founder search and mutations relative to clade founder + +For each query sample possessing a clade, Nextclade finds a corresponding "clade founder" node in the reference tree - the most ancestral node having the same clade. It starts with the parent node (nearest node) obtained during [tree placement](03-phylogenetic-placement.md) and traverses the tree towards the root, until it finds the last node with the same clade as the parent node. + +After that Nextclade calls nucleotide and aminoacid mutations relative to the clade founder. + +The search and mutation calling happen separately for clades as well as for each custom clade-like attribute (unless excluded in the [pathogen config](../input-files/05-pathogen-config.md)). + +Clade founder search is a built-in convenience wrapper for a [node search and relative mutations](#arbitrary-node-search-and-relative-mutations) with pre-agreed search criteria (matching clades). + +> ⚠️ Nextclade assumes that all clades and clade-like attributes defined on the [input reference tree](../input-files/04-reference-tree.md) are [monophyletic](https://en.wikipedia.org/wiki/Monophyly). In this context it means that all nodes belonging to one clade form a single connected component on the tree. Moreover, the tree should be sufficiently large and diverse, such that early samples of each of the clades are well represented. Official Nextclade datasets enforce these requirements; however, third-party dataset authors and users of their datasets need to take additional care. + +### Arbitrary node search and relative mutations + +In addition to the built-in search for clade founder nodes (see above), [dataset](../datasets.md) authors may define criteria for arbitrary nodes of interest on the [reference tree](../input-files/04-reference-tree.md). 
Nextclade will then search for these nodes, similarly to how it finds clade founder nodes, and will calculate mutations relative to each of these nodes. + +This could be useful, for example, for comparing sequences to the vaccine strains. + +### Results + +The mutation calling step results in a set of mutations and various practical metrics for each sequence. + +Mutations can be viewed in the last column of the results table in [Nextclade Web](../nextclade-web). + +The "Genetic feature" dropdown allows switching between nucleotide sequence and CDSes (if genome annotation is provided). The "Relative to" dropdown selects the target for comparison: + +- "Reference" - shows mutations relative to the [reference sequence](../input-files/02-reference-sequence.md) +- "Parent" - shows private mutations, i.e. mutations relative to the parent (nearest) node +- "Clade founder" - shows mutations relative to clade founder +- " founder" - shows mutations relative to clade-like attribute founder (if any defined) +- any additional entries show mutations relative to the node(s) found according to the custom search criteria (if any defined) + +The "Mut" column shows the total number of nucleotide mutations, and its mouseover tooltip lists the mutations. + +All results are emitted into the output [JSON](../output-files/05-results-json), [CSV and TSV files](../output-files/04-results-tsv) in [Nextclade CLI](../nextclade-cli) and in the "Export" dialog of [Nextclade Web](../nextclade-web). diff --git a/docs/user/algorithm/07-quality-control.md b/docs/user/algorithm/06-quality-control.md similarity index 66% rename from docs/user/algorithm/07-quality-control.md rename to docs/user/algorithm/06-quality-control.md index bdb8fd77b..0ab5891dd 100644 --- a/docs/user/algorithm/07-quality-control.md +++ b/docs/user/algorithm/06-quality-control.md @@ -1,4 +1,4 @@ -# 7. Quality Control (QC) +# 6. 
Quality Control (QC) [Whole-genome sequencing](https://en.wikipedia.org/wiki/Whole_genome_sequencing) of viruses is a complex biotechnological process. Results can vary significantly in their quality, in particular, from scarce or degraded input material. Some parts of the sequence might be missing and the bioinformatic analysis pipelines that turn raw data into a consensus genome sometimes produce artefacts. Such artefacts typically manifest in spurious differences of the sequence from the reference. @@ -11,7 +11,7 @@ Nextclade scans each query sequence for issues which may indicate problems occur For each query sequence each individual QC rule produces a quality score. These **individual QC scores** are empirically calibrated to fit the following thresholds: | Score | Meaning | Color designation | -| ------------- | ------------------ | ----------------- | +|---------------|--------------------|-------------------| | 0 to 29 | "good" quality | green | | 30 to 99 | "mediocre" quality | yellow | | 100 and above | "bad" quality | red | @@ -43,36 +43,7 @@ Ambiguous nucleotides (such as `R`, `Y`, etc) are often indicative of contaminat ### Private mutations (P) -In order to assign clades, Nextclade places sequences on a reference tree that is representative of the global phylogeny (see figure below). The query sequence (dashed) is compared to all sequences (including internal nodes) of the reference tree to identify the nearest neighbor. - -As a by-product of this placement, Nextclade identifies the mutations, called "private mutations", that differ between the query sequence and the nearest neighbor sequence. In the figure, the yellow and dark green mutations are private mutations, as they occur in addition to the 3 mutations of the attachment node. - -![Identification of private mutations](../assets/algo_private-muts.png) - -Many sequence quality problems are identifiable by the presence of private mutations. 
Sequences with unusually many private mutations are unlikely to be biological and are thus flagged as bad. - -Since web version 1.13.0 (CLI 1.10.0), Nextclade classifies private mutations further into 3 categories to be more sensitive to potential contamination, co-infection and recombination: - -1. Reversions: Private mutations that go back to the reference sequence, i.e. a mutation with respect to reference is present on the attachment node but not on the query sequence. -2. Labeled mutations: Private mutations to a genotype that is known to be common in a clade. -3. Unlabeled mutations: Private mutations that are neither reversions nor labeled. - -For an illustration of these 3 types, see the figure below. - -![Classification of private mutations](../assets/algo_private-muts-classification.png) - -Reversions are common artefacts in some bioinformatic pipelines when there is amplicon dropout. -They are also a sign of contamination, co-infection or recombination. Labeled mutations also contain commonly when there's contamination, co-infection or recombination. - -Reversions and labeled mutations are weighted several times higher than unlabeled mutations due to their higher sensitivity and specificity for quality problems (and recombination). -In February 2022, every reversion was counted 6 times (`weightReversionSubstitutions`) while every labeled mutation was counted 4 times (`weightLabeledSubstitutions`). Unlabeled mutations get weight 1 (`weightUnlabeledSubstitutions`). - -From the weighted sum, 8 (`typical`) is subtracted. The score is then a linear interpolation between 0 and 100 (and above), where 100 corresponds to 24 (`cutoff`). - -Private deletion ranges (including reversion) are currently counted as a single unlabeled substitution, but this could change in the future. 
- -Which genotypes get "labeled" is determined in the dataset config file `virus_properties.json` which can also be found in the [Github repo](https://github.com/nextstrain/nextclade_data/blob/master/data/datasets/sars-cov-2/references/MN908947/versions/2022-02-07T12:00:00Z/files/virus_properties.json). -Currently, all mutations that appear in at least 30% of the sequences of a clade or in at least 100k sequences in a clade get that clade's label. +[Private mutations](05-mutation-calling.md#private-mutations) may indicate sequencing errors or unusual variants. ### Mutation clusters (C) diff --git a/docs/user/algorithm/04-pcr-primer-changes-detection.md b/docs/user/algorithm/07-pcr-primer-changes-detection.md similarity index 97% rename from docs/user/algorithm/04-pcr-primer-changes-detection.md rename to docs/user/algorithm/07-pcr-primer-changes-detection.md index cd0c113ab..b62f6b334 100644 --- a/docs/user/algorithm/04-pcr-primer-changes-detection.md +++ b/docs/user/algorithm/07-pcr-primer-changes-detection.md @@ -1,4 +1,4 @@ -# 4. Detection of PCR primer changes +# 7. Detection of PCR primer changes [Polymerase chain reactions (PCR)](https://en.wikipedia.org/wiki/Polymerase_chain_reaction) uses small nucleotide sequence snippets called "primers" that are [complementary]() to a specific region of the virus genome. High similarity between primers and the genome region they are supposed to bind to is required for PCR to work. Changes in the virus genome can interfere with this requirement. If Nextclade is provided with a table of PCR primers in the pathogen metadata file, it can analyze these regions in query sequences and report changes that may indicate reduced primer binding. 
diff --git a/docs/user/algorithm/index.rst b/docs/user/algorithm/index.rst index e140e9526..4eaa0b451 100644 --- a/docs/user/algorithm/index.rst +++ b/docs/user/algorithm/index.rst @@ -10,8 +10,8 @@ Internally, Nextclade is implemented as a parallel pipeline which consists of se 01-sequence-alignment.md 02-translation.md - 03-mutation-calling.md - 04-pcr-primer-changes-detection.md - 05-phylogenetic-placement.md - 06-clade-assignment.md - 07-quality-control.md + 03-phylogenetic-placement.md + 04-clade-assignment.md + 05-mutation-calling.md + 06-quality-control.md + 07-pcr-primer-changes-detection.md diff --git a/docs/user/input-files/03-genome-annotation.md b/docs/user/input-files/03-genome-annotation.md index 05d2c94f6..199f7c37c 100644 --- a/docs/user/input-files/03-genome-annotation.md +++ b/docs/user/input-files/03-genome-annotation.md @@ -6,7 +6,7 @@ The annotation is required for codon-aware alignment, for translation of CDS (Co Accepted formats: [GFF3](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3%2Emd). -Since version 3, Nextclade supports multi-fragment CDSs which enable the correct translation of complex features including programmed ribosomal slippage (e.g. ORF1ab in SARS-CoV-2), genes crossing the origin of a circular genome (e.g. Hepatitis B virus) and CDS that require splicing (e.g. HIV). +Nextclade supports multi-fragment CDSs which enable the correct translation of complex features including programmed ribosomal slippage (e.g. ORF1ab in SARS-CoV-2), genes crossing the origin of a circular genome (e.g. Hepatitis B virus) and CDS that require splicing (e.g. HIV). Almost any syntactically correct [spec](https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3%2Emd)-compliant GFF3 annotation (e.g. downloaded from Genbank) should work. 
In practice, because GFF3 format allows for great freedom of how to express features as well as how to interpret them, some processing may be required to make it work satisfactory in Nextclade. diff --git a/docs/user/input-files/04-reference-tree.md b/docs/user/input-files/04-reference-tree.md index b4e85a36e..12d56e665 100644 --- a/docs/user/input-files/04-reference-tree.md +++ b/docs/user/input-files/04-reference-tree.md @@ -22,17 +22,22 @@ Auspice JSON trees prepared for usage in Nextclade can contain a set of extensio #### Clade-like attributes -For organisms with multiple concurrent nomenclatures (clades, lineages, variants etc.), in addition to clades (see [Algorithm: Clade Assignment](../algorithm/06-clade-assignment)), dataset authors can choose to add extra clade-like attributes. +For organisms with multiple concurrent nomenclatures (clades, lineages, variants etc.), in addition to clades (see [Algorithm: Clade Assignment](../algorithm/04-clade-assignment.md)), dataset authors can choose to add extra clade-like attributes. -The clade-like attributes behave like built-in clades and are copied from the nearest node along with them. +The clade-like attributes behave like built-in clades (`.node_attrs.clade_membership` in every node) and are copied from the nearest node along with it. Each declared attribute will result in a new column in the results table in Nextclade Web and in TSV/CSV output files, as well as a set of corresponding fields in the output JSON/NDJSON and output tree (the newly placed nodes). +Additionally, each of the attributes, unless excluded, participates in [founder node search](../algorithm/05-mutation-calling.md). For each attribute, Nextclade Web will display in the "Relative to" dropdown an additional entry named "'' founder", and a set of columns/fields `founderMuts` will be added to the [outputs](../output-files/04-results-tsv.md). 
+ + As a dataset author, in order to add clade-like attributes to your reference tree, modify the reference tree file as follows: -1. Add field `.meta.extensions.nextclade.clade_node_attrs` of array type, and declare the clade-like attributes you want to add in this format: +1. Add field `.meta.extensions.nextclade.clade_node_attrs` of array type, and declare the clade-like attributes you want to add. - ```json5 + Example (for latest examples see [nextstrain/nextclade_data](https://github.com/nextstrain/nextclade_data)): + + ```json { "meta": { "extensions": { @@ -42,13 +47,15 @@ As a dataset author, in order to add clade-like attributes to your reference tre "name": "other-clade", "displayName": "Other clade", "description": "This long text goes into the tooltip. Explain what the clades are, who and where defined them.", - "hideInWeb": false + "hideInWeb": false, + "skipAsReference": true }, { "name": "my-lineage", "displayName": "My lineage", "description": "This long text goes into the tooltip. Explain what the lineages are, who and where defined them.", - "hideInWeb": false + "hideInWeb": false, + "skipAsReference": true } ] } @@ -57,10 +64,16 @@ As a dataset author, in order to add clade-like attributes to your reference tre } ``` + Fields: + - `name` - (required) machine-readable identifier of the attribute. Should match the attribute on the tree nodes. Will be used to name fields/columns in JSON and TSV output files. + - `displayName` - (optional) human-friendly name of the attribute. Will be shown in Nextclade Web. + - `description` - (optional) human-friendly description of the attribute. Will be shown in Nextclade Web. + - `hideInWeb` - (optional) set this to `true` to hide attribute's column from Nextclade Web + - `skipAsReference` - (optional) set this to `true` to not use the attribute for calculating clade founder nodes and relative mutations. + 2. 
For each node in the tree, add node attribute with the same name as the `name` field in the attribute's description and with the value corresponding to the value of the clade, lineage etc. of this node: - ```json5 - // inside each node + ```json { "node_attrs": { "clade_membership": {"value": "A1"}, @@ -69,10 +82,63 @@ As a dataset author, in order to add clade-like attributes to your reference tre } } ``` - - Note that `clade_membership` attribute is treated separately (if present) and it does not need to be declared in `clade_node_attrs`. + + Note that `clade_membership` attribute is treated separately (if present) and it does not need to be declared in `clade_node_attrs`. 3. Now when running Nextclade with this tree, you will notice additional columns in the outputs. Each entry in a column for a clade-like attribute corresponds to a clade value assigned to the query sequence. + For concrete examples of using clade-like attributes, check out official SARS-CoV-2 datasets: they assign Nextstrain clades, Pango lineages and WHO VOC/VOIs simultaneously. + +#### Relative mutations + +Add object under `.meta.extensions.nextclade.ref_nodes`: + +```json +{ + "ref_nodes": { + "default": "__root__", + "search": [ + { + "name": "JN.1", + "displayName": "JN.1 (24A)", + "description": "Variant recommended for the 2024/2025 COVID-19 vaccine", + "criteria": [ + { + "qry": [ + { + "clade": ["23I", "24A", "24B", "24C", "recombinant"] + } + ], + "node": [ + { + "name": ["JN.1"] + } + ] + } + ] + } + ] + } +} +``` + +Properties: + +- `default`: string, optional. Set default search to display in the Nextclade Web dropdown. Should correspond to one of the `search[].name` fields or one of the special values `__root__` for reference sequence (default), `__parent__` for nearest node (private mutations), `__clade_founder__` for founder of the clade. + +- `search`: array of objects, optional. Each object describes one search. 
Each search corresponds to an entry in the "Relative to" dropdown in the web app and a set of CSV/TSV columns `relativeMutations['searchName']`. Note that these names no longer need to correspond to node names. + - `search[].name`: required unique identifier of the search entry + - `search[].displayName`, `search[].description`: optional friendly name and description to be displayed in the UI (dropdown) + - `search[].criteria`: array of objects, optional. One or multiple search criteria. Criteria should be written so that, during a search, only one criterion matches a given pair of query and node. If there are multiple matches, then one (unspecified) match is taken and a warning is emitted. + - `search[].criteria[].qry`: object, optional, describing properties of query samples to select for this search + - `search[].criteria[].qry.clade`: array of strings, optional. Query names to consider for this search. At least one match is necessary for sample to match. + - `search[].criteria[].qry.cladeNodeAttrs`: optional mapping from name of the clade-like attr to a list of searched values for this attr. At least one match is necessary for sample to match. + - `search[].criteria[].node`: object, optional, describing properties of ref node to search, as well as search algorithm. All of the properties should match. + - `search[].criteria[].node.name`: array of strings, optional. Searched node names. At least one match is necessary for node to match. + - `search[].criteria[].node.clade`: array of strings, optional. Searched node clades. At least one match is necessary for node to match. + - `search[].criteria[].node.cladeNodeAttrs`: optional mapping from name of the clade-like attr to a list of searched values for this attr. At least one match is necessary for node to match. + - `search[].criteria[].node.searchAlgo`: string, optional. 
Search algorithm to use + - `full` (default): simple loop over all nodes until first match is found + - `ancestor-earliest`: start with the current sample and traverse the graph against edge directions, looking for matching nodes, until it reaches root node. The result is the last encountered matching node. + - `ancestor-latest`: start with the current sample and traverse the graph against edge directions, looking for matching nodes. The first match is the result. -For concrete examples of using clade-like attributes, check out official SARS-CoV-2 datasets: they assign Nextstrain clades, Pango lineages and WHO VOC/VOIs simultaneously. diff --git a/docs/user/nextclade-web/phylogenetic-tree-view.md b/docs/user/nextclade-web/phylogenetic-tree-view.md index c96f4faf0..6113b5b56 100644 --- a/docs/user/nextclade-web/phylogenetic-tree-view.md +++ b/docs/user/nextclade-web/phylogenetic-tree-view.md @@ -5,7 +5,7 @@ In order to assign clades to sequences, Nextclade [places](../algorithm/05-phylo The tree is visualized by [Nextstrain Auspice](https://docs.nextstrain.org/projects/auspice/en/stable/). By default, only your uploaded sequences are highlighted. ![Tree with new sequences](../assets/web_tree.png) -Since Nextclade v3, Nextclade runs a greedy parsimony tree builder on user provided sequences. This means that approximate ancestral relationships between your sequences are visible on the tree. Given the simplicity of the tree builder, the tree is not guaranteed to be optimal. In the screenshot below, all but the 3 grey sequences are user provided. Nextclade has grouped related user provided sequences into clusters, based on shared mutations. +Nextclade runs a greedy parsimony tree builder on user provided sequences. This means that approximate ancestral relationships between your sequences are visible on the tree. Given the simplicity of the tree builder, the tree is not guaranteed to be optimal. In the screenshot below, all but the 3 grey sequences are user provided. 
Nextclade has grouped related user provided sequences into clusters, based on shared mutations. ![Nextclade tree builder](../assetts/../assets/web_tree-builder.png) diff --git a/docs/user/output-files/04-results-tsv.md b/docs/user/output-files/04-results-tsv.md index f3f5f6c85..076e5c4f6 100644 --- a/docs/user/output-files/04-results-tsv.md +++ b/docs/user/output-files/04-results-tsv.md @@ -19,76 +19,91 @@ TSV and CSV files are equivalent and only differ in the column delimiter (tabs v Every row in tabular output corresponds to 1 input sequence. The meaning of columns is described below: -| Column name | Meaning | type | Example | -|-------------------------------------------------|-------------------------------------------------------------------------------------------------------------|---------------------------------|----------------------------------| -| index | Index (integer signifying location) of a corresponding record in the input fasta file(s) | non-negative integer | 0 | -| seqName | Name of the sequence (as provided in the input file) | string | hCoV-19/USA/SEARCH-4652-SAN/2020 | -| clade | Assigned clade | string | 20A | -| qc.overallScore | Overall [quality control](../algorithm/07-quality-control) score | float | 23.5 | -| qc.overallStatus | Overall [quality control](../algorithm/07-quality-control) status | string: `good\|mediocre\|bad` | mediocre | -| totalSubstitutions | Total number of detected nucleotide substitutions | non-negative integer | 2 | -| totalDeletions | Total number of deleted nucleotide bases | non-negative integer | 15 | -| totalInsertions | Total number of inserted nucleotide bases | non-negative integer | 3 | -| totalFrameShifts | Total number of detected frame shifts | non-negative integer | 0 | -| totalAminoacidSubstitutions | Total number of detected aminoacid substitutions | non-negative integer | 1 | -| totalAminoacidDeletions | Total number of deleted amino acid residues | non-negative integer | 7 | -| 
totalAminoacidInsertions | Total number of inserted amino acid residues | non-negative integer | 8 | -| totalMissing | Total number of detected missing nucleotides (nucleotide character `N`) | non-negative integer | 238 | -| totalNonACGTNs | Total number of detected ambiguous nucleotides (nucleotide characters that are not `A`, `C`, `G`, `T`, `N`) | non-negative integer | 2 | -| totalUnknownAa | Total number of unknown aminoacids (aminoacid character `X`) | non-negative integer | 0 | -| totalPcrPrimerChanges | Total number of nucleotide mutations detected in PCR primer regions | non-negative integer | 0 | -| substitutions | List of detected nucleotide substitutions | comma separated list of strings | C241T,C2061T,C11514T,G23012A | -| deletions | List of detected nucleotide deletion ranges | comma separated list of strings | 201,28881-28882 | -| insertions | List of detected inserted nucleotide fragments | comma separated list of strings | 248:G,21881:GAG | -| privateNucMutations.reversionSubstitutions | List of detected private mutations that are reversions to reference | comma separated list of strings | C241T | -| privateNucMutations.labeledSubstitutions | List of detected private mutations that are to a genotype that has been labeled in `virus_properties.json` | comma separated list of strings | C11514T\|21I&20C,C2061T\|21E | -| privateNucMutations.unlabeledSubstitutions | List of detected private mutations that are neither reversions nor labeled | comma separated list of strings | G23012A | -| privateNucMutations.totalReversionSubstitutions | Total number of private mutations that are reversions to reference | non-negative integer | 1 | -| privateNucMutations.totalLabeledSubstitutions | Total number of private mutations that are to a genotype that has been labeled in `virus_properties.json` | non-negative integer | 2 | -| privateNucMutations.totalUnlabeledSubstitutions | Total number of private mutations that are neither reversions nor labeled | non-negative 
integer | 1 | -| privateNucMutations.totalPrivateSubstitutions | Total number of private mutations overall | non-negative integer | 4 | -| frameShifts | List of detected frame shifts | comma separated list of strings | N:33-420 | -| aaSubstitutions | List of detected aminoacid substitutions | comma separated list of strings | E:T9I,N:R203K | -| aaDeletions | List of detected aminoacid deletions | comma separated list of strings | N:E31-,N:E32- | -| aaInsertions | List of detected aminoacid insertions | comma separated list of strings | S:214:EPE | -| missing | List of detected missing nucleotides (nucleotide character `N`) | comma separated list of strings | 704-726,4248 | -| nonACGTNs | List of detected ambiguous nucleotides (nucleotide characters that are not `A`, `C`, `G`, `T`, `N`) | comma separated list of strings | Y:27948,K:3877 | -| unknownAaRanges | List of detected contiguous ranges of unknown aminoacid (aminoacid character `X`) | comma separated list of strings | E:1-12,E:29 | -| pcrPrimerChanges | List of detected PCR primer changes | comma separated list of strings | | -| alignmentScore | Alignment score | non-negative integer | 88237 | -| alignmentStart | Beginning of the sequenced region | non-negative integer | 1 | -| alignmentEnd | End of the sequenced region | non-negative integer | 29903 | -| qc.missingData.missingDataThreshold | Threshold that was used for "Missing data" QC rule | int | 3000 | -| qc.missingData.score | Score for "Missing data" QC rule | float | 0.5 | -| qc.missingData.status | Status for "Missing data" QC rule | string: `good\|mediocre\|bad` | mediocre | -| qc.missingData.totalMissing | Total number of missing nucleotides used in "Missing data" QC rule | non-negative integer | 238 | -| qc.mixedSites.mixedSitesThreshold | Threshold used for "Mixed sites" QC rule | int | 10 | -| qc.mixedSites.score | Score for "Mixed sites" QC rule | float | 0.5 | -| qc.mixedSites.status | Status for "Mixed sites" QC rule | string: 
`good\|mediocre\|bad` | good | -| qc.mixedSites.totalMixedSites | Total number of ambiguous nucleotides used for "Mixed sites" QC rule | non-negative integer | 2 | -| qc.privateMutations.cutoff | Cutoff parameter used for "Private mutations" QC rule | int | 3 | -| qc.privateMutations.excess | Excess parameter used for "Private mutations" QC rule | int | 1 | -| qc.privateMutations.score | Score for "Private mutations" QC rule | float | 0.5 | -| qc.privateMutations.status | Status for "Private mutations" QC rule | string: `good\|mediocre\|bad` | good | -| qc.privateMutations.total | Weighted sum of private mutations used for "Private mutations" QC rule | non-negative integer | 4 | -| qc.snpClusters.clusteredSNPs | Clustered SNP detected for "SNP clusters" QC rule | comma separated list of strings | C241T,C2061T | -| qc.snpClusters.score | Score for "SNP clusters" QC rule | float | 0.5 | -| qc.snpClusters.status | Status for "SNP clusters" QC rule | string: `good\|mediocre\|bad` | bad | -| qc.snpClusters.totalSNPs | Total number of SNPs for "SNP clusters" QC rule | non-negative integer | 2 | -| qc.frameShifts.frameShifts | List of detected frame shifts in "Frame shifts" QC rule (excluding ignored) | comma separated list of strings | N:33-420 | -| qc.frameShifts.totalFrameShifts | Total number of detected frame shifts in for "Frame shifts" QC rule (excluding ignored) | non-negative integer | 1 | -| qc.frameShifts.frameShiftsIgnored | List of frame shifts detected, but ignored due to ignore list | comma separated list of strings | ORF8:109-111 | -| qc.frameShifts.totalFrameShiftsIgnored | Total number of frame shifts detected, but ignored due to ignore list | non-negative integer | 1 | -| qc.frameShifts.score | Score for "Frame shifts" QC rule | float | 0.5 | -| qc.frameShifts.status | Status for "Frame shifts" QC rule | string: `good\|mediocre\|bad` | bad | -| qc.stopCodons.stopCodons | List of detected stop codons in "Stop codons" QC rule | comma separated list of 
strings | ORF1a:4715,ORF1a:4716 | -| qc.stopCodons.totalStopCodons | Total number of detected stop codons in "Stop codons" QC rule | non-negative integer | 2 | -| qc.stopCodons.score | Score for "Stop codons" QC rule | float | 0.5 | -| qc.stopCodons.status | Status for "Stop codons" QC rule | string: `good\|mediocre\|bad` | bad | -| isReverseComplement | Whether query sequences were transformed using reverse complement operation before alignment | boolean | false | -| errors | List of errors during processing | comma separated list of strings | | -| warnings | List of warnings during processing | comma separated list of strings | | -| failedCdses | List of CDS that failed translation | comma separated list of strings | | +| Column name | Meaning | type | Example | +|-------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------|----------------------------------| +| index | Index (integer signifying location) of a corresponding record in the input fasta file(s) | non-negative integer | 0 | +| seqName | Name of the sequence (as provided in the input file) | string | hCoV-19/USA/SEARCH-4652-SAN/2020 | +| clade | Assigned clade | string | 20A | +| qc.overallScore | Overall [quality control](../algorithm/07-quality-control) score | float | 23.5 | +| qc.overallStatus | Overall [quality control](../algorithm/07-quality-control) status | string: `good\|mediocre\|bad` | mediocre | +| totalSubstitutions | Total number of detected nucleotide substitutions | non-negative integer | 2 | +| totalDeletions | Total number of deleted nucleotide bases | non-negative integer | 15 | +| totalInsertions | Total number of inserted nucleotide bases | non-negative integer | 3 | +| totalFrameShifts | Total number of detected frame shifts | non-negative integer | 0 | +| totalAminoacidSubstitutions | 
Total number of detected aminoacid substitutions | non-negative integer | 1 | +| totalAminoacidDeletions | Total number of deleted amino acid residues | non-negative integer | 7 | +| totalAminoacidInsertions | Total number of inserted amino acid residues | non-negative integer | 8 | +| totalMissing | Total number of detected missing nucleotides (nucleotide character `N`) | non-negative integer | 238 | +| totalNonACGTNs | Total number of detected ambiguous nucleotides (nucleotide characters that are not `A`, `C`, `G`, `T`, `N`) | non-negative integer | 2 | +| totalUnknownAa | Total number of unknown aminoacids (aminoacid character `X`) | non-negative integer | 0 | +| totalPcrPrimerChanges | Total number of nucleotide mutations detected in PCR primer regions | non-negative integer | 0 | +| substitutions | List of detected nucleotide substitutions | comma separated list of strings | C241T,C2061T,C11514T,G23012A | +| deletions | List of detected nucleotide deletion ranges | comma separated list of strings | 201,28881-28882 | +| insertions | List of detected inserted nucleotide fragments | comma separated list of strings | 248:G,21881:GAG | +| frameShifts | List of detected frame shifts | comma separated list of strings | N:33-420 | +| aaSubstitutions | List of detected aminoacid substitutions | comma separated list of strings | E:T9I,N:R203K | +| aaDeletions | List of detected aminoacid deletions | comma separated list of strings | N:E31-,N:E32- | +| aaInsertions | List of detected aminoacid insertions | comma separated list of strings | S:214:EPE | +| missing | List of detected missing nucleotides (nucleotide character `N`) | comma separated list of strings | 704-726,4248 | +| nonACGTNs | List of detected ambiguous nucleotides (nucleotide characters that are not `A`, `C`, `G`, `T`, `N`) | comma separated list of strings | Y:27948,K:3877 | +| unknownAaRanges | List of detected contiguous ranges of unknown aminoacid (aminoacid character `X`) | comma separated list of 
strings | E:1-12,E:29 | +| pcrPrimerChanges | List of detected PCR primer changes | comma separated list of strings | | +| alignmentScore | Alignment score | non-negative integer | 88237 | +| alignmentStart | Beginning of the sequenced region | non-negative integer | 1 | +| alignmentEnd | End of the sequenced region | non-negative integer | 29903 | +| privateNucMutations.reversionSubstitutions | List of detected private mutations that are reversions to reference | comma separated list of strings | C241T | +| privateNucMutations.labeledSubstitutions | List of detected private mutations that are to a genotype that has been labeled in `virus_properties.json` | comma separated list of strings | C11514T\|21I&20C,C2061T\|21E | +| privateNucMutations.unlabeledSubstitutions | List of detected private mutations that are neither reversions nor labeled | comma separated list of strings | G23012A | +| privateNucMutations.totalReversionSubstitutions | Total number of private mutations that are reversions to reference | non-negative integer | 1 | +| privateNucMutations.totalLabeledSubstitutions | Total number of private mutations that are to a genotype that has been labeled in `virus_properties.json` | non-negative integer | 2 | +| privateNucMutations.totalUnlabeledSubstitutions | Total number of private mutations that are neither reversions nor labeled | non-negative integer | 1 | +| privateNucMutations.totalPrivateSubstitutions | Total number of private mutations overall | non-negative integer | 4 | +| founderMuts\['clade'\].nodeName | Clade founder node name on reference tree | string | hCoV-19/USA/SEARCH-4652-SAN/2020 | +| founderMuts\['clade'\].substitutions | List of detected nucleotide substitutions relative to clade founder | comma separated list of strings | A123T,C456G | +| founderMuts\['clade'\].deletions | List of detected nucleotide deletions relative to clade founder | comma separated list of strings | 10-15,44-55 | +| founderMuts\['clade'\].aaSubstitutions | List 
of detected aminoacid substitutions relative to clade founder | comma separated list of strings | E:T9I,N:R203K | +| founderMuts\['clade'\].aaDeletions | List of detected aminoacid deletions relative to clade founder | comma separated list of strings | N:E31-,N:E32- | +| founderMuts\[''\].nodeName | Node name of the founder of each clade-like attribute on reference tree | string | hCoV-19/USA/SEARCH-4652-SAN/2020 | +| founderMuts\[''\].substitutions | List of detected nucleotide substitutions relative to founder of each clade-like attribute | comma separated list of strings | A123T,C456G | +| founderMuts\[''\].deletions | List of detected nucleotide deletions relative to founder of each clade-like attribute | comma separated list of strings | 10-15,44-55 | +| founderMuts\[''\].aaSubstitutions | List of detected aminoacid substitutions relative to founder of each clade-like attribute | comma separated list of strings | E:T9I,N:R203K | +| founderMuts\[''\].aaDeletions | List of detected aminoacid deletions relative to founder of each clade-like attribute | comma separated list of strings | N:E31-,N:E32- | +| relativeMutations\[''\].nodeName | Name of node of interest found on reference tree according to [custom search criteria](../algorithm/05-mutation-calling#arbitrary-node-search-and-relative-mutations) | string | hCoV-19/USA/SEARCH-4652-SAN/2020 | +| relativeMutations\[''\].substitutions | List of detected nucleotide substitutions relative to the node of interest | comma separated list of strings | A123T,C456G | +| relativeMutations\[''\].deletions | List of detected nucleotide deletions relative to the node of interest | comma separated list of strings | 10-15,44-55 | +| relativeMutations\[''\].aaSubstitutions | List of detected aminoacid substitutions relative to the node of interest | comma separated list of strings | E:T9I,N:R203K | +| relativeMutations\[''\].aaDeletions | List of detected aminoacid deletions relative to the node of interest | comma separated 
list of strings | N:E31-,N:E32- | +| qc.missingData.missingDataThreshold | Threshold that was used for "Missing data" QC rule | int | 3000 | +| qc.missingData.score | Score for "Missing data" QC rule | float | 0.5 | +| qc.missingData.status | Status for "Missing data" QC rule | string: `good\|mediocre\|bad` | mediocre | +| qc.missingData.totalMissing | Total number of missing nucleotides used in "Missing data" QC rule | non-negative integer | 238 | +| qc.mixedSites.mixedSitesThreshold | Threshold used for "Mixed sites" QC rule | int | 10 | +| qc.mixedSites.score | Score for "Mixed sites" QC rule | float | 0.5 | +| qc.mixedSites.status | Status for "Mixed sites" QC rule | string: `good\|mediocre\|bad` | good | +| qc.mixedSites.totalMixedSites | Total number of ambiguous nucleotides used for "Mixed sites" QC rule | non-negative integer | 2 | +| qc.privateMutations.cutoff | Cutoff parameter used for "Private mutations" QC rule | int | 3 | +| qc.privateMutations.excess | Excess parameter used for "Private mutations" QC rule | int | 1 | +| qc.privateMutations.score | Score for "Private mutations" QC rule | float | 0.5 | +| qc.privateMutations.status | Status for "Private mutations" QC rule | string: `good\|mediocre\|bad` | good | +| qc.privateMutations.total | Weighted sum of private mutations used for "Private mutations" QC rule | non-negative integer | 4 | +| qc.snpClusters.clusteredSNPs | Clustered SNP detected for "SNP clusters" QC rule | comma separated list of strings | C241T,C2061T | +| qc.snpClusters.score | Score for "SNP clusters" QC rule | float | 0.5 | +| qc.snpClusters.status | Status for "SNP clusters" QC rule | string: `good\|mediocre\|bad` | bad | +| qc.snpClusters.totalSNPs | Total number of SNPs for "SNP clusters" QC rule | non-negative integer | 2 | +| qc.frameShifts.frameShifts | List of detected frame shifts in "Frame shifts" QC rule (excluding ignored) | comma separated list of strings | N:33-420 | +| qc.frameShifts.totalFrameShifts | Total number 
of detected frame shifts for "Frame shifts" QC rule (excluding ignored) | non-negative integer | 1 | +| qc.frameShifts.frameShiftsIgnored | List of frame shifts detected, but ignored due to ignore list | comma separated list of strings | ORF8:109-111 | +| qc.frameShifts.totalFrameShiftsIgnored | Total number of frame shifts detected, but ignored due to ignore list | non-negative integer | 1 | +| qc.frameShifts.score | Score for "Frame shifts" QC rule | float | 0.5 | +| qc.frameShifts.status | Status for "Frame shifts" QC rule | string: `good\|mediocre\|bad` | bad | +| qc.stopCodons.stopCodons | List of detected stop codons in "Stop codons" QC rule | comma separated list of strings | ORF1a:4715,ORF1a:4716 | +| qc.stopCodons.totalStopCodons | Total number of detected stop codons in "Stop codons" QC rule | non-negative integer | 2 | +| qc.stopCodons.score | Score for "Stop codons" QC rule | float | 0.5 | +| qc.stopCodons.status | Status for "Stop codons" QC rule | string: `good\|mediocre\|bad` | bad | +| isReverseComplement | Whether query sequences were transformed using reverse complement operation before alignment | boolean | false | +| errors | List of errors during processing | comma separated list of strings | | +| warnings | List of warnings during processing | comma separated list of strings | | +| failedCdses | List of CDS that failed translation | comma separated list of strings | | > ⚠️ Note that sequence names (`seqName` column) are not guaranteed to be unique (and in practice are not unique very often). So indices is the only way to reliably link together inputs and outputs.
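
The note about linking rows by `index` rather than `seqName` can be illustrated with a minimal sketch. It uses only the Python standard library and a tiny made-up TSV snippet; the sample rows and the helper names `rows_by_index` and `parse_ranges` are assumptions for illustration and are not part of Nextclade. Real output files contain many more columns.

```python
import csv
import io

# Made-up sample standing in for a results TSV (assumption: real files
# have the same header style but many more columns).
SAMPLE_TSV = (
    "index\tseqName\tclade\tdeletions\n"
    "0\tsample\t20A\t201,28881-28882\n"
    "1\tsample\t20B\t\n"
)


def parse_ranges(cell):
    """Turn a cell such as '201,28881-28882' into [(201, 201), (28881, 28882)].

    The tuples carry the two numbers exactly as written in the cell;
    a single position is repeated as both ends.
    """
    ranges = []
    for item in filter(None, cell.split(",")):
        begin, _, end = item.partition("-")
        ranges.append((int(begin), int(end or begin)))
    return ranges


def rows_by_index(tsv_text):
    """Key rows by the `index` column, because `seqName` values may repeat."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {int(row["index"]): row for row in reader}


rows = rows_by_index(SAMPLE_TSV)
deleted = parse_ranges(rows[0]["deletions"])
```

Both sample rows share the same `seqName`, so a dictionary keyed by name would silently drop one of them, while keying by `index` keeps both. The same list-splitting approach should apply to other "comma separated list of strings" columns, though their item formats differ.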