Add sections on sampling methods
Reword the subsampling introduction with *what* it is, followed by
examples on *why* paired with *how*.

This also allows future sampling methods such as weighted sampling to be
added by simply including a new section.
victorlin committed Aug 19, 2024
1 parent 98134e1 commit 65f8fe4
Showing 1 changed file with 61 additions and 5 deletions.
66 changes: 61 additions & 5 deletions src/guides/bioinformatics/filtering-and-subsampling.rst
@@ -114,11 +114,56 @@ Subsampling within ``augur filter``
Subsampling is applied after all standard filter options and before
force-inclusive filter options.

Another common filtering operation is subsetting of data to achieve a more
even spatio-temporal distribution or to cut-down data set size to more
manageable numbers. The filter command allows you to partition the data into
groups based on column values and sample uniformly. For example, target one
sequence per month from each country:
Another common filtering operation is **subsampling**: selection of data using
rules based on output size rather than individual sequence attributes. The
sections below describe the sampling methods supported by ``augur filter``,
followed by a section on caveats:

.. contents::
   :local:

Random sampling
---------------

The simplest scenario is a reduction of dataset size to more manageable numbers.
For example, limit the output to 100 sequences:

.. code-block:: bash

  augur filter \
    --sequences data/sequences.fasta \
    --metadata data/metadata.tsv \
    --min-date 2012 \
    --exclude exclude.txt \
    --subsample-max-sequences 100 \
    --output-sequences subsampled_sequences.fasta \
    --output-metadata subsampled_metadata.tsv

Random sampling is easy to define but can expose sampling bias in some datasets.
Consider uniform sampling to reduce sampling bias.
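
As a rough illustration of why random sampling can reflect bias already present
in the input, the Python sketch below draws a simple random sample from a
made-up dataset in which one country is heavily over-represented. The record
layout, country names, and counts are hypothetical, and this is not how
``augur filter`` is implemented.

```python
import random
from collections import Counter

# Hypothetical input: 90 sequences from country A, 10 from country B.
records = [{"strain": f"A{i}", "country": "A"} for i in range(90)]
records += [{"strain": f"B{i}", "country": "B"} for i in range(10)]

rng = random.Random(0)  # fixed seed so the draw is repeatable
sample = rng.sample(records, 20)  # every record is equally likely to be chosen

# Count how many sampled records come from each country.
counts = Counter(record["country"] for record in sample)
print(counts)
```

Because every record has the same chance of being selected, the sample tends to
mirror the proportions of the input, so the over-represented country usually
dominates the output as well.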

Uniform sampling
----------------

``--group-by`` allows you to partition the data into groups based on column
values and sample uniformly. For example, sample evenly across countries over
time:

.. code-block:: bash

  augur filter \
    --sequences data/sequences.fasta \
    --metadata data/metadata.tsv \
    --min-date 2012 \
    --exclude exclude.txt \
    --group-by country year month \
    --subsample-max-sequences 100 \
    --output-sequences subsampled_sequences.fasta \
    --output-metadata subsampled_metadata.tsv

An alternative to ``--subsample-max-sequences`` is ``--sequences-per-group``.
This is useful if you care less about total sample size and more about having
a fixed number of sequences from each group. For example, target one sequence
per month from each country:

.. code-block:: bash
@@ -132,6 +177,17 @@ sequence per month from each country:
    --output-sequences subsampled_sequences.fasta \
    --output-metadata subsampled_metadata.tsv

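
The idea of grouping by column values and taking a fixed number of sequences
from each group can be sketched in Python as follows. This is a minimal
illustration under stated assumptions, not the logic used by ``augur filter``:
the function ``sample_per_group``, the record layout, and the example values
are all hypothetical.

```python
import random
from collections import defaultdict

def sample_per_group(records, group_by, sequences_per_group, seed=0):
    """Pick up to `sequences_per_group` records from each group (illustrative only)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        # A group is identified by the values of the chosen columns.
        key = tuple(record[column] for column in group_by)
        groups[key].append(record)

    sampled = []
    for key in sorted(groups):
        members = groups[key]
        # A group cannot contribute more records than it contains.
        k = min(sequences_per_group, len(members))
        sampled.extend(rng.sample(members, k))
    return sampled

# Hypothetical metadata: three (country, year, month) groups of unequal size.
records = [
    {"strain": "s1", "country": "X", "year": 2020, "month": 1},
    {"strain": "s2", "country": "X", "year": 2020, "month": 1},
    {"strain": "s3", "country": "X", "year": 2020, "month": 1},
    {"strain": "s4", "country": "X", "year": 2020, "month": 2},
    {"strain": "s5", "country": "Y", "year": 2020, "month": 1},
    {"strain": "s6", "country": "Y", "year": 2020, "month": 1},
]

subsample = sample_per_group(records, ["country", "year", "month"], 1)
print(len(subsample))  # 3: one record from each of the three groups
```

With one sequence per group, the large group for country X in month 1
contributes no more than the single-sequence group for month 2.
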
Caveats
-------

For these sampling methods, the number of targeted sequences per group does not
take into account the actual number of sequences available in the input data.
For example, consider a dataset with 200 sequences available from 2023 and 100
sequences available from 2024. ``--group-by year --subsample-max-sequences 300``
is equivalent to ``--group-by year --sequences-per-group 150``. This will take
150 sequences from 2023 and all 100 sequences from 2024 for a total of 250
sequences, which is less than the target of 300.
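
The arithmetic in this example can be checked with a short Python sketch. It
reproduces only the numbers described above and assumes the target is split
evenly across groups; the function name ``take_per_group`` is hypothetical and
this is not the actual ``augur filter`` implementation.

```python
def take_per_group(available_per_group, max_sequences):
    """Split the target evenly across groups, capped by what each group has."""
    per_group = max_sequences // len(available_per_group)
    taken = {
        group: min(available, per_group)
        for group, available in available_per_group.items()
    }
    return per_group, taken

# 200 sequences available from 2023 and 100 from 2024, with a target of 300.
per_group, taken = take_per_group({"2023": 200, "2024": 100}, 300)
total = sum(taken.values())

print(per_group)  # 150
print(taken)      # {'2023': 150, '2024': 100}
print(total)      # 250
```

The shortfall of 50 sequences arises because the 2024 group cannot fill its
share and the unused portion is not reassigned to 2023.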

Subsampling using multiple ``augur filter`` commands
====================================================
