layout | title | date_published | date_modified | author | maintainer | logo | logo_alt |
---|---|---|---|---|---|---|---|
docpost | CLIMB-COVID Data Changelog | 2021-04-22 14:00:00 +0000 | 2021-05-14 17:45:00 +0000 | samstudio8 | samstudio8 | assets/dipi-patch.png | CLIMB-COVID DIPI Mission Patch |

All notable changes to CLIMB-COVID APIs, data or interchange formats that affect users or other pipelines should be documented in this file. Changes described here may be only a subset of all changes to a project, as this log concerns itself only with changes that impact how data is provided or consumed by users or other pipelines. The following DIPI projects routinely use this CHANGELOG:
- Majora API -- metadata APIs
- Ocarina -- Majora command line client (Full changelog)
- Elan -- inbound data pipeline
- Tael -- MQTT messaging tools
- Asklepian -- Outbound PHE pipeline
- CLIMB-COVID -- metaprojects (eg. status page, data page)
- Foel -- second generation CLIMB-COVID ingest system

The format is based on Keep a Changelog.
- Elan v2 supersedes Elan v1; all data processed today and onward will pass through Elan v2
- Effective immediately, "Test Majora", lovingly known as "Majora Magenta", has moved URL to `https://majora-test.covid19.climb.ac.uk/`.
  - The test database has not been migrated and users will need to register new accounts and new OAuth applications.
- An additional column, `ambiguities`, providing a pipe separated list of ambiguous regions has been added to metadata outputs including mutations.
- The locations of individual BAM and BAI files are being migrated over the weekend. While all care is being taken to minimise impact to users, attempts to access individual BAM and BAI files may fail over this weekend period.
- Foel will now reject BAM submissions if the BAM file is larger than 2 GB (defined as 2e9 bytes)
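
Submitters who want to avoid a rejected upload could check the file size locally first. The following is a minimal sketch, not Foel's actual check: the function name `bam_within_limit` is made up for this example, and only the 2e9 byte threshold comes from the entry above.

```python
import os

# Threshold taken from the changelog entry: 2 GB, defined as 2e9 bytes.
MAX_BAM_BYTES = int(2e9)


def bam_within_limit(path, limit=MAX_BAM_BYTES):
    """Return True if the file at `path` is not larger than `limit` bytes.

    Hypothetical helper for illustration; Foel's own check may differ.
    """
    return os.path.getsize(path) <= limit
```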
- The "individual FASTA" directory at `/bham/artifacts/published/fasta` has been removed
- The "individual BAM" directory at `/bham/artifacts/published/alignment` has been removed
- New FASTA files are no longer linked in `/cephfs/covid/bham/artifacts/published/fasta`
  - Users should extract their sequences directly from the daily consensus FASTA (`elan.consensus.fasta`), leveraging its index. A sequence extraction utility (`seq_extract`) is available via our utilities repository.
- New BAM files are no longer linked in `/cephfs/covid/bham/artifacts/published/alignment`
  - Users must now resolve the location of BAM files using `/cephfs/covid/artifacts/elan/latest/majora.pag_lookup.tsv`
- As a side effect of an update to the publishing pipeline, the `/cephfs/covid/bham/artifacts/published/20220128` directory was accidentally removed. As we are working towards removing these directories anyway, the change will remain and these "dated artifact" directories will no longer be published.
  - Users can continue to use `/cephfs/covid/bham/artifacts/published/latest`, which will contain the latest artifacts to maintain compatibility
- The files detailing which samples were missing and removed by Elan have moved from `/cephfs/covid/bham/artifacts/published/latest/summary` to `/cephfs/covid/artifacts/elan/latest/`:
  - Use `elan.missing.ls` for determining why samples were not ingested by Elan (missing metadata or missing files)
  - Use `elan.quickcheck.ls` for samples rejected by Elan screening (invalid FASTA or BAM)
- Created a separate nextflow for geography cleaning steps, moving them earlier in the pipeline. Results should be unaffected.
- `/cephfs/covid/artifacts` is the new top-level home for artifacts
  - Moving the location of the artifacts is necessary for longer term maintenance of the CLIMB-COVID project and additionally allows us to work toward solving the infamous "big dir" problem.
- The FASTA header for the daily `elan.consensus.fasta` will no longer contain the pipe delimited "row number" and will now only contain the Published Artifact Group (PAG) name. The FAI index will therefore only contain the PAG names, making it easier to maintain random access to sequences. Pipe delimited metadata may be added again in future as a sequence comment, rather than as part of the sequence header.
  - Python code using `seq_name.split('|')[0]` to parse this header will be unaffected, as the split will succeed, but developers can now just use `seq_name`.
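
The compatibility point can be checked with a small example. The header strings below are made-up placeholders rather than real PAG names, and the helper name is invented for this sketch.

```python
def pag_name_from_header(seq_name):
    """Return the PAG name from either header style.

    Old style: "<pag_name>|<row number>"; new style: "<pag_name>" only.
    """
    return seq_name.split('|')[0]


# Placeholder headers, not real PAG names.
old_style = "EXAMPLE-PAG/1|42"
new_style = "EXAMPLE-PAG/1"
```

Both `pag_name_from_header(old_style)` and `pag_name_from_header(new_style)` give the same name, because `str.split` returns the whole string as its first element when the delimiter is absent.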
- `/cephfs/covid/artifacts/elan/latest/majora.pag_lookup.tsv` now allows users to write scripts to look up (`central_sample_id` AND `run_name`) tuples, OR `pag_name`, to resolve the locations of published BAMs
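
A lookup script might take the following shape. This is a sketch under loud assumptions: this changelog does not describe the file layout, so the presence of a header row, the use of `central_sample_id`, `run_name` and `pag_name` as column names, and the `bam_path` column holding the BAM location are all hypothetical. Check the real file before relying on any of them.

```python
import csv


def load_pag_lookup(tsv_path, path_column="bam_path"):
    """Build two lookup tables from a tab-separated file with a header row.

    ASSUMPTION: the column names used here are illustrative; `bam_path`
    in particular is a hypothetical name for the BAM location column.
    """
    by_pag = {}
    by_sample_run = {}
    with open(tsv_path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            location = row[path_column]
            by_pag[row["pag_name"]] = location
            by_sample_run[(row["central_sample_id"], row["run_name"])] = location
    return by_pag, by_sample_run
```

Keeping both dictionaries lets a script resolve a BAM from whichever identifier it already has.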
- The individual FASTA and BAM directories at `/bham/artifacts/published/fasta` and `/bham/artifacts/published/alignment` will be removed without exception on 2022-01-31. These directories contain hundreds of thousands of symlinks and need to be removed as part of our solution to the "big dir" problem.
  - FASTA users: As per previous guidance, users should extract their sequences directly from the daily consensus FASTA (`elan.consensus.fasta`), leveraging its index. For users unsure how to do this effectively, we have now made a sequence extraction utility (`seq_extract`) available via our utilities repository.
  - BAM users: Users must now resolve the location of BAM files using the new `majora.pag_lookup.tsv`, first published today
- The minimum Ocarina client version for Majora to accept requests has been bumped to v0.44.0. Requests from clients below this version number will be rejected immediately.
- After consultation, the "individual QC" outputs are now considered deprecated and will be removed without warning in the near future. A new API service will allow users to fetch QC information as an alternative.
- The "individual FASTA" outputs are now considered deprecated and will be removed without warning in the immediate future. Users should follow our recommendation and ensure they are using the daily consensus FASTA and the corresponding FAI index to query for sequences as an alternative.
- Elan will now add files, metrics and QC reports to Majora as `service-elan`, not `nicholsz`
- Foel will now add empty biosamples, libraries and sequencing runs to Majora as `service-foel`, not `nicholsz`
- Elan will refuse to process FASTA files containing one or more non-IUPAC characters or `-`
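
Uploaders can pre-screen sequences along these lines. This is a minimal sketch, not Elan's implementation: the exact alphabet Elan accepts is not stated here, so the set below is the standard IUPAC nucleotide code list, and folding lowercase to uppercase is an assumption made for this example.

```python
# Standard IUPAC nucleotide codes. The gap character '-' is deliberately
# absent, so it is reported as invalid alongside any other stray symbol.
IUPAC_NUCLEOTIDES = frozenset("ACGTURYSWKMBDHVN")


def invalid_characters(sequence):
    """Return the sorted list of characters that are not IUPAC codes.

    ASSUMPTION: case is ignored; Elan's own handling of lowercase may differ.
    """
    return sorted(set(sequence.upper()) - IUPAC_NUCLEOTIDES)
```

An empty list means the sequence passes this check; anything else names the offending characters.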
- `api.artifact.biosample.addempty` supports using an optional `metadata` parameter to add key value metadata to empty biosamples
- `sample_route` is now a required field in the metadata CSV
- The minimal `csv_template_version` has been increased to `2` to coincide with the new `sample_route` field. Submissions using `csv_template_version` 1 will be rejected.
- Automated GISAID submissions are now made every day, with a seven day lag
- `INVITROGEN` added to `test_kit` ct validator
- Replace '?' with 'N' in input FASTA (FASTA input sequences with any other non-IUPAC characters continue to be rejected)
- Add flag `--score-N=0` to minimap2 command as part of MSA building step
- Reject all FASTA input sequences with non-IUPAC characters
- Temporary fix of replacing ? characters in incoming FASTA with Ns
- Add further parallelization and line count checks to handle file corruptions
- Attempted switch back to big tree - one off tree published, minus the microreact outputs
- Ran with refreshed base tree (~475k tips down from 1.4m tips)
- Majora will now correctly reject biosamples with a `collection_date` after the `received_date` with an error message: "Sample cannot be collected after it was received. Perhaps they have been swapped?"
- Added gzipped output files to s3 bucket
- Updated memory requirements for publish steps
- Parallelized minimap2 and alignment steps
- The head node has moved to different hardware. SSH users and sequence uploaders should be aware that the ECDSA key has been changed.
- Users should not be alarmed by the rather alarming message (below) that will be issued on the command line the first time they attempt to log in after this change:

      @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
      @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
      IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!

- Follow the instructions in the error message to remove the offending keys from your `known_hosts` file. The line numbers to remove will be listed in the message, for example this message would indicate you must remove line 8 from your `known_hosts` file:
Offending key for IP in /home/user/.ssh/known_hosts:8 <= number after the colon is the line number, this is just an example
- Ocarina 0.43.0 now allows OAuth authentication for `pag suppress`; users will need to update to 0.43.0 for the new configuration to work.
- Ocarina 0.42.1 uses exit codes as suggested by BSD `sysexits`. This may affect users catching particular non-zero exit codes from the Ocarina application.
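
Scripts that previously matched on specific exit codes may find a name table useful. The values below are the standard ones from BSD `sysexits.h`; which of them Ocarina returns for a given failure is not stated in this changelog, so treat the mapping as a decoding aid only. The helper name is invented for this sketch.

```python
# Exit codes defined by BSD sysexits.h, keyed by value.
SYSEXITS = {
    0: "EX_OK",
    64: "EX_USAGE",
    65: "EX_DATAERR",
    66: "EX_NOINPUT",
    67: "EX_NOUSER",
    68: "EX_NOHOST",
    69: "EX_UNAVAILABLE",
    70: "EX_SOFTWARE",
    71: "EX_OSERR",
    72: "EX_OSFILE",
    73: "EX_CANTCREAT",
    74: "EX_IOERR",
    75: "EX_TEMPFAIL",
    76: "EX_PROTOCOL",
    77: "EX_NOPERM",
    78: "EX_CONFIG",
}


def describe_exit_code(code):
    """Return the sysexits name for `code`, or a generic label otherwise."""
    return SYSEXITS.get(code, "non-sysexits code")
```

A wrapper could pass the `returncode` of a finished `subprocess.run` call to `describe_exit_code` to log a readable reason.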
- To improve performance, the MSA generation algorithm will be swapped from `datafunk` to `gofasta`.
- Users should be aware (but not concerned) of a small impact to the integrity of 14 sequences in the MSA (and downstream Asklepian tables).
- See the accompanying change advisory notice for further information.
- Majora will no longer accept a `collection_date` or `received_date` for a biosample where the year is not 2020 or higher, regardless of whether the biosample is being added or updated
- `mqtt-message` automatically adds `ts` key to payloads, containing the UNIX epoch time
  - Recommended that users no longer use the `date` and `time` fields for anything other than human readability
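
Consumers switching from the `date` and `time` fields to `ts` could decode it as follows. This sketch assumes the payload is JSON and that `ts` is expressed in seconds since the epoch, neither of which is stated in this entry. The function name is made up for illustration.

```python
import json
from datetime import datetime, timezone


def payload_timestamp(raw_payload):
    """Return the payload's `ts` value as a timezone-aware UTC datetime.

    ASSUMPTION: the payload is a JSON object and `ts` counts seconds.
    """
    payload = json.loads(raw_payload)
    return datetime.fromtimestamp(payload["ts"], tz=timezone.utc)
```

Working from the epoch value avoids parsing locale or format specific date strings.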
- Now runs `gofasta updown list` and creates output files for CIVET3
- Adds a metadata output to `cog/UTLA_genome_counts_<date>.csv`
- Excludes samples labelled to omit when the `why_excluded` column is not in published metadata output to avoid mysterious duplicate rows
- Files and directories in user upload dirs (`climb-covid19-user/upload`) will be periodically scanned and deleted if they are more than two weeks old.
  - This first run on June 21st will remove data older than June 7th
- Users with the `force_add_biosampleartifact` scope can now add `sender_sample_id` to blank biosamples created through the `biosample.addempty` endpoint (Majora biosample.addempty docs), regardless of whether the sample was previously added by `addempty`
  - This does not change existing behaviour that prevents samples that have been "filled in" with full metadata via `biosample.add` from being modified by `biosample.addempty`
- Started performing geography cleaning of all adm1s (largely global cleaning)
- Users with the `force_add_biosampleartifact` scope can now add `sender_sample_id` to blank biosamples created through the `biosample.addempty` endpoint (Majora biosample.addempty docs)
- Existing biosample artifacts that were collected and/or received more than a year ago can now be updated through the metadata uploader, or Ocarina (without `--partial`) as the past-date checks are now skipped for existing data
  - This does not change the behaviour of `collection_date` or `received_date` from over 365 days ago being rejected for new samples
  - `collection_date` and `received_date` set to a future date are still rejected regardless of whether the sample exists or not
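
The date rules described above can be summarised in a short sketch. It illustrates the stated behaviour and is not Majora's validation code: the function name and the problem messages are invented for this example.

```python
from datetime import date, timedelta


def date_problems(value, today, sample_exists):
    """Return a list of reasons a collection or received date is rejected.

    Future dates are always rejected. Dates more than 365 days in the past
    are rejected only for new samples, as the check is skipped for
    existing ones. Messages are hypothetical, not Majora's wording.
    """
    problems = []
    if value > today:
        problems.append("date is in the future")
    elif not sample_exists and value < today - timedelta(days=365):
        problems.append("date is more than 365 days ago")
    return problems
```

For example, with `today = date(2021, 6, 1)`, a `collection_date` of `date(2020, 1, 1)` is flagged for a new sample but passes for one that already exists.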
- Ocarina 0.40.2 `ocarina empty biosample` now takes an additional `--sender-sample-id` option (Ocarina Changelog)
- Makes civet output directories identical to previous except for explicitly containing "public" or "private" in file names
- Adds additional masking before usher updates at sites which are known to cause usher problems placing on a tree
- Fixed phylotyping bug (change to `clusterfunk`)
- Will be published to `latest`; today's phylopipe1 output, when it completes in 2 days' time, will be published to `old`
- New columns added to metadata outputs in line with pangolin changes are now all populated
- The pillar_2 column has been renamed is_pillar_2 and contains Y/N for consistency with incoming pipeline
- Bug fixes adding cleaned geography to metadata
- Additional requested columns/metadata outputs published
- Major release of pangolin v3.0 with downstream changes in datapipe, phylopipe, pathogenwatch expected in coming days
- Pangolin output now includes a new `version` column containing information about inference engine (pangoLEARN, usher or designation hash) and data release on which assignments were based, to be used instead of pangoLEARN_version column
- Pangolin outputs constellation calls made by scorpio for a number of VOCs/VUIs
- Patched a regression in `mqtt-client.py` that caused clients not requiring any environment variables (`--envreq`) to silently fail to start their specified command
  - Clients not specifying `--envreq` should be restarted as soon as possible
- Beta release of phylopipe2.0 published daily
- `mqtt-client.py` automatically subscribes clients to a "control topic" named `COGUK/infrastructure/pipelines/<who>/control`, where `who` is the name of the pipeline provided to `--who`
- Pipelines can now be manually raised by sending messages to the client control topic, specifying an `action` key with the value of `raise`
- The `started` message emitted by `mqtt-client.py` now includes a `reason` key, explaining why the pipeline has started
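
To make the control topic concrete, the sketch below builds the topic name and a message body for raising a pipeline. It assumes the message is a JSON object and that `action` is the only key needed, which this entry does not confirm. Publishing the message with an MQTT client is left out, and both function names are hypothetical.

```python
import json


def control_topic(who):
    """Return the control topic for the pipeline named `who`."""
    return f"COGUK/infrastructure/pipelines/{who}/control"


def raise_message():
    """Return a JSON body asking the client to raise its pipeline.

    ASSUMPTION: a JSON object with only the `action` key is sufficient.
    """
    return json.dumps({"action": "raise"})
```

For a pipeline started with `--who elan`, `control_topic("elan")` gives `COGUK/infrastructure/pipelines/elan/control`.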
- Updated minimap2 sequence divergence threshold preset from 5% to 20% divergence as increasing numbers of (primarily GISAID) sequences were being lost from the MSA
- Passes `uk_lineage` metadata column through from previous datapipe run for use by phylopipe
- New fields added to Genome table
  - Field `Adm1` added to Genome table
  - Field `Pillar` added to Genome table
  - Field `Published_date` added to Genome table
- CLIMB-COVID Status Page is now automatically updated every day
- Payload keys targeted by `--envreq` are copied through to the output payload if `--envprefix` has not been provided. `--envprefix` still copies all payload keys to the output with the specified prefix.
- Payload keys that will be passed through to the `finished` output payload will also be emitted in the `start` output payload
- `--payload-passthrough` parameter allows keys from the input payload to be automatically copied to the `finished` output payload
- Elan will no longer sort uploaded BAMs. User guidance has always been that BAMs should be sorted before upload, so this step is a waste of compute resources and time. Unsorted BAMs will now be considered invalid and rejected by Elan as part of the quickcheck report.
- Introduced this CHANGELOG to track notable changes to data and pipeline interchange formats. Significant changes before this date are out of the scope of this CHANGELOG.