Merge pull request #261 from invitae/main
Merge Invitae local changes used to build recent UTA
Showing 87 changed files with 7,416 additions and 424 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,25 @@ | ||
name: Continuous Integration | ||
|
||
on: | ||
push: | ||
branches: | ||
- main | ||
pull_request: | ||
branches: | ||
- main | ||
merge_group: | ||
types: | ||
- checks_requested | ||
|
||
jobs: | ||
test: | ||
name: Run tests | ||
runs-on: ubuntu-latest | ||
timeout-minutes: 10 | ||
steps: | ||
- name: Checkout | ||
uses: actions/checkout@v3 | ||
- name: Build image | ||
run: docker build --target uta-test -t uta-test . | ||
- name: Run tests | ||
run: docker run --rm uta-test python -m unittest |
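The two `run` steps in this workflow can also be replayed outside GitHub Actions. The following is a minimal sketch, assuming Docker is installed and the working directory is the repository root. The helper names `ci_commands` and `run_ci` and the injectable `runner` parameter are illustrative and not part of the commit.

```python
import shlex
import subprocess

# Commands copied from the workflow's "Build image" and "Run tests" steps.
CI_STEPS = [
    "docker build --target uta-test -t uta-test .",
    "docker run --rm uta-test python -m unittest",
]


def ci_commands(steps=CI_STEPS):
    """Split each step into an argv list suitable for subprocess.run."""
    return [shlex.split(step) for step in steps]


def run_ci(runner=subprocess.run):
    """Run the steps in order, stopping at the first failure.

    `runner` is injectable (a hypothetical convenience) so the sequence
    can be exercised without Docker installed.
    """
    for argv in ci_commands():
        # check=True raises CalledProcessError on a non-zero exit code.
        runner(argv, check=True)
```

Calling `run_ci()` with no arguments would invoke Docker directly. Passing a stub as `runner` only records what would have been executed.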
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
FROM ubuntu:22.04 as uta | ||
|
||
# set python version and define arguments | ||
ARG python_version="3.10" | ||
|
||
# list and install dependencies | ||
ARG dependencies="python${python_version} python3-dev python3-pip rsync git postgresql-client-14 tabix" | ||
|
||
RUN apt-get update && apt-get install -y $dependencies && apt-get clean | ||
|
||
# install pysam, copy code, and run pip install | ||
RUN ln -s /usr/bin/python3 /usr/bin/python | ||
RUN python -m pip install --upgrade pip | ||
RUN pip install --upgrade setuptools | ||
RUN pip install pysam | ||
|
||
WORKDIR /opt/repos/uta/ | ||
COPY pyproject.toml ./ | ||
COPY etc ./etc | ||
COPY misc ./misc | ||
COPY sbin ./sbin | ||
COPY src ./src | ||
RUN pip install -e .[dev] | ||
|
||
|
||
# UTA test image | ||
FROM uta as uta-test | ||
RUN DEBIAN_FRONTEND=noninteractive apt-get -yq install postgresql | ||
COPY tests ./tests | ||
RUN pip install -e .[test] | ||
RUN useradd uta-tester | ||
RUN chown -R uta-tester . | ||
USER uta-tester |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -203,8 +203,8 @@ you will not need to install PostgreSQL or any of its dependencies. | |
(code) version used to build the instance. | ||
|
||
$ psql -h localhost -U anonymous -d uta -c "select * from $uta_v.meta" | ||
key | value | ||
|
||
key | value | ||
----------------+-------------------------------------------------------------------- | ||
schema_version | 1.1 | ||
created on | 2015-08-21T10:53:50.666152 | ||
|
@@ -213,7 +213,7 @@ you will not need to install PostgreSQL or any of its dependencies. | |
(4 rows) | ||
|
||
6. (Optional) To configure [hgvs](https://github.com/biocommons/hgvs) | ||
to use this local installation, consult the | ||
to use this local installation, consult the | ||
[hgvs documentation](https://hgvs.readthedocs.io/en/latest/installation.html#local-installation-of-uta-optional) | ||
|
||
### Installing from database dumps | ||
|
@@ -253,6 +253,7 @@ the installation environment.* | |
|
||
## Developer Setup | ||
|
||
### Virtual Environment | ||
To develop UTA, follow these steps. | ||
|
||
1. Set up a virtual environment using your preferred method. | ||
|
@@ -272,3 +273,110 @@ To develop UTA, follow these steps. | |
4. To run the tests: | ||
|
||
$ python3 -m unittest | ||
|
||
### Docker | ||
|
||
1. Clone UTA and build docker image: | ||
|
||
$ git clone git@github.com:biocommons/uta.git | ||
$ cd uta | ||
$ docker build -t uta . | ||
|
||
2. Restore a database or load a new one using the instructions [above](#installing-from-database-dumps). | ||
|
||
3. Run the container with an interactive shell: | ||
|
||
$ docker run -it --rm uta bash | ||
|
||
4. Build the test image and run the tests: | ||
|
||
$ docker build --target uta-test -t uta-test . | ||
$ docker run --rm uta-test python -m unittest | ||
|
||
## UTA update procedure | ||
|
||
Requires docker. | ||
|
||
### 0. Setup | ||
|
||
Make directories: | ||
``` | ||
mkdir -p $(pwd)/ncbi-data | ||
mkdir -p $(pwd)/output/artifacts | ||
mkdir -p $(pwd)/output/logs | ||
``` | ||
|
||
Set variables: | ||
``` | ||
export UTA_ETL_OLD_UTA_IMAGE_TAG=uta_20210129b | ||
export UTA_ETL_OLD_UTA_VERSION=$UTA_ETL_OLD_UTA_IMAGE_TAG | ||
export UTA_ETL_NEW_UTA_VERSION=uta_20240512 | ||
export UTA_ETL_NCBI_DIR=./ncbi-data | ||
export UTA_ETL_WORK_DIR=./output/artifacts | ||
export UTA_ETL_LOG_DIR=./output/logs | ||
``` | ||
|
||
Build the UTA image: | ||
``` | ||
docker build --target uta -t uta-update . | ||
``` | ||
|
||
### 1. Download SeqRepo data | ||
``` | ||
docker compose run seqrepo-pull | ||
``` | ||
|
||
Note: pulling data takes ~30 minutes and requires ~13 GB. | ||
Note: a container called seqrepo will be left behind. | ||
|
||
### 2. Extract and transform data from NCBI | ||
|
||
Download files from NCBI, extract into intermediate files, and load into UTA and SeqRepo. | ||
|
||
See 2A for nuclear transcripts and 2B for mitochondrial transcripts. | ||
|
||
#### 2A. Nuclear transcripts | ||
``` | ||
docker compose run ncbi-download | ||
docker compose run uta-extract | ||
docker compose run seqrepo-load | ||
docker compose run uta-load | ||
``` | ||
|
||
#### 2B. Mitochondrial transcripts | ||
``` | ||
docker compose -f docker-compose.yml -f misc/mito-transcripts/docker-compose-mito-extract.yml run mito-extract | ||
docker compose run seqrepo-load | ||
docker compose run uta-load | ||
``` | ||
|
||
#### 2C. Manual splign transcripts | ||
To load splign-manual transcripts, the workflow expects an input txdata.yaml file and splign alignments. Define this path | ||
using the environment variable $UTA_SPLIGN_MANUAL_DIR. These file paths should exist: | ||
- `$UTA_SPLIGN_MANUAL_DIR/splign-manual/txdata.yaml` | ||
- `$UTA_SPLIGN_MANUAL_DIR/splign-manual/alignments/*.splign` | ||
|
||
[txdata.yaml](loading/data/splign-manual/txdata.yaml) defines the transcripts and their metadata. The [alignments dir](loading/data/splign-manual/alignments) contains the splign alignments. | ||
To run the workflow: | ||
``` | ||
export UTA_SPLIGN_MANUAL_DIR=$(pwd)/loading/data/splign-manual/ | ||
docker compose run splign-manual | ||
``` | ||
|
||
UTA has been updated and the database has been dumped into a pgd file in `UTA_ETL_WORK_DIR`. SeqRepo has been updated in place. | ||
|
||
|
||
## Migrations | ||
UTA uses alembic to manage database migrations. To auto-generate a migration: | ||
``` | ||
alembic -c etc/alembic.ini revision --autogenerate -m "description of the migration" | ||
``` | ||
This will create a migration script in the alembic/versions directory. | ||
Adjust the upgrade and downgrade function definitions. To apply the migration: | ||
``` | ||
alembic -c etc/alembic.ini upgrade head | ||
``` | ||
To reverse a migration, use `downgrade` with the number of steps to reverse. For example, to reverse the most recent migration: | ||
``` | ||
alembic -c etc/alembic.ini downgrade -1 | ||
``` |
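The update procedure in this README depends on six `UTA_ETL_*` environment variables being exported before any `docker compose run` command. The sketch below shows one way to check for them up front. The variable names come from the "Set variables" step; the function `missing_vars` is a hypothetical helper and not part of UTA.

```python
import os

# Variable names taken from the "Set variables" step of the update procedure.
REQUIRED_VARS = (
    "UTA_ETL_OLD_UTA_IMAGE_TAG",
    "UTA_ETL_OLD_UTA_VERSION",
    "UTA_ETL_NEW_UTA_VERSION",
    "UTA_ETL_NCBI_DIR",
    "UTA_ETL_WORK_DIR",
    "UTA_ETL_LOG_DIR",
)


def missing_vars(environ=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

An empty list from `missing_vars()` suggests the shell is ready for the compose steps. Any names returned still need to be exported.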
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
# docker compose file for the UTA update procedure | ||
|
||
version: '3' | ||
|
||
services: | ||
seqrepo-pull: | ||
user: root | ||
image: uta-update | ||
command: sbin/seqrepo-pull | ||
volumes: | ||
- seqrepo-volume:/biocommons/dl.biocommons.org/seqrepo | ||
network_mode: host | ||
ncbi-download: | ||
image: uta-update | ||
command: sbin/ncbi-download etc/ncbi-files.txt /ncbi-dir | ||
volumes: | ||
- .:/opt/repos/uta | ||
- ${UTA_ETL_NCBI_DIR}:/ncbi-dir | ||
working_dir: /opt/repos/uta | ||
network_mode: host | ||
uta-extract: | ||
image: uta-update | ||
command: sbin/uta-extract /ncbi-dir /uta-extract/work /uta-extract/logs | ||
volumes: | ||
- ${UTA_ETL_NCBI_DIR}:/ncbi-dir | ||
- ${UTA_ETL_WORK_DIR}:/uta-extract/work | ||
- ${UTA_ETL_LOG_DIR}:/uta-extract/logs | ||
working_dir: /opt/repos/uta | ||
network_mode: host | ||
seqrepo-load: | ||
image: uta-update | ||
command: sbin/seqrepo-load /seqrepo-load/work /seqrepo-load/logs | ||
volumes: | ||
- seqrepo-volume:/biocommons/dl.biocommons.org/seqrepo | ||
- ${UTA_ETL_WORK_DIR}:/seqrepo-load/work | ||
- ${UTA_ETL_LOG_DIR}:/seqrepo-load/logs | ||
working_dir: /opt/repos/uta | ||
network_mode: host | ||
uta: | ||
container_name: uta | ||
image: biocommons/uta:${UTA_ETL_OLD_UTA_IMAGE_TAG} | ||
environment: | ||
- POSTGRES_HOST_AUTH_METHOD=trust | ||
healthcheck: | ||
test: psql -h localhost -U anonymous -d uta -c "select * from ${UTA_ETL_OLD_UTA_IMAGE_TAG}.meta" | ||
interval: 10s | ||
retries: 80 | ||
network_mode: host | ||
uta-load: | ||
image: uta-update | ||
command: sbin/uta-load ${UTA_ETL_OLD_UTA_VERSION} ${UTA_ETL_NEW_UTA_VERSION} /ncbi-dir /uta-load/work /uta-load/logs | ||
depends_on: | ||
uta: | ||
condition: service_healthy | ||
volumes: | ||
- seqrepo-volume:/biocommons/dl.biocommons.org/seqrepo | ||
- ${UTA_ETL_WORK_DIR}:/uta-load/work | ||
- ${UTA_ETL_LOG_DIR}:/uta-load/logs | ||
network_mode: host | ||
|
||
volumes: | ||
seqrepo-volume: |
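The `volumes` entries in this compose file use `${VAR}` references that are filled in from the environment. As a rough preview of the resulting bind mounts, the sketch below expands the `uta-extract` entries with Python's `string.Template`. This only approximates Compose interpolation, whose rules differ in places (for example in how unset variables are handled), and `preview_mounts` is an illustrative name.

```python
from string import Template

# Volume entries copied from the uta-extract service.
UTA_EXTRACT_VOLUMES = [
    "${UTA_ETL_NCBI_DIR}:/ncbi-dir",
    "${UTA_ETL_WORK_DIR}:/uta-extract/work",
    "${UTA_ETL_LOG_DIR}:/uta-extract/logs",
]


def preview_mounts(volumes, env):
    """Expand ${VAR} references and split each entry into (host, container)."""
    mounts = []
    for entry in volumes:
        # substitute() raises KeyError when a referenced variable is missing.
        expanded = Template(entry).substitute(env)
        host, container = expanded.split(":", 1)
        mounts.append((host, container))
    return mounts
```

With the values from the README's setup step (`./ncbi-data`, `./output/artifacts`, `./output/logs`), each host path is paired with the container path the service expects.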
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,116 @@ | ||
# A generic, single database configuration. | ||
|
||
[alembic] | ||
# path to migration scripts | ||
script_location = src/alembic | ||
|
||
# template used to generate migration file names; The default value is %%(rev)s_%%(slug)s | ||
# Uncomment the line below if you want the files to be prepended with date and time | ||
# see https://alembic.sqlalchemy.org/en/latest/tutorial.html#editing-the-ini-file | ||
# for all available tokens | ||
# file_template = %%(year)d_%%(month).2d_%%(day).2d_%%(hour).2d%%(minute).2d-%%(rev)s_%%(slug)s | ||
|
||
# sys.path path, will be prepended to sys.path if present. | ||
# defaults to the current working directory. | ||
prepend_sys_path = . | ||
|
||
# timezone to use when rendering the date within the migration file | ||
# as well as the filename. | ||
# If specified, requires the python>=3.9 or backports.zoneinfo library. | ||
# Any required deps can be installed by adding `alembic[tz]` to the pip requirements | ||
# string value is passed to ZoneInfo() | ||
# leave blank for localtime | ||
# timezone = | ||
|
||
# max length of characters to apply to the | ||
# "slug" field | ||
# truncate_slug_length = 40 | ||
|
||
# set to 'true' to run the environment during | ||
# the 'revision' command, regardless of autogenerate | ||
# revision_environment = false | ||
|
||
# set to 'true' to allow .pyc and .pyo files without | ||
# a source .py file to be detected as revisions in the | ||
# versions/ directory | ||
# sourceless = false | ||
|
||
# version location specification; This defaults | ||
# to alembic/versions. When using multiple version | ||
# directories, initial revisions must be specified with --version-path. | ||
# The path separator used here should be the separator specified by "version_path_separator" below. | ||
# version_locations = %(here)s/bar:%(here)s/bat:alembic/versions | ||
|
||
# version path separator; As mentioned above, this is the character used to split | ||
# version_locations. The default within new alembic.ini files is "os", which uses os.pathsep. | ||
# If this key is omitted entirely, it falls back to the legacy behavior of splitting on spaces and/or commas. | ||
# Valid values for version_path_separator are: | ||
# | ||
# version_path_separator = : | ||
# version_path_separator = ; | ||
# version_path_separator = space | ||
version_path_separator = os # Use os.pathsep. Default configuration used for new projects. | ||
|
||
# set to 'true' to search source files recursively | ||
# in each "version_locations" directory | ||
# new in Alembic version 1.10 | ||
# recursive_version_locations = false | ||
|
||
# the output encoding used when revision files | ||
# are written from script.py.mako | ||
# output_encoding = utf-8 | ||
|
||
sqlalchemy.url = postgresql://uta_admin:@localhost/uta | ||
|
||
|
||
[post_write_hooks] | ||
# post_write_hooks defines scripts or Python functions that are run | ||
# on newly generated revision scripts. See the documentation for further | ||
# detail and examples | ||
|
||
# format using "black" - use the console_scripts runner, against the "black" entrypoint | ||
# hooks = black | ||
# black.type = console_scripts | ||
# black.entrypoint = black | ||
# black.options = -l 79 REVISION_SCRIPT_FILENAME | ||
|
||
# lint with attempts to fix using "ruff" - use the exec runner, execute a binary | ||
# hooks = ruff | ||
# ruff.type = exec | ||
# ruff.executable = %(here)s/.venv/bin/ruff | ||
# ruff.options = --fix REVISION_SCRIPT_FILENAME | ||
|
||
# Logging configuration | ||
[loggers] | ||
keys = root,sqlalchemy,alembic | ||
|
||
[handlers] | ||
keys = console | ||
|
||
[formatters] | ||
keys = generic | ||
|
||
[logger_root] | ||
level = WARN | ||
handlers = console | ||
qualname = | ||
|
||
[logger_sqlalchemy] | ||
level = WARN | ||
handlers = | ||
qualname = sqlalchemy.engine | ||
|
||
[logger_alembic] | ||
level = INFO | ||
handlers = | ||
qualname = alembic | ||
|
||
[handler_console] | ||
class = StreamHandler | ||
args = (sys.stderr,) | ||
level = NOTSET | ||
formatter = generic | ||
|
||
[formatter_generic] | ||
format = %(levelname)-5.5s [%(name)s] %(message)s | ||
datefmt = %H:%M:%S |
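To confirm which database a migration would target, the `sqlalchemy.url` setting can be read with the standard library alone. This is a minimal sketch that uses a trimmed copy of the `[alembic]` section above. It does not go through Alembic's own configuration loader, and `database_target` is a hypothetical helper.

```python
import configparser
from urllib.parse import urlsplit

# A trimmed copy of the [alembic] section shown in the diff.
ALEMBIC_INI = """
[alembic]
script_location = src/alembic
prepend_sys_path = .
sqlalchemy.url = postgresql://uta_admin:@localhost/uta
"""


def database_target(ini_text=ALEMBIC_INI):
    """Return (user, host, database) from the sqlalchemy.url setting."""
    # interpolation=None keeps literal % tokens from being expanded.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    url = urlsplit(parser["alembic"]["sqlalchemy.url"])
    return url.username, url.hostname, url.path.lstrip("/")
```

For the URL in this file, the result identifies the `uta_admin` role on `localhost` and the `uta` database.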