Merge pull request #261 from invitae/main
Merge Invitae local changes used to build recent UTA
reece authored Nov 18, 2024
2 parents 53243ea + 684c804 commit 01c13ca
Showing 87 changed files with 7,416 additions and 424 deletions.
25 changes: 25 additions & 0 deletions .github/workflows/test.yml
@@ -0,0 +1,25 @@
name: Continuous Integration

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main
  merge_group:
    types:
      - checks_requested

jobs:
  test:
    name: Run tests
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - name: Checkout
        uses: actions/checkout@v3
      - name: Build image
        run: docker build --target uta-test -t uta-test .
      - name: Run tests
        run: docker run --rm uta-test python -m unittest
33 changes: 33 additions & 0 deletions Dockerfile
@@ -0,0 +1,33 @@
FROM ubuntu:22.04 as uta

# set python version and define arguments
ARG python_version="3.10"

# list and install dependencies
ARG dependencies="python${python_version} python3-dev python3-pip rsync git postgresql-client-14 tabix"

RUN apt-get update && apt-get install -y $dependencies && apt-get clean

# install pysam, copy code, and run pip install
RUN ln -s /usr/bin/python3 /usr/bin/python
RUN python -m pip install --upgrade pip
RUN pip install --upgrade setuptools
RUN pip install pysam

WORKDIR /opt/repos/uta/
COPY pyproject.toml ./
COPY etc ./etc
COPY misc ./misc
COPY sbin ./sbin
COPY src ./src
RUN pip install -e .[dev]


# UTA test image
FROM uta as uta-test
RUN DEBIAN_FRONTEND=noninteractive apt-get -yq install postgresql
COPY tests ./tests
RUN pip install -e .[test]
RUN useradd uta-tester
RUN chown -R uta-tester .
USER uta-tester
114 changes: 111 additions & 3 deletions README.md
@@ -203,8 +203,8 @@ you will not need to install PostgreSQL or any of its dependencies.
(code) version used to build the instance.

$ psql -h localhost -U anonymous -d uta -c "select * from $uta_v.meta"
key | value
----------------+--------------------------------------------------------------------
schema_version | 1.1
created on | 2015-08-21T10:53:50.666152
@@ -213,7 +213,7 @@ you will not need to install PostgreSQL or any of its dependencies.
(4 rows)

6. (Optional) To configure [hgvs](https://github.com/biocommons/hgvs)
to use this local installation, consult the
[hgvs documentation](https://hgvs.readthedocs.io/en/latest/installation.html#local-installation-of-uta-optional)

### Installing from database dumps
@@ -253,6 +253,7 @@ the installation environment.*

## Developer Setup

### Virtual Environment
To develop UTA, follow these steps.

1. Set up a virtual environment using your preferred method.
@@ -272,3 +273,110 @@ To develop UTA, follow these steps.
4. To run the tests:

$ python3 -m unittest

### Docker

1. Clone UTA and build docker image:

$ git clone git@github.com:biocommons/uta.git
$ cd uta
$ docker build -t uta .

2. Restore a database or load a new one using the instructions [above](#installing-from-database-dumps).

3. Run the container:

$ docker run -it --rm uta bash

4. Build the test image and run the tests:

$ docker build --target uta-test -t uta-test .
$ docker run --rm uta-test python -m unittest

## UTA update procedure

Requires docker.

### 0. Setup

Make directories:
```
mkdir -p $(pwd)/ncbi-data
mkdir -p $(pwd)/output/artifacts
mkdir -p $(pwd)/output/logs
```

Set variables:
```
export UTA_ETL_OLD_UTA_IMAGE_TAG=uta_20210129b
export UTA_ETL_OLD_UTA_VERSION=$UTA_ETL_OLD_UTA_IMAGE_TAG
export UTA_ETL_NEW_UTA_VERSION=uta_20240512
export UTA_ETL_NCBI_DIR=./ncbi-data
export UTA_ETL_WORK_DIR=./output/artifacts
export UTA_ETL_LOG_DIR=./output/logs
```

Build the UTA image:
```
docker build --target uta -t uta-update .
```
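
Before moving on to the steps below, it can help to confirm that the setup is complete. The following is a minimal sketch, not part of UTA: it checks that the variables exported above are set and that the three directory variables point at existing directories. The function name `check_etl_env` is hypothetical; only the variable names come from this README.

```python
import os
from pathlib import Path

# Variable names are taken from the "Set variables" step above.
REQUIRED_VARS = (
    "UTA_ETL_OLD_UTA_IMAGE_TAG",
    "UTA_ETL_OLD_UTA_VERSION",
    "UTA_ETL_NEW_UTA_VERSION",
    "UTA_ETL_NCBI_DIR",
    "UTA_ETL_WORK_DIR",
    "UTA_ETL_LOG_DIR",
)
# The three *_DIR variables are bind-mounted by docker compose.
DIR_VARS = tuple(name for name in REQUIRED_VARS if name.endswith("_DIR"))


def check_etl_env(env=None):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    for name in DIR_VARS:
        value = env.get(name)
        if value and not Path(value).is_dir():
            problems.append(f"{name}={value} is not an existing directory")
    return problems
```

Calling `check_etl_env()` with no arguments inspects the current environment; printing the returned list before running `docker compose` gives an early warning about unset variables or missing directories.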

### 1. Download SeqRepo data
```
docker compose run seqrepo-pull
```

Note: pulling data takes ~30 minutes and requires ~13 GB.
Note: a container called seqrepo will be left behind.

### 2. Extract and transform data from NCBI

Download files from NCBI, extract into intermediate files, and load into UTA and SeqRepo.

See 2A for nuclear transcripts, 2B for mitochondrial transcripts, and 2C for manually aligned (splign) transcripts.

#### 2A. Nuclear transcripts
```
docker compose run ncbi-download
docker compose run uta-extract
docker compose run seqrepo-load
docker compose run uta-load
```
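
The four commands in 2A appear to depend on one another (each step reads what the previous one wrote), so a failure should stop the sequence. The sketch below is one way to script that, assuming `docker compose` is on the `PATH`. The helper names `compose_run_argv` and `run_services` are hypothetical; the service names are copied from the commands above.

```python
import subprocess

# Service names are copied from step 2A; the order matters.
NUCLEAR_SERVICES = ("ncbi-download", "uta-extract", "seqrepo-load", "uta-load")


def compose_run_argv(service):
    """Build the argv for `docker compose run <service>`."""
    return ["docker", "compose", "run", service]


def run_services(services, runner=subprocess.run):
    """Run each service in order, stopping at the first failure.

    `runner` is injectable so the function can be exercised without Docker.
    Returns the list of services that completed.
    """
    completed = []
    for service in services:
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit status, which aborts the remaining steps.
        runner(compose_run_argv(service), check=True)
        completed.append(service)
    return completed
```

`run_services(NUCLEAR_SERVICES)` would then execute the same four commands as above, raising `subprocess.CalledProcessError` as soon as one of them fails.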

#### 2B. Mitochondrial transcripts
```
docker compose -f docker-compose.yml -f misc/mito-transcripts/docker-compose-mito-extract.yml run mito-extract
docker compose run seqrepo-load
docker compose run uta-load
```

#### 2C. Manual splign transcripts
To load splign-manual transcripts, the workflow expects an input txdata.yaml file and a set of splign alignments. Set their location with the
environment variable `UTA_SPLIGN_MANUAL_DIR`. The following paths must exist:
- `$UTA_SPLIGN_MANUAL_DIR/splign-manual/txdata.yaml`
- `$UTA_SPLIGN_MANUAL_DIR/splign-manual/alignments/*.splign`

[txdata.yaml](loading/data/splign-manual/txdata.yaml) defines the transcripts and their metadata. The [alignments dir](loading/data/splign-manual/alignments) contains the splign alignments.
To run the workflow:
```
export UTA_SPLIGN_MANUAL_DIR=$(pwd)/loading/data/splign-manual/
docker compose run splign-manual
```
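
The layout described above can be checked before starting the container. This is a minimal sketch, not part of UTA: the function name `check_splign_manual_dir` is hypothetical, and `base_dir` stands for the value of `UTA_SPLIGN_MANUAL_DIR`. It tests only for the two paths listed above.

```python
from pathlib import Path


def check_splign_manual_dir(base_dir):
    """Return a list of problems with the expected splign-manual layout.

    Looks for <base_dir>/splign-manual/txdata.yaml and at least one
    <base_dir>/splign-manual/alignments/*.splign file.
    """
    root = Path(base_dir) / "splign-manual"
    problems = []
    if not (root / "txdata.yaml").is_file():
        problems.append(f"missing {root / 'txdata.yaml'}")
    # glob() yields nothing if the alignments directory does not exist.
    if not list((root / "alignments").glob("*.splign")):
        problems.append(f"no *.splign files in {root / 'alignments'}")
    return problems
```

An empty return value means both expected paths were found.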

At this point UTA has been updated, and the database has been dumped into a pgd file in `UTA_ETL_WORK_DIR`. SeqRepo has been updated in place.


## Migrations
UTA uses alembic to manage database migrations. To auto-generate a migration:
```
alembic -c etc/alembic.ini revision --autogenerate -m "description of the migration"
```
This will create a migration script in the `versions` directory under the `script_location` configured in `etc/alembic.ini` (`src/alembic`).
Review the generated `upgrade` and `downgrade` functions and adjust them as needed. To apply the migration:
```
alembic -c etc/alembic.ini upgrade head
```
To reverse a migration, use `downgrade` with the number of steps to reverse. For example, to reverse the most recent migration:
```
alembic -c etc/alembic.ini downgrade -1
```
62 changes: 62 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,62 @@
# docker compose file for the UTA update procedure

version: '3'

services:
  seqrepo-pull:
    user: root
    image: uta-update
    command: sbin/seqrepo-pull
    volumes:
      - seqrepo-volume:/biocommons/dl.biocommons.org/seqrepo
    network_mode: host
  ncbi-download:
    image: uta-update
    command: sbin/ncbi-download etc/ncbi-files.txt /ncbi-dir
    volumes:
      - .:/opt/repos/uta
      - ${UTA_ETL_NCBI_DIR}:/ncbi-dir
    working_dir: /opt/repos/uta
    network_mode: host
  uta-extract:
    image: uta-update
    command: sbin/uta-extract /ncbi-dir /uta-extract/work /uta-extract/logs
    volumes:
      - ${UTA_ETL_NCBI_DIR}:/ncbi-dir
      - ${UTA_ETL_WORK_DIR}:/uta-extract/work
      - ${UTA_ETL_LOG_DIR}:/uta-extract/logs
    working_dir: /opt/repos/uta
    network_mode: host
  seqrepo-load:
    image: uta-update
    command: sbin/seqrepo-load /seqrepo-load/work /seqrepo-load/logs
    volumes:
      - seqrepo-volume:/biocommons/dl.biocommons.org/seqrepo
      - ${UTA_ETL_WORK_DIR}:/seqrepo-load/work
      - ${UTA_ETL_LOG_DIR}:/seqrepo-load/logs
    working_dir: /opt/repos/uta
    network_mode: host
  uta:
    container_name: uta
    image: biocommons/uta:${UTA_ETL_OLD_UTA_IMAGE_TAG}
    environment:
      - POSTGRES_HOST_AUTH_METHOD=trust
    healthcheck:
      test: psql -h localhost -U anonymous -d uta -c "select * from ${UTA_ETL_OLD_UTA_IMAGE_TAG}.meta"
      interval: 10s
      retries: 80
    network_mode: host
  uta-load:
    image: uta-update
    command: sbin/uta-load ${UTA_ETL_OLD_UTA_VERSION} ${UTA_ETL_NEW_UTA_VERSION} /ncbi-dir /uta-load/work /uta-load/logs
    depends_on:
      uta:
        condition: service_healthy
    volumes:
      - seqrepo-volume:/biocommons/dl.biocommons.org/seqrepo
      - ${UTA_ETL_WORK_DIR}:/uta-load/work
      - ${UTA_ETL_LOG_DIR}:/uta-load/logs
    network_mode: host

volumes:
  seqrepo-volume:
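
In the compose file above, `uta-load` is gated on the `uta` service becoming healthy, and the health check retries a `psql` query every 10 seconds up to 80 times, which gives the old database roughly 13 minutes to come up. The sketch below illustrates that gating idea in plain Python. It is only an approximation: Docker's own health check also involves per-check timeouts and counts consecutive failures, and the function name `wait_until_healthy` and the `check` callable are hypothetical.

```python
import time


def wait_until_healthy(check, interval=10.0, retries=80, sleep=time.sleep):
    """Poll `check()` until it returns True or `retries` attempts are used.

    Defaults mirror `interval: 10s` and `retries: 80` from the compose file.
    Returns the 1-based attempt number that succeeded, or None on give-up.
    `sleep` is injectable so the loop can be exercised without waiting.
    """
    for attempt in range(1, retries + 1):
        if check():
            return attempt
        if attempt < retries:
            # Wait before the next probe; no sleep after the final attempt.
            sleep(interval)
    return None
```

In practice `check` would wrap something like the `psql` query from the health check; here it is left abstract so the retry logic stands on its own.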
116 changes: 116 additions & 0 deletions etc/alembic.ini
@@ -0,0 +1,116 @@
# A generic, single database configuration.

[alembic]
# path to migration scripts
script_location = src/alembic

# template used to generate migration file names; The default value is %%(rev)s_%%(slug)s
# Uncomment the line below if you want the files to be prepended with date and time
# see https://alembic.sqlalchemy.org/en/latest/tutorial.html#editing-the-ini-file
# for all available tokens
# file_template = %%(year)d_%%(month).2d_%%(day).2d_%%(hour).2d%%(minute).2d-%%(rev)s_%%(slug)s

# sys.path path, will be prepended to sys.path if present.
# defaults to the current working directory.
prepend_sys_path = .

# timezone to use when rendering the date within the migration file
# as well as the filename.
# If specified, requires the python>=3.9 or backports.zoneinfo library.
# Any required deps can be installed by adding `alembic[tz]` to the pip requirements
# string value is passed to ZoneInfo()
# leave blank for localtime
# timezone =

# max length of characters to apply to the
# "slug" field
# truncate_slug_length = 40

# set to 'true' to run the environment during
# the 'revision' command, regardless of autogenerate
# revision_environment = false

# set to 'true' to allow .pyc and .pyo files without
# a source .py file to be detected as revisions in the
# versions/ directory
# sourceless = false

# version location specification; This defaults
# to alembic/versions. When using multiple version
# directories, initial revisions must be specified with --version-path.
# The path separator used here should be the separator specified by "version_path_separator" below.
# version_locations = %(here)s/bar:%(here)s/bat:alembic/versions

# version path separator; As mentioned above, this is the character used to split
# version_locations. The default within new alembic.ini files is "os", which uses os.pathsep.
# If this key is omitted entirely, it falls back to the legacy behavior of splitting on spaces and/or commas.
# Valid values for version_path_separator are:
#
# version_path_separator = :
# version_path_separator = ;
# version_path_separator = space
version_path_separator = os # Use os.pathsep. Default configuration used for new projects.

# set to 'true' to search source files recursively
# in each "version_locations" directory
# new in Alembic version 1.10
# recursive_version_locations = false

# the output encoding used when revision files
# are written from script.py.mako
# output_encoding = utf-8

sqlalchemy.url = postgresql://uta_admin:@localhost/uta


[post_write_hooks]
# post_write_hooks defines scripts or Python functions that are run
# on newly generated revision scripts. See the documentation for further
# detail and examples

# format using "black" - use the console_scripts runner, against the "black" entrypoint
# hooks = black
# black.type = console_scripts
# black.entrypoint = black
# black.options = -l 79 REVISION_SCRIPT_FILENAME

# lint with attempts to fix using "ruff" - use the exec runner, execute a binary
# hooks = ruff
# ruff.type = exec
# ruff.executable = %(here)s/.venv/bin/ruff
# ruff.options = --fix REVISION_SCRIPT_FILENAME

# Logging configuration
[loggers]
keys = root,sqlalchemy,alembic

[handlers]
keys = console

[formatters]
keys = generic

[logger_root]
level = WARN
handlers = console
qualname =

[logger_sqlalchemy]
level = WARN
handlers =
qualname = sqlalchemy.engine

[logger_alembic]
level = INFO
handlers =
qualname = alembic

[handler_console]
class = StreamHandler
args = (sys.stderr,)
level = NOTSET
formatter = generic

[formatter_generic]
format = %(levelname)-5.5s [%(name)s] %(message)s
datefmt = %H:%M:%S
2 changes: 1 addition & 1 deletion etc/global.conf
@@ -16,7 +16,7 @@ aligner = utaaa
fasta_directories =
aux/sequences2
aux/sequences
- seqrepo = /usr/local/share/seqrepo/latest
+ seqrepo = /biocommons/dl.biocommons.org/seqrepo/master

#data/manual
#data/bic/sequences.fasta.bgz