
Merge pull request #1 from invitae/batching-support
feat: Batching Support and Improved Performance
kazmiekr authored May 9, 2023
2 parents 1b18e78 + 6c9db04 commit a97f163
Showing 31 changed files with 2,680 additions and 580 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,2 +1,4 @@
GRCh37.primary_assembly.genome.fa.gz
GRCh37.primary_assembly.genome.fa.gz.fxi
pangolin/__pycache__
tests/__pycache__
54 changes: 42 additions & 12 deletions README.md
@@ -7,19 +7,17 @@ Pangolin can be run on Google Colab, which provides free access to GPUs and other
See below for information on usage and local installation.

### Installation
* Prerequisites: Python 3.6 or higher and conda, which can both be installed using Miniconda: https://docs.conda.io/en/latest/miniconda.html
* Install PyTorch: https://pytorch.org/get-started/locally/
* If a supported GPU is available, installation with GPU support is recommended (choose an option under "Compute Platform")
* Install other dependencies:
```
conda install -c conda-forge pyvcf
pip install gffutils biopython pandas pyfastx
```
* Prerequisites: Python 3.8 or higher
* Poetry: See https://python-poetry.org/docs/#installation
* Install Pangolin:
```
git clone https://github.com/tkzeng/Pangolin.git
git clone https://github.com/invitae/Pangolin.git
cd Pangolin
pip install .
poetry install
```
* Activate the environment:
```
poetry shell
```

### Usage (command-line)
@@ -52,13 +50,13 @@ See below for information on usage and local installation.
```
See full options below:
```
usage: pangolin [-h] [-c COLUMN_IDS] [-m {False,True}] [-s SCORE_CUTOFF] [-d DISTANCE] variant_file reference_file annotation_file output_file
usage: pangolin [-h] [-c COLUMN_IDS] [-m {False,True}] [-s SCORE_CUTOFF] [-d DISTANCE] [-b BATCH_SIZE] [-v] variant_file reference_file annotation_file output_file
positional arguments:
variant_file VCF or CSV file with a header (see COLUMN_IDS option).
reference_file FASTA file containing a reference genome sequence.
annotation_file gffutils database file. Can be generated using create_db.py.
output_file Prefix for output file. Will be a VCF/CSV if variant_file is VCF/CSV.
output_file Name of output file.
optional arguments:
-h, --help show this help message and exit
@@ -70,12 +68,44 @@ See below for information on usage and local installation.
Output all sites with absolute predicted change in score >= cutoff, instead of only the maximum loss/gain sites.
-d DISTANCE, --distance DISTANCE
Number of bases on either side of the variant for which splice scores should be calculated. (Default: 50)
-b BATCH_SIZE, --batch_size BATCH_SIZE
Number of variants to batch together (Default: 0). Use this to improve GPU optimization
-v, --verbose Enable additional debugging output
--enable_gtf_cache Enable caching of GTF database into memory
```
### Usage (custom)
See `scripts/custom_usage.py`
### Batching Support
Invitae added batching support in April 2023 to improve GPU utilization. Variants are read in batches, grouped into collections by tensor size, and then run through the GPU in larger batches.
After all batches have run, the results are reassembled in the original input order and written to disk. You can control the batching via the `-b` parameter documented above.
![Batching](docs/Pangolin_Batching_Indexing.png)
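The grouping-and-reordering scheme described above can be sketched as follows. This is an illustrative simplification, not Pangolin's actual implementation: the `tensor_len` field and the `predict` callback are hypothetical stand-ins for the real encoded-variant tensors and model call.

```python
# Sketch of the batching scheme: group variants by tensor size so each GPU
# batch has a uniform shape, then restore results to the original input order.
from collections import defaultdict

def batch_by_size(variants, batch_size):
    """Group (index, variant) pairs by tensor size, yield fixed-size batches."""
    groups = defaultdict(list)
    for idx, var in enumerate(variants):
        groups[var["tensor_len"]].append((idx, var))
    for items in groups.values():
        for i in range(0, len(items), batch_size):
            yield items[i : i + batch_size]

def run_batched(variants, batch_size, predict):
    """Run `predict` over size-uniform batches; return results in input order."""
    results = [None] * len(variants)
    for batch in batch_by_size(variants, batch_size):
        indices = [idx for idx, _ in batch]
        scores = predict([var for _, var in batch])  # one model call per batch
        for idx, score in zip(indices, scores):
            results[idx] = score
    return results

# Toy predict: echo each variant's tensor size as its "score".
variants = [{"tensor_len": 8}, {"tensor_len": 4}, {"tensor_len": 8}, {"tensor_len": 4}]
scores = run_batched(variants, batch_size=2, predict=lambda b: [v["tensor_len"] for v in b])
print(scores)  # original order preserved: [8, 4, 8, 4]
```

Note that even though same-size variants are scored together out of order, indexing by the original position makes the reordering transparent to the caller.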
### GTF DB Caching
When running a large batch of variants, you can gain additional performance by caching the GTF database in memory.
Enable this behavior with `--enable_gtf_cache`. When enabled, the SQLite database is loaded into memory as
interval trees of gene information, allowing quick lookups without hitting the disk.
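The idea behind the cache can be sketched as below. This is an assumption-laden simplification, not Pangolin's actual code: `GeneIntervalCache` and its record format are hypothetical, and a per-chromosome list scan stands in for a real interval tree, which would handle many overlapping intervals more efficiently.

```python
# Sketch of in-memory gene-interval caching: read gene records from the
# annotation database once, then answer per-variant lookups without disk I/O.
from collections import defaultdict

class GeneIntervalCache:
    def __init__(self, records):
        # records: iterable of (chrom, start, end, gene_id) rows, e.g. loaded
        # once from the gffutils SQLite database at startup.
        self._by_chrom = defaultdict(list)
        for chrom, start, end, gene_id in records:
            self._by_chrom[chrom].append((start, end, gene_id))

    def genes_at(self, chrom, pos):
        """Return IDs of genes whose interval contains `pos` (no disk access)."""
        return [g for start, end, g in self._by_chrom[chrom] if start <= pos <= end]

cache = GeneIntervalCache([
    ("chr19", 100, 500, "GENE_A"),
    ("chr19", 400, 900, "GENE_B"),
])
print(cache.genes_at("chr19", 450))  # overlapping genes: ['GENE_A', 'GENE_B']
```

The win comes from paying the database-read cost once up front, which only amortizes when many variants are scored, hence the recommendation to enable it for large batches.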
## Testing
Unit tests are available that run small-scale sets of predictions using data from chromosome 19; see the tests for
details on how the data was generated.
```
poetry run pytest
```
Testing with coverage
```
poetry run coverage run --source=pangolin -m pytest && poetry run coverage report -m
```
### Citation
If you use Pangolin, please cite:
Binary file added docs/Pangolin_Batching_Indexing.png
Binary file added docs/Pangolin_Batching_Overview.png
257 changes: 0 additions & 257 deletions pangolin/.fuse_hidden0000252700000002

This file was deleted.


