
Merge pull request #1 from invitae/batching-support
feat: Batching Support and Improved Performance
kazmiekr authored May 9, 2023
2 parents 1b18e78 + 6c9db04 commit a97f163
Showing 31 changed files with 2,680 additions and 580 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,2 +1,4 @@
GRCh37.primary_assembly.genome.fa.gz
GRCh37.primary_assembly.genome.fa.gz.fxi
pangolin/__pycache__
tests/__pycache__
54 changes: 42 additions & 12 deletions README.md
@@ -7,19 +7,17 @@ Pangolin can be run on Google Colab, which provides free access to GPUs and other
See below for information on usage and local installation.

### Installation
* Prerequisites: Python 3.6 or higher and conda, which can both be installed using Miniconda: https://docs.conda.io/en/latest/miniconda.html
* Install PyTorch: https://pytorch.org/get-started/locally/
* If a supported GPU is available, installation with GPU support is recommended (choose an option under "Compute Platform")
* Install other dependencies:
```
conda install -c conda-forge pyvcf
pip install gffutils biopython pandas pyfastx
```
* Prerequisites: Python 3.8 or higher
* Poetry: See https://python-poetry.org/docs/#installation
* Install Pangolin:
```
git clone https://github.com/tkzeng/Pangolin.git
git clone https://github.com/invitae/Pangolin.git
cd Pangolin
pip install .
poetry install
```
* Activate the environment:
```
poetry shell
```

### Usage (command-line)
@@ -52,13 +50,13 @@ See below for information on usage and local installation.
```
See full options below:
```
usage: pangolin [-h] [-c COLUMN_IDS] [-m {False,True}] [-s SCORE_CUTOFF] [-d DISTANCE] variant_file reference_file annotation_file output_file
usage: pangolin [-h] [-c COLUMN_IDS] [-m {False,True}] [-s SCORE_CUTOFF] [-d DISTANCE] [-b BATCH_SIZE] [-v] variant_file reference_file annotation_file output_file
positional arguments:
variant_file VCF or CSV file with a header (see COLUMN_IDS option).
reference_file FASTA file containing a reference genome sequence.
annotation_file gffutils database file. Can be generated using create_db.py.
output_file Prefix for output file. Will be a VCF/CSV if variant_file is VCF/CSV.
output_file Name of output file.
optional arguments:
-h, --help show this help message and exit
@@ -70,12 +68,44 @@ See below for information on usage and local installation.
Output all sites with absolute predicted change in score >= cutoff, instead of only the maximum loss/gain sites.
-d DISTANCE, --distance DISTANCE
Number of bases on either side of the variant for which splice scores should be calculated. (Default: 50)
-b BATCH_SIZE, --batch_size BATCH_SIZE
Number of variants to batch together (Default: 0). Use this to improve GPU optimization
-v, --verbose Enable additional debugging output
--enable_gtf_cache Enable caching of GTF database into memory
```
### Usage (custom)
See `scripts/custom_usage.py`
### Batching Support
Invitae added batching support in April 2023 to improve GPU utilization. Variants are read in batches, grouped into collections by tensor size, and then run through the GPU in larger batches.
After all batches have run, the results are reassembled in the original input order and written to disk. You can control the batching via the `-b` parameter documented above.
![Batching](docs/Pangolin_Batching_Indexing.png)
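The grouping-and-reordering scheme described above can be sketched as follows. This is an illustrative simplification, not Pangolin's actual implementation: the `tensor_len` field and the `predict` callback are hypothetical stand-ins for the real encoded-variant tensors and model call.

```python
# Sketch of the batching scheme: group variants by tensor size so each GPU
# batch has a uniform shape, then restore results to the original input order.
from collections import defaultdict

def batch_by_size(variants, batch_size):
    """Group (index, variant) pairs by tensor size, yield fixed-size batches."""
    groups = defaultdict(list)
    for idx, var in enumerate(variants):
        groups[var["tensor_len"]].append((idx, var))
    for items in groups.values():
        for i in range(0, len(items), batch_size):
            yield items[i : i + batch_size]

def run_batched(variants, batch_size, predict):
    """Run `predict` over size-uniform batches; return results in input order."""
    results = [None] * len(variants)
    for batch in batch_by_size(variants, batch_size):
        indices = [idx for idx, _ in batch]
        scores = predict([var for _, var in batch])  # one model call per batch
        for idx, score in zip(indices, scores):
            results[idx] = score
    return results

# Toy predict: echo each variant's tensor size as its "score".
variants = [{"tensor_len": 8}, {"tensor_len": 4}, {"tensor_len": 8}, {"tensor_len": 4}]
scores = run_batched(variants, batch_size=2, predict=lambda b: [v["tensor_len"] for v in b])
print(scores)  # original order preserved: [8, 4, 8, 4]
```

Note that even though same-size variants are scored together out of order, indexing by the original position makes the reordering transparent to the caller.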
### GTF DB Caching
When running a large batch of variants, you can gain additional performance by caching the GTF database in memory.
Enable this behavior with `--enable_gtf_cache`. When enabled, the SQLite database is loaded into memory as
interval trees of gene information, allowing quick lookups without hitting the disk.
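The idea behind the cache can be sketched as below. This is an assumption-laden simplification, not Pangolin's actual code: `GeneIntervalCache` and its record format are hypothetical, and a per-chromosome list scan stands in for a real interval tree, which would handle many overlapping intervals more efficiently.

```python
# Sketch of in-memory gene-interval caching: read gene records from the
# annotation database once, then answer per-variant lookups without disk I/O.
from collections import defaultdict

class GeneIntervalCache:
    def __init__(self, records):
        # records: iterable of (chrom, start, end, gene_id) rows, e.g. loaded
        # once from the gffutils SQLite database at startup.
        self._by_chrom = defaultdict(list)
        for chrom, start, end, gene_id in records:
            self._by_chrom[chrom].append((start, end, gene_id))

    def genes_at(self, chrom, pos):
        """Return IDs of genes whose interval contains `pos` (no disk access)."""
        return [g for start, end, g in self._by_chrom[chrom] if start <= pos <= end]

cache = GeneIntervalCache([
    ("chr19", 100, 500, "GENE_A"),
    ("chr19", 400, 900, "GENE_B"),
])
print(cache.genes_at("chr19", 450))  # overlapping genes: ['GENE_A', 'GENE_B']
```

The win comes from paying the database-read cost once up front, which only amortizes when many variants are scored, hence the recommendation to enable it for large batches.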
## Testing
Unit tests are available that run small-scale sets of predictions using data from chromosome 19; see the tests for
details on how the data was generated.
```
poetry run pytest
```
Testing with coverage
```
poetry run coverage run --source=pangolin -m pytest && poetry run coverage report -m
```
### Citation
If you use Pangolin, please cite:
Binary file added docs/Pangolin_Batching_Indexing.png
Binary file added docs/Pangolin_Batching_Overview.png
257 changes: 0 additions & 257 deletions pangolin/.fuse_hidden0000252700000002

This file was deleted.


