Genomics datastructures using Apache Arrow

disaggr/ArrowSAM

ArrowSAM

ArrowSAM is an in-memory Sequence Alignment/Map (SAM) representation. It uses the Apache Arrow framework (a cross-language development platform for in-memory data) and the Plasma shared-memory object store to store and process SAM data in columnar form, entirely in memory.
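
To make "columnar" concrete, here is a minimal sketch that turns row-oriented SAM alignment lines into one list per field. It uses plain Python lists only. It does not use Apache Arrow or Plasma, and it is not the ArrowSAM schema; in an Arrow-based layout each list would roughly correspond to a typed array. The function name `sam_to_columns` and the two sample records are made up for illustration.

```python
from typing import Dict, List

# The 11 mandatory SAM fields, in the order given by the SAM specification.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
# Mandatory fields that the SAM specification defines as integers.
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def sam_to_columns(lines: List[str]) -> Dict[str, list]:
    """Turn row-oriented SAM alignment lines into one list per field.

    Hypothetical helper for illustration; optional tags are ignored.
    """
    columns: Dict[str, list] = {name: [] for name in SAM_FIELDS}
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        parts = line.rstrip("\n").split("\t")
        for name, value in zip(SAM_FIELDS, parts[:11]):
            columns[name].append(int(value) if name in INT_FIELDS else value)
    return columns


# Two made-up alignment records plus a header line, for illustration only.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t99\tchr1\t100\t60\t4M\t=\t150\t54\tACGT\tIIII",
    "read2\t147\tchr1\t150\t60\t4M\t=\t100\t-54\tTTGA\tIIII",
]
cols = sam_to_columns(example)
print(cols["POS"])  # [100, 150]
```

With this layout, a tool that only needs positions or flags can scan a single contiguous column instead of parsing every full record.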

Citing ArrowSAM

The following papers describe the ArrowSAM format and its use to speed up genomics pipelines. If you use ArrowSAM in your work, please cite them.

Ahmad et al., (2020). "ArrowSAM: In-Memory Genomics Data Processing Using Apache Arrow", ICCAIS. https://doi.org/10.1109/ICCAIS48893.2020.9096725

Ahmad et al., "Optimizing performance of GATK workflows using Apache Arrow In-Memory data framework", BMC Genomics, presented at APBC2020. https://doi.org/10.1186/s12864-020-07013-y

This repo contains the following three components:

  1. BWA-MEM, Picard and GATK tools with ArrowSAM (the in-memory SAM data representation) integrated.

  2. A Singularity container definition file (to create an environment with all the Apache Arrow related tools and libraries that ArrowSAM needs).

  3. Scripts to run the workflows recommended by the GATK best practices. They use different in-memory data placement techniques (ArrowSAM, ramDisk and pipes) for fast processing, so that a complete DNA analysis pipeline runs efficiently.

Note: ArrowSAM and all other workflows target single-node, multi-core machines.

How to run

  1. Install the Singularity container platform.

  2. Download our Singularity script and generate a Singularity image from it (this image contains all the Arrow-related packages needed to build and compile BWA-MEM, Picard and GATK).

  3. Enter the generated image with the command:

     sudo singularity shell <image_name>.simg
    
  4. Download BWA-MEM inside the image:

     git clone https://github.com/tahashmi/bwa.git
    
  5. Go into the bwa directory and compile BWA-MEM:

     cd bwa
     make
    
  6. Now you can run BWA-MEM.
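
As a rough sketch of that last step, the snippet below builds a `bwa mem` command line from Python. It assumes the fork keeps upstream BWA's documented interface (`bwa mem [-t threads] ref.fa reads1.fq [reads2.fq]`), which this README does not confirm. The fork may also need extra setup, such as a running Plasma store. The helper name `build_bwa_mem_cmd` and the file names `ref.fa`, `reads_1.fq` and `reads_2.fq` are hypothetical.

```python
import subprocess
from typing import List, Optional


def build_bwa_mem_cmd(reference: str, reads1: str,
                      reads2: Optional[str] = None,
                      threads: int = 1,
                      bwa_path: str = "./bwa") -> List[str]:
    """Build an argument list in the style of upstream `bwa mem`.

    Assumption: the ArrowSAM fork accepts the same arguments as upstream BWA.
    """
    cmd = [bwa_path, "mem", "-t", str(threads), reference, reads1]
    if reads2 is not None:
        cmd.append(reads2)  # second FASTQ file for paired-end reads
    return cmd


def run_bwa_mem(cmd: List[str]) -> subprocess.CompletedProcess:
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(cmd, check=True)


# Placeholder file names; the reference must already be indexed (bwa index).
cmd = build_bwa_mem_cmd("ref.fa", "reads_1.fq", "reads_2.fq", threads=4)
print(" ".join(cmd))  # ./bwa mem -t 4 ref.fa reads_1.fq reads_2.fq
```

Calling `run_bwa_mem(cmd)` from inside the `bwa` directory would then start the alignment, provided the binary was compiled in step 5 and the input files exist.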
