Bioinformatics model for protein therapeutics

NOTE

This is not a Resilience project and the code/opinions/recommendations are my personal work and mine alone.

Synopsis

We'll use the Therapeutics Data Commons (TDC) Python package to download open-source (CC BY 4.0) datasets that are meaningful in pharmaceutical research. In this repository, we use the TCR-Epitope Binding Affinity dataset. The code is in the notebook notebooks/tdc-tcr-epitope-binding-affinity-model.ipynb.

TCR-epitope binding

We show how to create a deep learning model that predicts whether a T-cell receptor (TCR) and a protein epitope will bind to each other. A model that can predict how well a TCR binds to an epitope can lead to more effective immunotherapy treatments. For example, in anti-cancer therapies the T-cell receptor must bind to the protein marker on the cancer cell so that the T-cell (actually the T-cell's friends in the immune system) can kill the cancer cell.

Hugging Face provides a "one-stop shop" of Python libraries and hosted models for training and deploying AI models. In this case, we use Hugging Face to get a pre-trained version of ESM-2, the open-source Evolutionary Scale Modeling protein language model from Facebook (now Meta). This model turns a protein sequence into a vector of numbers that the computer can use in a mathematical model. The vector of numbers encodes (aka embeds) a protein sequence in much the same way that the Dewey Decimal System and ISBN encode a book into a set of numbers (and letters). The space these vectors live in is also referred to as a latent space.
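As a minimal sketch of the embedding step, the snippet below assumes the transformers and torch packages are installed and uses the smallest ESM-2 checkpoint, facebook/esm2_t6_8M_UR50D, to keep the download small. The helper names mean_pool and embed_sequence are hypothetical, and averaging the per-token vectors into one vector is an assumption; the notebook may reduce them differently.

```python
def mean_pool(token_vectors, attention_mask):
    """Average per-token vectors, skipping positions whose mask value is 0."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def embed_sequence(sequence, checkpoint="facebook/esm2_t6_8M_UR50D"):
    """Return one fixed-length embedding (a list of floats) for a protein sequence.

    Assumption: mean pooling over all unmasked tokens, which includes the
    special start/end tokens that the ESM tokenizer adds.
    """
    # Imported lazily so the pooling helper above works without these packages.
    import torch
    from transformers import AutoTokenizer, EsmModel

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = EsmModel.from_pretrained(checkpoint)
    model.eval()

    inputs = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # last_hidden_state has shape (batch, tokens, hidden_size); take the single item.
    token_vectors = outputs.last_hidden_state[0].tolist()
    attention_mask = inputs["attention_mask"][0].tolist()
    return mean_pool(token_vectors, attention_mask)


# The pooling helper on toy data: the third "token" is padding and is ignored.
toy_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]
pooled = mean_pool(toy_vectors, toy_mask)
print(pooled)
```

Calling embed_sequence on an amino-acid string downloads the checkpoint on first use and returns a list of floats whose length equals the model's hidden size. Larger checkpoints give longer vectors at a higher compute cost.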

Then, we'll show how to combine this embedding with a simple neural network to create a binary classifier for the TCR-epitope binding affinity prediction (True=They Bind, False=They don't bind).
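To make the idea concrete, here is a framework-free sketch of the forward pass of such a classifier: the TCR and epitope embeddings are concatenated, passed through one hidden layer, and squashed to a binding probability. The layer sizes, the random untrained weights, and all function names are assumptions for illustration; the notebook's actual architecture and deep learning library may differ, and a real model would learn its weights from the training split.

```python
import math
import random


def make_layer(n_inputs, n_outputs, rng):
    """Create (weights, biases) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
               for _ in range(n_outputs)]
    biases = [0.0] * n_outputs
    return weights, biases


def dense(inputs, layer):
    """Apply a dense layer: one weighted sum plus bias per output unit."""
    weights, biases = layer
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def predict_binding(tcr_embedding, epitope_embedding,
                    hidden_layer, output_layer, threshold=0.5):
    """Return (probability, binds) for one TCR-epitope pair."""
    features = list(tcr_embedding) + list(epitope_embedding)  # concatenate
    hidden = relu(dense(features, hidden_layer))
    logit = dense(hidden, output_layer)[0]
    probability = sigmoid(logit)
    return probability, probability >= threshold


# Toy run with 8-dimensional stand-in embeddings and untrained weights.
rng = random.Random(0)
tcr = [rng.gauss(0, 1) for _ in range(8)]
epitope = [rng.gauss(0, 1) for _ in range(8)]
hidden_layer = make_layer(16, 4, rng)   # 16 = 8 (TCR) + 8 (epitope)
output_layer = make_layer(4, 1, rng)

probability, binds = predict_binding(tcr, epitope, hidden_layer, output_layer)
print(f"P(bind) = {probability:.3f}, binds = {binds}")
```

With untrained weights the probability carries no meaning; the point is only the data flow from two embeddings to a single True/False decision.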

[Figures: encoder-decoder; Dewey Decimal]

Getting the dataset

The Therapeutics Data Commons (TDC) dataset can be downloaded automatically via TDC's open-source Python library. However, computing the Evolutionary Scale Modeling (ESM-2) embedding vectors takes significant time (hours).

To save you time, I've uploaded the preprocessed data as pickle files on Zenodo. If you download those three files, the Python script will skip the embedding step.
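The download-then-cache flow described here can be sketched as follows. The TDC call uses the library's TCREpitopeBinding loader; the dataset name "weber" is an assumption based on reference 1 below. The helper names, the cache file name, and the stand-in embedding step are hypothetical, and the real file names on Zenodo are not reproduced here.

```python
import pickle
import tempfile
from pathlib import Path


def load_tdc_splits(data_dir="./data"):
    """Download the TCR-epitope dataset with TDC and return its splits."""
    # Imported lazily; requires the PyTDC package and network access.
    from tdc.multi_pred import TCREpitopeBinding

    dataset = TCREpitopeBinding(name="weber", path=data_dir)  # name is assumed
    return dataset.get_split()


def load_or_compute(cache_path, compute):
    """Load a pickled result if it exists, otherwise compute and save it.

    Only unpickle files from sources you trust: loading a pickle can
    execute arbitrary code.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as handle:
            return pickle.load(handle)

    result = compute()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as handle:
        pickle.dump(result, handle)
    return result


# Demo with a cheap stand-in for the hours-long embedding step.
calls = []


def fake_embedding_step():
    calls.append(1)
    return {"train": [[0.1, 0.2]], "valid": [], "test": []}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "embeddings.pkl"  # hypothetical file name
    first = load_or_compute(cache_file, fake_embedding_step)
    second = load_or_compute(cache_file, fake_embedding_step)

print(f"embedding step ran {len(calls)} time(s)")
```

The second call finds the pickle on disk and returns it without recomputing, which is the same shortcut the downloaded Zenodo files provide.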

Running things locally

Creating the conda environment

To install all of the required Python packages, you'll need to create a conda environment. Follow the conda website directions to download and install conda (Anaconda works too). Once you have conda installed, run the command:

conda env create -f environment.yml

Once the environment is successfully created, activate it by running:

conda activate tdc-tcr-epitope-binding-affinity-env

At this point you should be able to run the Jupyter Notebook:

jupyter notebook notebooks/tdc-tcr-epitope-binding-affinity-model.ipynb

Running things in a container

If you don't want to install conda, then you can run the Jupyter notebook from within a container.

Apptainer

To build the Apptainer image, run the command:

apptainer build tdc-tcr-epitope-binding-affinity.sif tdc-tcr-epitope-binding-affinity.def

Then, run:

apptainer shell tdc-tcr-epitope-binding-affinity.sif

At this point, you'll be able to run the Jupyter Notebook:

jupyter notebook notebooks/tdc-tcr-epitope-binding-affinity-model.ipynb

Docker

To build the Docker image, run the command:

docker build -t tdc-tcr-epitope-binding-affinity .

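Now you can start the container. If the image does not drop you into a shell or expose Jupyter on its own, docker run's -it flag (interactive terminal) and -p 8888:8888 (publish Jupyter's default port) are the usual additions: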

docker run tdc-tcr-epitope-binding-affinity

And finally you can run the Jupyter Notebook:

jupyter notebook notebooks/tdc-tcr-epitope-binding-affinity-model.ipynb

References

  1. Weber, Anna, Jannis Born, and María Rodriguez Martínez. "TITAN: T-cell receptor specificity prediction with bimodal attention networks." Bioinformatics 37.Supplement_1 (2021): i237-i244.

  2. Bagaev, Dmitry V., et al. "VDJdb in 2019: database extension, new analysis infrastructure and a T-cell receptor motif compendium." Nucleic Acids Research 48.D1 (2020): D1057-D1062.

  3. Dines, Jennifer N., et al. "The ImmuneRACE study: A prospective multicohort study of immune response action to COVID-19 events with the ImmuneCODE™ open access database." medRxiv (2020).

  4. Rives, Alexander, et al. "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences." bioRxiv 622803. https://doi.org/10.1101/622803 https://www.biorxiv.org/content/10.1101/622803v4

  5. Lin, Zeming, et al. "Evolutionary-scale prediction of atomic-level protein structure with a language model." Science (2023). https://doi.org/10.1126/science.ade2574 https://www.science.org/doi/10.1126/science.ade2574

ESM2

https://huggingface.co/facebook/esm2_t36_3B_UR50D

Checkpoint name        Number of layers   Number of parameters
esm2_t48_15B_UR50D     48                 15B
esm2_t36_3B_UR50D      36                 3B
esm2_t33_650M_UR50D    33                 650M
esm2_t30_150M_UR50D    30                 150M
esm2_t12_35M_UR50D     12                 35M
esm2_t6_8M_UR50D       6                  8M
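One way to use the table above is to pick the largest checkpoint that fits a parameter budget. The sketch below copies the table's nominal (rounded) parameter counts into a dictionary; the function name is hypothetical, and the "facebook/" prefix follows the Hugging Face link above.

```python
# (number of layers, nominal parameter count) per checkpoint, from the table.
ESM2_CHECKPOINTS = {
    "esm2_t48_15B_UR50D": (48, 15_000_000_000),
    "esm2_t36_3B_UR50D": (36, 3_000_000_000),
    "esm2_t33_650M_UR50D": (33, 650_000_000),
    "esm2_t30_150M_UR50D": (30, 150_000_000),
    "esm2_t12_35M_UR50D": (12, 35_000_000),
    "esm2_t6_8M_UR50D": (6, 8_000_000),
}


def largest_checkpoint_within(max_parameters):
    """Return the Hugging Face ID of the biggest checkpoint within the budget."""
    fitting = {
        name: params
        for name, (_layers, params) in ESM2_CHECKPOINTS.items()
        if params <= max_parameters
    }
    if not fitting:
        raise ValueError("no ESM-2 checkpoint fits within that budget")
    best = max(fitting, key=fitting.get)
    return f"facebook/{best}"


choice = largest_checkpoint_within(1_000_000_000)
print(choice)  # → facebook/esm2_t33_650M_UR50D
```

Smaller checkpoints embed sequences much faster, which matters given how long the embedding step takes; larger ones may give richer representations.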

Licenses

The TDC dataset is licensed under CC BY 4.0. The ESM-2 model is released under the MIT license.