Apply diffusion models to deconvolute highly multiplexed DIA-MS/MS data by conditioning on MS1 signals to generate cleaner MS2 data for downstream analysis.
As biological analysis machines and methodologies become more sophisticated and capable of handling more complex samples, the data they output also become more complicated to analyze. Modern generative machine learning techniques such as diffusion and score-based modeling have been used with great success in the domains of image, video, text, and audio data.
We aim to apply the same principles to highly multiplexed biological signals, leveraging the ability of generative models to learn the underlying distribution of the data rather than only the decision boundaries that discriminative methods learn. Specifically, we apply diffusion models to signal denoising: the deconvolution of highly multiplexed DIA-MS/MS data.
DIA-MS/MS produces two types of data: MS1 and MS2. MS1 data record the mass-to-charge ratio and chromatographic elution time of entire peptides as they are analyzed. MS2 data record the same information for the set of peptide fragments belonging to the MS1 peptides, with all fragments written onto the same data map. As a result, although MS1 and MS2 data are correlated, the MS2 data can be highly multiplexed, containing signals from multiple MS1 peptides.
Our project aims to train a diffusion model and condition it on MS1 data to deconvolute the corresponding MS2 signal, effectively simulating the case where the MS1 scan captured fewer peptides in its analysis window, producing cleaner MS2 data. This would be extremely useful for downstream analysis, identification, and quantification tasks.
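To make the diffusion idea concrete, here is a minimal NumPy sketch of the standard forward noising process and a deterministic DDIM update. It is an illustration under assumptions, not the repository's implementation: the linear beta schedule, the function names, and the array shapes are all hypothetical, and the true noise stands in for the trained network that would predict it from the noisy MS2 map, the timestep, and the MS1 condition.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Forward process: x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

def ddim_step(x_t, eps_pred, t, t_prev, alpha_bar):
    """Deterministic DDIM update (eta = 0) from step t to step t_prev.

    t_prev = -1 means "go all the way back to the clean signal".
    """
    a_t = alpha_bar[t]
    a_prev = alpha_bar[t_prev] if t_prev >= 0 else 1.0
    # Estimate of the clean signal implied by the predicted noise.
    x0_pred = (x_t - np.sqrt(1.0 - a_t) * eps_pred) / np.sqrt(a_t)
    return np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps_pred

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
clean_ms2 = rng.random((8, 16))  # hypothetical (elution time, m/z) map
noisy_ms2, eps = add_noise(clean_ms2, 500, alpha_bar, rng)

# A trained model would predict eps from (noisy_ms2, t, ms1_condition).
# Here the true eps is used as a stand-in for a perfect prediction.
recovered = ddim_step(noisy_ms2, eps, 500, -1, alpha_bar)
```

With a perfect noise prediction, a single step back to the clean signal reproduces clean_ms2 up to floating-point error. In practice the prediction is imperfect, so sampling usually proceeds over many smaller steps.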
We currently have access to a set of clean MS2 data, which we plan to use to generate synthetic multiplexed MS2 data. The corresponding clean MS1 data then serve as the conditioning signal for re-extracting the clean MS2. This should be an effective proof of concept for diffusion-based denoising of biological signal data.
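One simple way to build such synthetic data is to sum several clean MS2 maps into a single mixed map. The sketch below assumes each clean map is a 2D array on a shared (elution time, m/z) grid. The function name make_multiplexed, the shapes, and the optional weights are hypothetical, and the repository may construct its mixtures differently.

```python
import numpy as np

def make_multiplexed(clean_ms2_maps, indices, weights=None):
    """Sum selected clean MS2 maps into one synthetic multiplexed map.

    clean_ms2_maps: array of shape (n_maps, n_rt, n_mz)
    indices: which clean maps to mix together
    weights: optional per-map intensity scaling (defaults to 1 for each map)
    """
    selected = clean_ms2_maps[indices]  # shape (k, n_rt, n_mz)
    if weights is None:
        weights = np.ones(len(indices))
    # Weighted sum over the first axis gives shape (n_rt, n_mz).
    return np.tensordot(np.asarray(weights, dtype=float), selected, axes=1)

rng = np.random.default_rng(0)
clean = rng.random((5, 4, 6))  # five hypothetical clean MS2 maps
mixed = make_multiplexed(clean, [0, 2, 3])

# A training pair would then be (mixed, MS1 map for peptide 0) -> clean[0].
target = clean[0]
```

Because the mixture is built from known clean maps, each synthetic example comes with an exact ground-truth target for the deconvolution.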
First, clone the repository to your local machine:
git clone git@github.com:hackbio-ca/diffusion-deconvolution-dia-msms.git
cd diffusion-deconvolution-dia-msms
It's recommended to use a virtual environment. You can create one using Python's built-in venv module:
python3 -m venv dquartic
source dquartic/bin/activate
Alternatively, you can use conda:
conda create -n dquartic python=3.9
conda activate dquartic
To install the library and its dependencies, run:
pip install .
The library has a CLI for training the diffusion model.
$ dquartic train --help
Usage: dquartic train [OPTIONS]

  Train a DDIM model on the DIAMS dataset.

Options:
  --epochs INTEGER         Number of epochs to train
  --warmup-epochs INTEGER  Number of warmup epochs for learning rate scheduler
  --batch-size INTEGER     Batch size for training
  --learning-rate FLOAT    Learning rate for optimizer
  --hidden-dim INTEGER     Hidden dimension for the model
  --num-heads INTEGER      Number of attention heads
  --num-layers INTEGER     Number of transformer layers
  --normalize TEXT         Normalization method. (None, minmax)
  --ms2-data-path TEXT     Path to MS2 data
  --ms1-data-path TEXT     Path to MS1 data
  --checkpoint-path TEXT   Path to save the best model
  --use-wandb              Enable Weights & Biases logging
  --threads INTEGER        Number of threads for data loading
  --help                   Show this message and exit.
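The --normalize option lists minmax as a normalization method. As a rough illustration of what min-max scaling does to an intensity map, the sketch below rescales an array to the range [0, 1] and keeps the bounds so the scaling can be undone. This is a generic sketch, not the library's code: the function names are hypothetical, and whether the library normalizes per sample or across the whole dataset is not stated here.

```python
import numpy as np

def minmax_normalize(x, eps=1e-12):
    """Scale an array to [0, 1]; also return the bounds needed to undo it."""
    x_min, x_max = float(x.min()), float(x.max())
    # eps guards against division by zero when the array is constant.
    scaled = (x - x_min) / max(x_max - x_min, eps)
    return scaled, (x_min, x_max)

def minmax_denormalize(scaled, bounds, eps=1e-12):
    """Invert minmax_normalize using the stored bounds."""
    x_min, x_max = bounds
    return scaled * max(x_max - x_min, eps) + x_min

intensities = np.array([[2.0, 4.0], [6.0, 10.0]])
scaled, bounds = minmax_normalize(intensities)
restored = minmax_denormalize(scaled, bounds)
```

Keeping the bounds matters for downstream quantification, since the model's output can be mapped back to the original intensity scale.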
An example bash script, run_trainer.sh, runs a training example and can be submitted as a SLURM job:
sbatch --job-name=myjob \
--output=myjob_%j.out \
--error=myjob_%j.err \
--time=10:00:00 \
--ntasks=1 \
--gres=gpu:1 \
--cpus-per-task=4 \
--mem=16G \
run_trainer.sh
Contributions are welcome! If you'd like to contribute, please open an issue or submit a pull request. See the contribution guidelines for more information.
If you have any issues or need help, please open an issue or contact the project maintainers.
This project is licensed under the BSD-3 License.