Skip to content

Automate unsupervised machine learning cell type identification using both protein expressions and cell spatial neighborhood information for multiplexed in situ imaging data. No training dataset with cell type labels is required.

License

Notifications You must be signed in to change notification settings

dan11mcguire/CELESTA

 
 

Repository files navigation

CELESTA

Overview

CELESTA (CELl typE identification with SpaTiAl information) is an algorithm aiming to perform automate cell type identification for multiplexed in situ imaging data. CELESTA makes use of both protein expressions and cell spatial neighborhood information from segmented imaging data for the cell type identification.

The pre-saved imaging data is taken from reg009 of the published CODEX data Schurch et al. Cell,2020 for illustration purpose.

  • CreateCelestaObject() Creates an object running CELESTA. It requires a title to create the project, segmented imaging data file and prior knowledge file for cell-type signature matrix (user-defined).
  • FilterCells() This step intends to fill out questionable cells due to imaging artifacts, segmentation error etc.
  • PlotExpProb() This function plots the calculated expression probabilities for each marker included in the user-defined prior cell-type signature matrix. It can be used to visualize and help with setting the thresholds for whether a marker is expressed or not.
  • AssignCells() This is the main function to assign cell types with an iterative EM algorithm.
  • PlotCellsAnyCombination() This function can be used to plot the cells with identified cell types with the XY coordinates from segmentation.

Installation

You can install the development version of CELESTA

# install.packages("devtools")
devtools::install_github("plevritis/CELESTA")

Dependency

CELESTA requires dependency on the following R packages: - Rmixmod: for performing Gaussian Mixture Modeling - spdep: for obtaining spatial neighborhood information - zeallot: for R code styling. Provides a %<-% operator to perform multiple, unpacking, and destructuring assignment in R. - ggplot2 reshape2: for plotting

Usage

library(CELESTA)
library(Rmixmod)
library(spdep)
library(ggplot2)
library(reshape2)
library(zeallot)

### The pre-saved imaging data is taken from reg009 of the published CODEX data Schurch et al. Cell,2020
### Create CELESTA object. It requires a title for the project. 
### It also required the segmented input file and user-defined cell-type signature matrix.
### Please refer to the Inputs session below.
CelestaObj <- CreateCelestaObject(project_title = "project_title",prior_marker_info,imaging_data)

### Filter out questionable cells. 
### A cell with every marker having expression probability higher than 0.9 are filtered out. 
### And A cell with every marker having expression probability lower than 0.4 are filtered out. 
### User can define the thresholds based on inspecting their data. 
### **This step is optional.** We suggest starting without running this step to see whether there are many doublets/triplets.
CelestaObj <- FilterCells(CelestaObj,high_marker_threshold=0.9, low_marker_threshold=0.4)

### Assign cell types. 
### max_iteration is used to define the maximum iterations allowed in the EM algorithm per round. 
### cell_change_threshold is a user-defined ending condition for the EM algorithm. 
### For example, 0.01 means that when fewer than 1% of the total number of cells do not change identity, the algorithm will stop.
CelestaObj <- AssignCells(CelestaObj,max_iteration=10,cell_change_threshold=0.01,
                          high_expression_threshold_anchor=high_marker_threshold_anchor,
                          low_expression_threshold_anchor=low_marker_threshold_anchor,
                          high_expression_threshold_index=high_marker_threshold_iteration,
                          low_expression_threshold_index=low_marker_threshold_iteration)

### After the AssignCells() function, the CELESTA assigned cell types will be stored in the CelestaObj
### in the field called final_cell_type_assignment with each row corresponding to a cell. 
### The final_cell_type_assignment has assignment for each round stored in each column, the final 
### cell types and the corresponding cell type number corresponding to the cell type specified in 
### the cell-type signature matrix (please see Input section below).

### Plot cells with CELESTA assigned cell types.
### The cell_number_to_use corresponds to the defined numbers in the prior cell-type signature matrix.
### For example, 1 corresponds to endothelial cell, 2 corresponds to tumor cell.
### The program will plot the corresponding cell types given in the "cell_number_to_use" parameter.
### To plot the "unknown" cells that are left unassigned by CELESTA, include 0 in the list.
### The default color for unknown cells is gray.
### The size of the cells plotted can be modified by changing the parameter test_size.
PlotCellsAnyCombination(cell_type_assignment_to_plot=CelestaObj@final_cell_type_assignment[,(CelestaObj@total_rounds+1)],
                        coords = CelestaObj@coords,
                        prior_info = prior_marker_info,
                        cell_number_to_use=c(1,2,3),
                        cell_type_colors=c("yellow","red","blue"),
                        test_size=1)

### To include unknown cells
PlotCellsAnyCombination(cell_type_assignment_to_plot=CelestaObj@final_cell_type_assignment[,(CelestaObj@total_rounds+1)],
                        coords = CelestaObj@coords,
                        prior_info = prior_marker_info,
                        cell_number_to_use=c(0,1,2,3),cell_type_colors=c("yellow","red","blue"))

### plot expression probability
PlotExpProb(coords=CelestaObj@coords,
            marker_exp_prob=CelestaObj@marker_exp_prob,
            prior_marker_info = prior_marker_info,
            save_plot = TRUE)

Inputs

CELESTA requires two inputs:
1. Segmented imaging data:
a dataframe with rows as the cells, and needs to have (1) two columns named X and Y to define the XY coordinates of the cells and (2) other columns having the protein marker expressions for each cell

Below is an example of the segmented imaging file header

An example of the segmented imaging file example

2. User-defined cell-type signature matrix:
(1) The first column has to contain the cell types to be inferred
(2) The second column has the lineage information for each cell type. The lineage information has three numbers connected by “_” (underscore). The first number indicates round.Cell types with the same lineage level are inferred at the same round. Increasing number indicates increase cell-type resolution. For example, immune cells -> CD3+ T cells –> CD4+ T cells. The third number is a number assigned to the cell type, i.e, cell type number. The middle number tells the previous lineage cell type number for the current cell type. For example, the middle number for CD3+ T cells is 5, because it is a subtype of immune cells which have cell type number assigned to 5.
(3) Starting from column three, each column is a protein marker. If the protein marker is known to be expressed for that cell type, then it is denoted by “1”. If the protein marker is known to not express for a cell type, then it is denoted by “0”. If the protein marker is irrelevant or uncertain to express for a cell type, then it is denoted by “NA”.
(4) More examples of the user-defined cell-type signature matrix is provided under folder:data.

Below is an example of cell-type signature matrix based on imaging panel used in Schurch et al. Cell, 2020.

An example of cell-type signature matrix based on imaging panel used in Schurch et al. Cell, 2020

Outputs

CELESTA outputs: 1. After running AssignCells() function, CELESTA will output a .csv file with the cell type assignment to each cell for each round and the final combined cell types.
2. In addition, users can access the results in the CELESTA object under the slot “final_cell_type_assignment”. The anchor cells defined for each round can be found under the slot “anchor_cell_type_assignment”.

CELESTA can also plot the assigned cells by using the PlotCellsAnyCombination() function. An example output image is shown below: An example plot of assigned cell types Users can compare the output with the original images. An example is shown below:
Please note: CODEX images preprocess with Akoya Biosciences software stitched the image tiles in a flipped way. So in some cases, for the comparisons, the image needs to be flipped.
An example of comparison

How to define thresholds

In the AssignCells() function, it requires four vectors to define the high and low thresholds for each cell type. The length of the vector equals to the total number of cell types defined in the cell-type signature matrix.Examples of the thresholds are provided under the folder:data.
We would suggest start with the default thresholds and modify them by comparing the results with the original staining demonstrated below.
The two vectors are required for defining the “high_expression_threshold”, one for anchor cells and one for index cells(non-anchor cells). The thresholds defined how much the marker expression probability is in order to be considered as expressed. An example for defining high_expression_threshold is shown below: An example of high marker threshold
You can also specify the threholds using:

CelestaObj <- AssignCells(CelestaObj,max_iteration=10,cell_change_threshold=0.01,
                          high_expression_threshold_anchor=c(0.7,0.7,0.7,0.7,0.7,0.8,0.9,0.9),
                          low_expression_threshold_anchor=c(1,1,1,1,1,1,1,1),
                          high_expression_threshold_index=c(0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5),
                          low_expression_threshold_index=c(1,1,1,1,1,1,1,1))

The length of the vectors for the thresholds correspond to the number of cell types. The order of the thresholds correpond to the same order in the defined cell-type signature matrix.

To find the proper threshold, the PlotExpProb() function can be applied. Because the segmented data may have some compensation in the values which are the inputs to CELESTA, the expression probabilities are calculated based on the segmented data. It’s useful to compare the expression probabilities with the CODEX staining for each marker.
For example, for endothelial cells, if we plot the expression probabilities of CD31 (left) and compare with the CD31 staining, approximately 0.9 and 0.8 would be the right threshold for defining how much the cell should express CD31. Please note: It is suggested that for anchor cells, use a slightly higher threshold than index cells.
An example of CD31 Another example, for tumor cells, if we plot the expression probabilities of Cytokerain (left) and compare with the Cytokeratin staining, approximately 0.9 and 0.8 would be the right threshold for defining how much the cell should express Cytokeratin. Please note: It is suggested that for anchor cells, use a slightly higher threshold than index cells. An example of CK

The two vectors are required for defining the “low_marker_threshold”, one for anchor cells and one for index cells. The thresholds defined how much the marker expression probability is in order to be considered as not expressed. Normally 1 is assigned to this value unless there are a lot of doublets or co-staining in the data. The Low expression threshold default values in general are robust, and thus we recommend testing the High expression threshold values.

An example for defining low_marker_threshold is shown below:

An example of low marker threshold

Getting help

If you encounter a clear bug, please file an issue with a minimal reproducible example on GitHub. For questions and other discussion, please use community.rstudio.com.

About

Automate unsupervised machine learning cell type identification using both protein expressions and cell spatial neighborhood information for multiplexed in situ imaging data. No training dataset with cell type labels is required.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • R 100.0%