- Workflow overview
- Choosing the n_clust argument
- Updating reference profiles
- On confidence scores
- Which genes to use
- Interpreting clustering results
- Targeted subclustering
The broad Insitutype workflow is as follows:
InSituType runs in 3 modes:
- Supervised: call only cell types defined in reference profiles. Set nclust = 0 to run in fully supervised mode.
- Unsupervised: de novo clustering, with no reference cell types. Set reference_profiles = NULL to run in unsupervised mode.
- Semi-supervised: find new clusters while also calling reference cell types.
Considerations for choosing a workflow:
- Supervised is most convenient if you are confident that your reference profiles contain all the cell types in your dataset. However, many reference profiles from scRNA-seq don't fit spatial data well, so using reference profiles can be challenging.
- Semi-supervised mode is the most powerful but most challenging workflow. We use this in >80% of analyses. Success hinges on how well the reference profiles are calibrated to spatial data. InSituType tries to perform this calibration using anchor cells, but this does not always succeed.
- We recommend trying semi-supervised cell typing first, assuming there are new clusters you expect to discover.
- Unsupervised has no difficulty with poorly-calibrated reference profiles, but it requires you to name each cluster, which can be onerous. It may also fail to define distinctions that are important to you.
Keep in mind the following when selecting reference profiles:
- Quality of scRNA-seq references varies greatly. Finding mis-annotated cell types is not uncommon, and for smaller datasets, profiles of rare cell types will be noisy. Exercise some skepticism.
- Large platform effects separate scRNA-seq and spatial platforms. When possible, use a reference from the same platform as your data.
- A large collection of single cell references can be found here: https://github.com/Nanostring-Biostats/cellprofilelibrary
- A growing collection of CosMx references is here: https://github.com/Nanostring-Biostats/CosMx-Cell-Profiles
We recommend choosing a slightly generous value of nclust, then using refineClusters to condense the resulting clusters. For example, if you're running semi-supervised cell typing and you expect to find 5 new clusters, set nclust = 8. Or for unsupervised clustering with an expectation of 12 cell types, set nclust = 16.
It's generally easy to tell when two clusters come from the same cell type: they'll be adjacent in UMAP space, and the flightpath plot will show them frequently confused with each other.
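If you want a numeric companion to the flightpath plot, one option is to average, over the cells assigned to each cluster, the posterior probability placed on every other cluster. The sketch below uses a small simulated probability matrix; `probs`, `clust`, `confusion` and `mutual` are illustrative names, not objects returned by Insitutype.

```r
# Simulated stand-in: posterior probabilities for 6 cells x 3 clusters.
probs <- matrix(c(
  0.60, 0.38, 0.02,
  0.55, 0.43, 0.02,
  0.40, 0.58, 0.02,
  0.45, 0.53, 0.02,
  0.01, 0.02, 0.97,
  0.02, 0.01, 0.97
), ncol = 3, byrow = TRUE,
dimnames = list(paste0("cell", 1:6), c("a", "b", "c")))

# Assign each cell to its most probable cluster.
clust <- colnames(probs)[max.col(probs, ties.method = "first")]

# Rows: assigned cluster. Columns: mean posterior probability given to each cluster.
confusion <- rowsum(probs, group = clust) / as.vector(table(clust))

# Symmetrize and ignore the diagonal to rank cluster pairs by mutual confusion.
mutual <- (confusion + t(confusion)) / 2
diag(mutual) <- 0
print(round(mutual, 3))
```

Here clusters "a" and "b" share far more probability mass with each other than either does with "c", which is the pattern you would expect from two clusters that split a single cell type.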
Final note: Insitutype splits big clusters with higher counts more aggressively than other clusters. For example, in a tumor study, it will subcluster tumor cells many times before it subclusters e.g. fibroblasts. The simplest solution is to increase nclust as needed, then condense the over-clustered cell type as desired.
Cell typing's biggest challenge is using a reference dataset from a different platform. Platform effects between scRNA-seq and spatial platforms can be profound. Insitutype has 3 treatments for reference profiles:
- Use the reference profile matrix as-is
- Choose anchor cells, then rescale genes based on estimated platform effects. (Less aggressive, only fits gene-level effects.)
- Choose anchor cells, then refit the reference profiles entirely. (Most aggressive, fits a new value for every gene x cell type.)
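To make the difference between the last two treatments concrete, here is a deliberately simplified base-R sketch on simulated data. It is not Insitutype's actual fitting procedure: the estimator (a ratio of row sums), the simulated platform effects and all object names are assumptions chosen for illustration only.

```r
set.seed(1)
genes <- paste0("gene", 1:5)
types <- c("Tcell", "fibroblast")

# Hypothetical reference profiles (genes x cell types).
ref <- matrix(c(10, 1, 5, 0.5, 2,
                 1, 8, 5, 4,   0.2),
              ncol = 2, dimnames = list(genes, types))

# Simulated per-gene platform effect, unknown in a real analysis.
platform_effect <- c(0.5, 2, 1, 0.25, 4)

# Simulated anchor cells: 20 per cell type, Poisson counts around ref * platform effect.
anchor_type <- rep(types, each = 20)
anchor_counts <- sapply(anchor_type, function(ct) {
  rpois(length(genes), ref[, ct] * platform_effect)
})
rownames(anchor_counts) <- genes

# Mean observed profile of the anchors, per cell type.
anchor_means <- sapply(types, function(ct) {
  rowMeans(anchor_counts[, anchor_type == ct, drop = FALSE])
})

# Gene-level rescaling: one factor per gene, shared by all cell types.
gene_scale <- rowSums(anchor_means) / rowSums(ref)
rescaled <- ref * gene_scale

# Full refit: a new value for every gene x cell type, taken from the anchors.
refit <- anchor_means
```

The rescaled matrix keeps the reference's relative expression between cell types for each gene, while the refit matrix is free to change it. That extra freedom is why the refit is the more aggressive option.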
We suggest using the below flowchart to choose from among these options:
For more on starting with a coarse reference then subclustering, see the "Targeted subclustering" discussion further on.
Insitutype returns a posterior probability for each cell type call. In practice, we have found these probabilities to be overconfident. Below is an image from the preprint demonstrating this phenomenon. For various posterior probability bins, it shows the accuracy rate actually achieved (with a confidence interval).
So calls made with 100% posterior probability appear to be accurate, but lower probabilities are overconfident. Also, remember that these probabilities are based on all the information available to the model. They don't consider that the model might be missing cell types, or that the reference profiles could be incorrect.
In short, the posterior probabilities are useful for differentiating strong from weak cell typing calls, but you should be conservative when choosing a threshold. We often use a threshold of 80%, calling cells below that confidence as "unclassified".
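Applying such a threshold is a one-liner. In this minimal sketch, `clust` and `prob` are hypothetical stand-ins for your per-cell cell type calls and their posterior probabilities, and the 0.8 cutoff is the value mentioned above.

```r
# Hypothetical per-cell calls and posterior probabilities.
clust <- c(cell1 = "Tcell", cell2 = "fibroblast", cell3 = "Tcell", cell4 = "macrophage")
prob  <- c(cell1 = 0.99,    cell2 = 0.72,         cell3 = 0.80,    cell4 = 0.55)

threshold <- 0.8

# Cells below the threshold are relabeled; cells at or above it keep their call.
final_call <- ifelse(prob < threshold, "unclassified", clust)
print(table(final_call))
```

It is worth checking how many cells each cell type loses at a given threshold before settling on it, since a cell type that is mostly low-confidence may point to a poorly fitting reference profile.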
Insitutype was designed using 1000-plex CosMx data, where we found it most powerful to use all genes in the panel. In our new 6000-plex data, it's worth considering using Insitutype on a well-chosen subset of genes. As a rule of thumb, genes should be retained if either of the following applies:
- They have solidly above-background expression in the CosMx data
- They have moderate-to-high expression in at least one reference profile
For typical 6000-plex experiments, we speculate that cell typing with between 3000 and 5000 genes would be optimal.
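The rule of thumb above could be applied along the following lines. This is a sketch under stated assumptions: the cutoffs (3x the negative-probe mean, a reference expression of 1) are arbitrary placeholders rather than recommended values, and `mean_expr`, `neg_mean` and `ref_profiles` are illustrative names filled with made-up numbers.

```r
# Mean counts per cell for each gene (in practice, e.g. colMeans of a cells x genes matrix).
mean_expr <- c(gene1 = 2.0, gene2 = 0.05, gene3 = 0.04,
               gene4 = 0.5, gene5 = 0.03, gene6 = 0.06)

# Mean negative-probe count per cell, used as the background level.
neg_mean <- 0.05

# Hypothetical reference profiles (genes x cell types).
ref_profiles <- matrix(c(5.0,  0.1,
                         0.2,  3.0,
                         0.1,  0.2,
                         0.3,  0.2,
                         0.01, 0.02,
                         2.0,  0.0),
                       ncol = 2, byrow = TRUE,
                       dimnames = list(names(mean_expr), c("typeA", "typeB")))

# Criterion 1: solidly above background in the spatial data (assumed cutoff: 3x background).
above_background <- mean_expr > 3 * neg_mean

# Criterion 2: moderate-to-high expression in at least one reference profile (assumed cutoff: 1).
ref_max <- apply(ref_profiles[names(mean_expr), , drop = FALSE], 1, max)
expressed_in_ref <- ref_max >= 1

# Keep a gene if either criterion holds.
keep_genes <- names(mean_expr)[above_background | expressed_in_ref]
print(keep_genes)
```

With these toy values, gene3 and gene5 are dropped because they are near background in the data and low in every reference profile.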
Once Insitutype has run, take time to scrutinize the results. You'll need to:
- Confirm cell types from the reference profiles are correct
- Interpret new clusters
First, we recommend the following QC plots:
Example code for generating the above profiles heatmap:
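library(pheatmap)
# Write the heatmap to a PDF; replace the placeholder with your output path
pdf("<writehere.pdf>", height = 20, width = 6)
mat <- res$profiles # ("res" is the insitutype output)
# Scale each gene (row) by its maximum across clusters, floored at 0.1
mat <- sweep(mat, 1, pmax(apply(mat, 1, max), 0.1), "/")
pheatmap(mat, col = colorRampPalette(c("white", "darkblue"))(100),
         fontsize_row = 5)
dev.off()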
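The row scaling in that code divides each gene by its maximum across clusters, with the divisor floored at 0.1. The toy example below, using a made-up two-gene matrix, shows what the floor does: a gene that is near zero everywhere stays pale in the heatmap instead of being stretched to the full color range.

```r
# Made-up profiles: one well-expressed gene and one near-zero gene, across 3 clusters.
mat <- matrix(c(4,    2,    1,
                0.02, 0.01, 0),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("geneHigh", "geneLow"), c("a", "b", "c")))

# Same scaling as in the heatmap code: divide each row by max(row maximum, 0.1).
scaled <- sweep(mat, 1, pmax(apply(mat, 1, max), 0.1), "/")

# For comparison: scaling by the raw row maximum, with no floor.
unfloored <- sweep(mat, 1, apply(mat, 1, max), "/")

print(scaled)
print(unfloored)
```

Without the floor, geneLow would reach 1 in cluster "a" and look as prominent as geneHigh, even though its expression is negligible.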
We have found the below workflows to be effective and efficient:
This is an advanced method. Sometimes it can be hard to subcluster a cell type if many of its genes are impacted by contamination from segmentation errors. Immune cells in the context of tumors are a good example. To subcluster, say, T-cells in a tumor, you might initially call a single T-cell cluster. Then, considering just these cells and just the genes unlikely to be contaminated in T-cells (genes with high T-cell expression or with low expression in surrounding cell types), run unsupervised Insitutype.
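The cell and gene selection step could be sketched as follows. Everything here is illustrative: the expression values are invented, the two cutoffs (2x the highest other cell type, and 0.5) are assumptions, and `profiles`, `clust` and `counts` are hypothetical stand-ins for your cluster profiles, cell type calls and cells x genes count matrix.

```r
# Invented mean profiles (genes x cell types).
profiles <- matrix(c(
  5.0, 0.2, 0.1,   # CD3E
  0.5, 6.0, 0.3,   # EPCAM
  0.4, 0.1, 0.2,   # GZMK
  1.0, 0.8, 7.0    # COL1A1
), ncol = 3, byrow = TRUE,
dimnames = list(c("CD3E", "EPCAM", "GZMK", "COL1A1"),
                c("Tcell", "tumor", "fibroblast")))

target <- "Tcell"
others <- setdiff(colnames(profiles), target)

# Highest expression of each gene in any surrounding cell type.
max_other <- apply(profiles[, others, drop = FALSE], 1, max)

# Genes unlikely to be contaminated: high in the target (assumed: >= 2x any other type)
# or low in all surrounding cell types (assumed: below 0.5).
high_in_target   <- profiles[, target] >= 2 * max_other
low_in_neighbors <- max_other < 0.5
safe_genes <- rownames(profiles)[high_in_target | low_in_neighbors]

# Hypothetical cell type calls and counts (cells x genes).
clust <- c(c1 = "Tcell", c2 = "tumor", c3 = "Tcell", c4 = "fibroblast")
counts <- matrix(1:16, nrow = 4, dimnames = list(names(clust), rownames(profiles)))

# Restrict to the target cells and the selected genes.
sub_counts <- counts[clust == target, safe_genes, drop = FALSE]
print(sub_counts)
```

The resulting matrix is what you would then pass to an unsupervised Insitutype run. In this toy case EPCAM and COL1A1 are excluded, since in T-cells their counts would mostly reflect neighboring tumor cells and fibroblasts.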