FAQs and advanced methods

Topics

Workflow overview
Choosing the nclust argument
Updating reference profiles
On confidence scores
Which genes to use
Interpreting clustering results
Targeted subclustering

Workflow overview

The broad Insitutype workflow is as follows:

Unsupervised vs. Supervised vs. Semi-supervised cell typing

InSituType runs in 3 modes:

Supervised: call only cell types defined in reference profiles. Set nclust = 0 to run in fully supervised mode.
Unsupervised: de novo clustering, with no reference cell types. Set reference_profiles = NULL to run in unsupervised mode.
Semi-supervised: find new clusters while also calling reference cell types. Supply reference profiles and set nclust to the number of new clusters to look for.
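
As a minimal sketch of how the two arguments map onto the three modes, the helper below reports the mode implied by a given combination. `typing_mode` is a hypothetical function written for illustration, not part of the package, and the reference matrix is toy data.

```r
# Hypothetical helper (not part of InSituType): report which mode a given
# combination of nclust and reference_profiles corresponds to.
typing_mode <- function(nclust, reference_profiles = NULL) {
  has_ref <- !is.null(reference_profiles)
  if (has_ref && nclust == 0) {
    "supervised"
  } else if (has_ref) {
    "semi-supervised"
  } else if (nclust > 0) {
    "unsupervised"
  } else {
    stop("Need reference profiles, nclust > 0, or both")
  }
}

# Toy reference matrix: genes in rows, cell types in columns
ref <- matrix(c(5, 1, 0.2, 4), nrow = 2,
              dimnames = list(c("CD3E", "COL1A1"), c("T cell", "fibroblast")))

typing_mode(0, ref)     # "supervised"
typing_mode(8, ref)     # "semi-supervised"
typing_mode(16, NULL)   # "unsupervised"
```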

Considerations for choosing a workflow:

Supervised is most convenient if you are confident that your reference profiles contain all the cell types in your dataset. However, many reference profiles from scRNA-seq don't fit spatial data well, so using reference profiles can be challenging.
Semi-supervised mode is the most powerful but most challenging workflow. We use this in >80% of analyses. Success hinges on how well the reference profiles are calibrated to spatial data. InSituType tries to perform this calibration using anchor cells, but this does not always succeed.
We recommend trying semi-supervised cell typing first, assuming there are new clusters you expect to discover.
Unsupervised mode is unaffected by poorly calibrated reference profiles, but it requires you to name each cluster, which can be onerous. It may also fail to define distinctions that are important to you.

Choosing reference profiles

Keep in mind the following when selecting reference profiles:

Quality of scRNA-seq references varies greatly. Finding mis-annotated cell types is not uncommon, and for smaller datasets, profiles of rare cell types will be noisy. Exercise some skepticism.
Large platform effects separate scRNA-seq and spatial platforms. When possible, use a reference from the same platform as your data.
A large collection of single cell references can be found here: https://github.com/Nanostring-Biostats/cellprofilelibrary
A growing collection of CosMx references is here: https://github.com/Nanostring-Biostats/CosMx-Cell-Profiles

Choosing nclust

We recommend choosing a slightly generous value of nclust, then using refineClusters to condense the resulting clusters. For example, if you're running semi-supervised cell typing and you expect to find 5 new clusters, set nclust = 8. Or for unsupervised clustering with an expectation of 12 cell types, set nclust = 16. It's generally easy to tell when two clusters come from the same cell type: they'll be adjacent in UMAP space, and the flightpath plot will show them frequently confused with each other.

Final note: Insitutype splits big clusters with higher counts more aggressively than other clusters. For example, in a tumor study, it will subcluster tumor cells many times before it subclusters, say, fibroblasts. The simplest solution is to increase nclust as needed, then condense the over-clustered cell type as desired.
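
The condensing step amounts to relabeling clusters. refineClusters is the package's tool for this; the base R sketch below only illustrates the idea on toy data. The cluster names and the `merges` map are made up for the example.

```r
# Toy cluster assignments, as if from an over-clustered run
clust <- c("a", "b", "c", "a", "d", "b", "c", "d")

# Hypothetical decision: clusters "a" and "b" are both tumor, "c" is
# renamed, and "d" is kept as-is. Names = old clusters, values = new labels.
merges <- c(a = "tumor", b = "tumor", c = "macrophage")

# Apply the map; clusters absent from it keep their original name
condensed <- unname(ifelse(clust %in% names(merges), merges[clust], clust))

# Cell counts per condensed cluster
table(condensed)
```

After condensing, re-check the UMAP and flightpath plot to confirm that the merged clusters do look like a single cell type.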

Updating reference profiles

Cell typing's biggest challenge is using a reference dataset from a different platform. Platform effects between scRNA-seq and spatial platforms can be profound. Insitutype has 3 treatments for reference profiles:

Use the reference profile matrix as-is
Choose anchor cells, then rescale genes based on estimated platform effects. (Less aggressive, only fits gene-level effects.)
Choose anchor cells, then refit the reference profiles entirely. (Most aggressive, fits a new value for every gene x cell type.)
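
To make the difference between the second and third treatments concrete, here is a toy sketch in base R. It is not Insitutype's estimator: the geometric-mean factor, the simulated platform effect, and all variable names are assumptions chosen for illustration.

```r
genes <- paste0("gene", 1:4)
types <- c("T cell", "fibroblast")

# Toy reference profiles (genes x cell types), filled column by column
ref <- matrix(c(10, 1, 2, 8,
                1, 12, 6, 2), nrow = 4,
              dimnames = list(genes, types))

# Simulated mean profiles of anchor cells: a per-gene platform effect...
platform_effect <- c(0.5, 2, 1, 0.25)
anchor_means <- ref * platform_effect
# ...plus one shift that is specific to a single gene x cell type
anchor_means["gene1", "T cell"] <- anchor_means["gene1", "T cell"] * 2

# Rescaling (less aggressive): one factor per gene, here the geometric
# mean across cell types of the anchor/reference ratio
gene_factor <- exp(rowMeans(log(anchor_means / ref)))
ref_rescaled <- ref * gene_factor

# Refitting (most aggressive): a new value for every gene x cell type
ref_refit <- anchor_means

round(gene_factor, 3)
```

Rescaling recovers the gene-level effects for gene2 through gene4 but can only average over the cell-type-specific shift in gene1, while refitting follows the anchor cells exactly and so inherits any noise or mis-selected anchors.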

We suggest using the below flowchart to choose from among these options:

For more on starting with a coarse reference then subclustering, see the "Targeted subclustering" discussion further on.

Confidence Scores

Insitutype returns a posterior probability for each cell type call. In practice, we have found these probabilities to be overconfident. Below is an image from the preprint demonstrating this phenomenon. For various posterior probability bins, it shows the accuracy rate actually achieved (with a confidence interval).

So calls with a posterior probability of 100% appear to be accurate, but lower probabilities are overconfident. Also, remember that these probabilities are based only on the information available to the model. They don't consider that the model might be missing cell types, or that the reference profiles could be incorrect.

In short, the posterior probabilities are useful for differentiating strong from weak cell typing calls, but you should be conservative when choosing a threshold. We often use a threshold of 80%, calling cells below that confidence as "unclassified".
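
A minimal sketch of applying such a threshold, using a toy result list. We assume here that the Insitutype output stores calls and posterior probabilities in elements named `clust` and `prob`; check your own result object before relying on those names.

```r
# Toy stand-in for an Insitutype result (element names are assumptions)
res <- list(
  clust = c("T cell", "fibroblast", "T cell", "macrophage"),
  prob  = c(0.99, 0.62, 0.85, 0.79)
)

# Cells below the threshold are relabeled rather than dropped
threshold <- 0.8
final_calls <- ifelse(res$prob < threshold, "unclassified", res$clust)

final_calls
# "T cell" "unclassified" "T cell" "unclassified"
```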

Which genes to use

Insitutype was designed using 1000-plex CosMx data, where we found it most powerful to use all genes in the panel. In our new 6000-plex data, it's worth considering using Insitutype on a well-chosen subset of genes. As a rule of thumb, genes should be retained if either of the following applies:

They have solidly above-background expression in the CosMx data
They have moderate-to-high expression in at least one reference profile

For typical 6000-plex experiments, we speculate that cell typing using somewhere between 3000 and 5000 genes would be optimal.
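
The rule of thumb can be sketched as a simple filter. Everything below is toy data, and both cutoffs (3x the mean negative-probe count, and a reference expression of at least 1) are hypothetical values to tune for your own data.

```r
genes <- paste0("gene", 1:6)

# Toy inputs: mean counts per cell for each gene, the mean negative-probe
# count per cell, and reference profiles (genes x cell types)
mean_expr <- c(0.50, 0.02, 0.03, 1.20, 0.01, 0.04)
names(mean_expr) <- genes
mean_neg <- 0.02
ref <- matrix(c(3, 0.1, 0.1, 5, 0.05, 2,
                1, 0.2, 0.1, 6, 0.10, 0.1), nrow = 6,
              dimnames = list(genes, c("T cell", "fibroblast")))

# Criterion 1: solidly above background in the spatial data
above_bg <- mean_expr > 3 * mean_neg
# Criterion 2: moderate-to-high expression in at least one reference profile
high_in_ref <- apply(ref, 1, max) >= 1

# Keep a gene if either criterion holds
keep <- above_bg | high_in_ref
genes_to_use <- names(keep)[keep]
genes_to_use
# "gene1" "gene4" "gene6"
```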

Interpreting clustering results

Once Insitutype has run, take time to scrutinize the results. You'll need to:

Confirm cell types from the reference profiles are correct
Interpret new clusters

First, we recommend the following QC plots:

Example code for generating the above profiles heatmap:

library(pheatmap)
pdf("<writehere.pdf>", height = 20, width = 6)
mat <- res$profiles  # ("res" is the insitutype output)
# scale each row by its maximum, flooring the divisor at 0.1
mat <- sweep(mat, 1, pmax(apply(mat, 1, max), 0.1), "/")
pheatmap(mat, col = colorRampPalette(c("white", "darkblue"))(100),
         fontsize_row = 5)
dev.off()

We have found the below workflows to be effective and efficient:

Targeted subclustering

This is an advanced method. Sometimes it can be hard to subcluster a cell type if many of its genes are impacted by contamination from segmentation errors. Immune cells in the context of tumors are a good example. To subcluster, say, T-cells in a tumor, you might initially call a single T-cell cluster. Then, considering just these cells and just the genes unlikely to be contaminated in T-cells (genes with high T-cell expression or with low expression in surrounding cell types), run unsupervised Insitutype.
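
The subsetting step might look like the sketch below. The counts, cluster calls, profiles, and both cutoffs are toy values assumed for illustration; only the selection logic follows the description above.

```r
set.seed(7)
genes <- c("CD3E", "CD8A", "EPCAM", "KRT8", "FOXP3")

# Toy counts (cells x genes) and first-pass cell type calls
counts <- matrix(rpois(8 * 5, lambda = 2), nrow = 8,
                 dimnames = list(paste0("cell", 1:8), genes))
clust <- c("T cell", "tumor", "T cell", "tumor",
           "T cell", "tumor", "tumor", "T cell")

# Toy mean profiles (genes x cell types) from the first pass
profiles <- matrix(c(4, 3, 0.1, 0.1, 1,
                     0.1, 0.1, 6, 5, 0.05), nrow = 5,
                   dimnames = list(genes, c("T cell", "tumor")))

# Keep genes with high T-cell expression or low expression in the
# surrounding cell type (hypothetical cutoffs)
high_in_t <- profiles[, "T cell"] >= 1
low_elsewhere <- profiles[, "tumor"] <= 0.2
keep_genes <- genes[high_in_t | low_elsewhere]

# Restrict to T cells and the retained genes
t_counts <- counts[clust == "T cell", keep_genes, drop = FALSE]
dim(t_counts)
# 4 3
```

The resulting `t_counts` matrix is what you would then pass to unsupervised Insitutype (reference_profiles = NULL) to look for T-cell subtypes.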
