`fastReseg_flag_all_errors()` is a wrapper function that processes multiple files of one dataset to detect segmentation errors at the transcript level. The function reformats each individual transcript `data.frame` to have unique IDs and a global coordinate system and saves it to disk, then scores each cell for segmentation error and flags transcripts that have low goodness-of-fit to their current cells.
- Counts matrix for entire data set, cells X genes.
- Either a vector of cluster assignments for each cell, or a matrix of genes X clusters reference profiles that could be used for internal cluster assignment.
- Either a `data.frame` of transcript-level information with unique cell IDs, or a `data.frame` with one row per individual per-FOV transcript `data.frame` file, within which the coordinates and cell IDs are unique; its columns include the file path of each per-FOV transcript `data.frame` file and annotation columns such as slide and fov, which are used as prefixes when creating a unique cell_ID across the entire data set.
- additional arguments describing input data structures and output data format.
- additional arguments for finer control on error detection and flagging.
A list, with the following elements:
- refProfiles: a genes x clusters matrix of cluster-specific reference profiles used in the resegmentation pipeline.
- baselineData: a list of two matrices in cluster x percentile format for the cluster-specific percentile distribution of per-cell values; `span_score` is for the average per-molecule transcript tLLR score of each cell, `span_transNum` is for the transcript number of each cell.
- ctrl_genes: a vector of control genes whose transcript scores are set to a fixed value for all cell types; returned when `ctrl_genes` is not `NULL`.
- combined_modStats_ToFlagCells: a `data.frame` of spatial modeling statistics for all cells in the data set, output of the `score_cell_segmentation_error()` function.
- combined_flaggedCells: a list in which each element is a vector of `UMI_cellID` for cells flagged for potential cell segmentation errors within each FOV.
- trimmed_perCellExprs: a gene x cell count sparse matrix in which all putative contaminating transcripts are trimmed; returned when `return_trimmed_perCell = TRUE`.
- flagged_transDF_list: a list of per-FOV transcript `data.frame` with flagging information in the `SVM_class` column; returned when `transDF_export_option = 2`.
When `transDF_export_option = 1`, the function saves each per-FOV output as an individual file in the `path_to_output` directory; `flagged_transDF`, `modStats_ToFlagCells` and `classDF_ToFlagTrans` are each saved as a csv file.
- flagged_transDF: a transcript `data.frame` for each FOV, with columns for the unique IDs of transcripts (`UMI_transID`) and cells (`UMI_cellID`), for the global coordinate system (`x`, `y`, `z`), and for the goodness-of-fit in the original cell segment (`SVM_class`); the original per-FOV cell ID and pixel/index-based coordinate systems are saved under the columns `CellId`, `pixel_x`, `pixel_y`, `idx_z`.
- modStats_ToFlagCells: a `data.frame` of spatial modeling statistics for each cell, output of the `score_cell_segmentation_error()` function.
- classDF_ToFlagTrans: a `data.frame` of the class assignment of transcripts within putative wrongly segmented cells, output of the `flag_bad_transcripts()` function.
`fastReseg_full_pipeline()` is a wrapper for the full resegmentation pipeline using internal reference profiles and cutoffs. This function first estimates proper reference profiles and cutoffs from the provided data and then uses the `fastReseg_perFOV_full_process()` function to process each transcript `data.frame`. For each transcript `data.frame`, the pipeline scores each transcript based on the provided cell type-specific reference profiles, evaluates the goodness-of-fit of each transcript within its original cell segment, identifies the low-score transcript groups within cells that have strong spatial dependency in their transcript score profile, evaluates the neighborhood environment of the low-score transcript groups, and performs resegmentation actions including trimming to extracellular space, merging to a neighbor cell, or labeling as a new cell.
- Counts matrix for entire data set, cells X genes.
- Either a vector of cluster assignments for each cell, or a matrix of genes X clusters reference profiles that could be used for internal cluster assignment.
- Either a `data.frame` of transcript-level information with unique cell IDs, or a `data.frame` with one row per individual per-FOV transcript `data.frame` file, within which the coordinates and cell IDs are unique; its columns include the file path of each per-FOV transcript `data.frame` file and annotation columns such as slide and fov, which are used as prefixes when creating a unique cell_ID across the entire data set.
- additional arguments describing input data structures and output data format.
- additional arguments for finer control on error detection, flagging and correction.
A list, with the following elements:
- refProfiles: a genes X clusters matrix of cluster-specific reference profiles used in the resegmentation pipeline.
- baselineData: a list of two matrices in cluster X percentile format for the cluster-specific percentile distribution of per-cell values; `span_score` is for the average per-molecule transcript tLLR score of each cell, `span_transNum` is for the transcript number of each cell.
- cutoffs_list: a list of cutoffs used in the resegmentation pipeline, including `score_baseline`, `lowerCutoff_transNum`, `higherCutoff_transNum`, `cellular_distance_cutoff`, `molecular_distance_cutoff`.
- ctrl_genes: a vector of control genes whose transcript scores are set to a fixed value for all cell types; returned when `ctrl_genes` is not `NULL`.
- updated_perCellDT: a per-cell `data.table` with mean spatial coordinates, new cell type and resegmentation action after resegmentation; returned when `return_perCellData = TRUE`.
- updated_perCellExprs: a gene x cell count sparse matrix for the updated transcript `data.frame` after resegmentation; returned when `return_perCellData = TRUE`.
- reseg_actions: a list of 4 elements describing how the resegmentation would be performed on the original `transcript_df` by the group assignment of transcripts listed in `groupDF_ToFlagTrans`, output of the `decide_ReSegment_Operations()` function; returned when `save_intermediates = TRUE`.
- updated_transDF_list: a list of per-FOV transcript `data.frame` with updated cell segmentation in the `updated_cellID` and `updated_celltype` columns; returned when `transDF_export_option = 2`.
The pipeline function saves each per-FOV output as an individual file in the `path_to_output` directory; `updated_transDF` is saved as a csv file. When `save_intermediates = TRUE`, all intermediate files and resegmentation outputs of each FOV are saved as a single `.rds` object holding one list with the following elements:
- modStats_ToFlagCells: a `data.frame` of spatial modeling statistics for each cell, output of the `score_cell_segmentation_error()` function; saved when `save_intermediates = TRUE`.
- groupDF_ToFlagTrans: a `data.frame` of the group assignment of transcripts within putative wrongly segmented cells, merged output of the `flag_bad_transcripts()` and `groupTranscripts_Delaunay()` or `groupTranscripts_dbscan()` functions; saved when `save_intermediates = TRUE`.
- neighborhoodDF_ToReseg: a `data.frame` of the neighborhood environment of low-score transcript groups, output of the `get_neighborhood_content()` function; saved when `save_intermediates = TRUE`.
- reseg_actions: a list of 4 elements describing how the resegmentation would be performed on the original `transcript_df` by the group assignment of transcripts listed in `groupDF_ToFlagTrans`, output of the `decide_ReSegment_Operations()` function; saved when `save_intermediates = TRUE`.
- updated_transDF: the updated `transcript_df` with `updated_cellID` and `updated_celltype` columns based on `reseg_full_converter`; written to disk when `transDF_export_option = 1`.
- updated_perCellDT: a per-cell `data.table` with mean spatial coordinates, new cell type and resegmentation action after resegmentation; returned when `return_perCellData = TRUE`.
- updated_perCellExprs: a gene x cell count sparse matrix for the updated transcript `data.frame` after resegmentation; returned when `return_perCellData = TRUE`.
The pipeline also combines per-cell data for all FOVs and returns the combined data when `return_perCellData = TRUE`; `updated_perCellDT` and `updated_perCellExprs` are also saved in a list as a single `.rds` object in the `path_to_output` directory when `transDF_export_option = 1`.
- updated_perCellDT: a per-cell `data.table` with mean spatial coordinates, new cell type and resegmentation action after resegmentation; returned when `return_perCellData = TRUE`.
- updated_perCellExprs: a gene x cell count sparse matrix for the updated transcript `data.frame` after resegmentation; returned when `return_perCellData = TRUE`.
`fastReseg_perFOV_full_process()` is the core wrapper for the resegmentation pipeline using a transcript score matrix derived from external reference profiles and preset cutoffs. This function scores each transcript based on the provided cell type-specific reference profiles, evaluates the goodness-of-fit of each transcript within its original cell segment, identifies the low-score transcript groups within cells that have strong spatial dependency in their transcript score profile, evaluates the neighborhood environment of the low-score transcript groups, and performs resegmentation actions including trimming to extracellular space, merging to a neighbor cell, or labeling as a new cell.
- a gene x cell-type matrix of log-like score of gene in each cell type.
- a `data.frame` for each transcript with columns for transcript_id, target or gene name, original cell_id, spatial coordinates.
- additional arguments describing input data structures and output data format.
- additional arguments for finer control on error detection, flagging and correction.
A list, with the following elements:
- modStats_ToFlagCells: a `data.frame` of spatial modeling statistics for each cell, output of the `score_cell_segmentation_error()` function; returned when `return_intermediates = TRUE`.
- groupDF_ToFlagTrans: a `data.frame` of the group assignment of transcripts within putative wrongly segmented cells, merged output of the `flag_bad_transcripts()` and `groupTranscripts_Delaunay()` or `groupTranscripts_dbscan()` functions; returned when `return_intermediates = TRUE`.
- neighborhoodDF_ToReseg: a `data.frame` of the neighborhood environment of low-score transcript groups, output of the `get_neighborhood_content` function; returned when `return_intermediates = TRUE`.
- reseg_actions: a list of 4 elements describing how the resegmentation would be performed on the original `transcript_df` by the group assignment of transcripts listed in `groupDF_ToFlagTrans`, output of the `decide_ReSegment_Operations_leidenCut` function; returned when `return_intermediates = TRUE`.
- updated_transDF: the updated `transcript_df` with `updated_cellID` and `updated_celltype` columns based on `reseg_full_converter`.
- updated_perCellDT: a per-cell `data.table` with mean spatial coordinates, new cell type and resegmentation action after resegmentation; returned when `return_perCellData = TRUE`.
- updated_perCellExprs: a gene x cell count sparse matrix for the updated transcript `data.frame` after resegmentation; returned when `return_perCellData = TRUE`.
`runPreprocess()` is a modular wrapper to get baseline data and cutoffs from the entire dataset.
- Counts matrix for entire data set, cells X genes.
- Either a vector of cluster assignments for each cell, or a matrix of genes X clusters reference profiles that could be used for internal cluster assignment.
- Either a `data.frame` of transcript-level information with unique cell IDs, or a `data.frame` with one row per individual per-FOV transcript `data.frame` file, within which the coordinates and cell IDs are unique; its columns include the file path of each per-FOV transcript `data.frame` file and annotation columns such as slide and fov, which are used as prefixes when creating a unique cell_ID across the entire data set.
- additional arguments describing input data structures and output data format.
- additional arguments for finer control on error detection, flagging and correction.
- optional inputs of external baseline and cutoffs for transcript scores and transcript number, to skip calculation based on the provided dataset.
A list, with the following elements:
- clust: a vector of cluster assignments for each cell in `counts`, used in calculating `baselineData`.
- refProfiles: a genes X clusters matrix of cluster-specific reference profiles to use in the resegmentation pipeline.
- baselineData: a list of two matrices in cluster X percentile format for the cluster-specific percentile distribution of per-cell values; `span_score` is for the average per-molecule transcript tLLR score of each cell, `span_transNum` is for the transcript number of each cell.
- cutoffs_list: a list of cutoffs to use in the resegmentation pipeline, including `score_baseline`, `lowerCutoff_transNum`, `higherCutoff_transNum`, `cellular_distance_cutoff`, `molecular_distance_cutoff`.
- ctrl_genes: a vector of control genes whose transcript scores are set to a fixed value for all cell types; returned when `ctrl_genes` is not `NULL`.
- score_GeneMatrix: a gene x cell-type score matrix to use in the resegmentation pipeline; the scores for `ctrl_genes` are set to be the same as `svmClass_score_cutoff`.
- processed_1st_transDF: a list of 2 elements for the intracellular and extracellular transcript `data.frame` of the processed outcomes of the 1st transcript file.
The `cutoffs_list` is a list containing
- score_baseline: a named vector of score baseline under each cell type listed in `refProfiles`, such that a per-cell transcript score higher than the baseline is required to call a cell type with high enough confidence.
- lowerCutoff_transNum: a named vector of transcript number cutoff under each cell type, such that a transcript number higher than the cutoff is required to keep the query cell as it is.
- higherCutoff_transNum: a named vector of transcript number cutoff under each cell type, such that a transcript number lower than the cutoff is required to keep the query cell as it is when there is a neighbor cell of consistent cell type.
- cellular_distance_cutoff: maximum cell-to-cell distance in x, y between the center of query cells and the center of neighbor cells with direct contact, unit in micron.
- molecular_distance_cutoff: maximum molecule-to-molecule distance within a connected transcript group, unit in micron.
`runSegErrorEvaluation()` is a modular wrapper to flag cell segmentation errors.
- a gene x cell-type matrix of log-like score of gene in each cell type.
- a `data.frame` for each transcript with columns for transcript_id, target or gene name, cell_id, spatial coordinates.
- additional arguments describing input data structures.
- additional arguments for finer control on error detection and flagging at the cell level.
A list, with the following elements:
- modStats_ToFlagCells: a `data.frame` containing evaluation model statistics in columns for each cell's potential to have segmentation error.
- transcript_df: the transcript `data.frame` with 2 additional columns: `tLLR_maxCellType` for the cell type of maximum transcript score under current segments and `score_tLLR_maxCellType` for the corresponding transcript score of each transcript.
`runTranscriptErrorDetection()` is a modular wrapper to identify transcript groups that fit poorly to their current cell segments in space.
- a vector of cell IDs for cells of interest, typically the cells flagged by the `runSegErrorEvaluation()` function.
- a gene x cell-type matrix of log-like score of gene in each cell type.
- a `data.frame` for each transcript with columns for transcript_id, cell_id, spatial coordinates and transcript score, typically the output of the `runSegErrorEvaluation()` function.
- a string indicating how to group transcripts in space, either "dbscan" or "delaunay".
- additional arguments describing input data structures.
- additional arguments for finer control on error detection at transcript level.
A `data.frame` for transcripts in cells of interest only, containing information for transcript score classifications and spatial group assignments as well as new cell/group IDs for downstream resegmentation.
`prepResegDF()` is a supporting function for the `fastReseg_perFOV_full_process()` function; it combines the `runTranscriptErrorDetection()` output with the transcript `data.frame` to prepare for `runSegRefinement()`.
- a `data.frame` for each transcript with columns for transcript_id, target or gene name, original cell_id, spatial coordinates and the cell type under which each transcript group gives the maximum transcript score, typically the output of the `runSegErrorEvaluation()` function.
- a `data.frame` for transcripts in cells of interest only, with columns for `connect_group`, `tmp_cellID`, `group_maxCellType`, output of the `runTranscriptErrorDetection()` function.
A list, with the following elements:
- reseg_transcript_df: a `data.frame` with transcript_id, target or gene name, x, y, cell_id for all transcript groups in the `tmp_cellID` column and the cell type of maximum transcript score for each transcript group in the `group_maxCellType` column.
- groups_to_reseg: a vector of chosen transcript groups that need to be evaluated for re-segmentation.
`runSegRefinement()` is a modular wrapper to evaluate transcript groups in their neighborhood, decide resegmentation operations and execute them.
- a vector of transcript group IDs for cells of interest, typically the transcript groups in the cells flagged by the `runTranscriptErrorDetection()` function.
- a gene x cell-type matrix of log-like score of gene in each cell type.
- a `data.frame` for each transcript with columns for transcript_id, target or gene name, spatial coordinates, original cell ID, group ID for all transcript groups and the cell type of maximum transcript score for each transcript group, typically the output of the `prepResegDF()` function.
- a named vector of score baseline for all cell types, such that a per-cell transcript score higher than the baseline is required to call a cell type with high enough confidence, typically the output of the `runPreprocess()` function.
- a named vector of transcript number cutoff under each cell type, such that a transcript number higher than the cutoff is required to keep the query cell as it is, typically the output of the `runPreprocess()` function.
- a named vector of transcript number cutoff under each cell type, such that a transcript number lower than the cutoff is required to keep the query cell as it is when there is a neighbor cell of consistent cell type, typically the output of the `runPreprocess()` function.
- additional arguments describing input data structures and output data format.
- additional arguments for finer control on error correction.
A list, with the following elements:
- updated_transDF: the updated `transcript_df` with `updated_cellID` and `updated_celltype` columns based on `reseg_full_converter`.
- neighborhoodDF_ToReseg: a `data.frame` of the neighborhood environment of low-score transcript groups, output of the `get_neighborhood_content()` function; returned when `return_intermediates = TRUE`.
- reseg_actions: a list of 4 elements describing how the resegmentation would be performed on the original `transcript_df` by the group assignment of transcripts listed in `groupDF_ToFlagTrans`, output of the `decide_ReSegment_Operations()` function; returned when `return_intermediates = TRUE`.
- updated_perCellDT: a per-cell `data.table` with mean spatial coordinates, new cell type and resegmentation action after resegmentation; returned when `return_perCellData = TRUE`.
- updated_perCellExprs: a gene x cell count sparse matrix for the updated transcript `data.frame` after resegmentation; returned when `return_perCellData = TRUE`.
`score_cell_segmentation_error()` scores each cell for how much its transcripts change their goodness-of-fit over space. It is a supporting function for the `runSegErrorEvaluation()` modular wrapper function.
- a `data.frame` of transcript_id, cell_id, transcript score, spatial coordinates.
- a cutoff of transcript number to do spatial modeling.
- additional arguments describing input data structures.
A `data.frame` with columns for cell_id, number of transcripts in the given cell, r.squared value of the alternative model where transcript score has significant spatial dependency, lrtest chi-squared value, and p-value for the lrtest probability larger than the chi-squared value.
`flag_bad_transcripts()` finds the spatially connected transcripts among chosen_transcripts based on an SVM spatial model, which scores each cell for how much its transcripts change their goodness-of-fit over space. It is a supporting function for the `runTranscriptErrorDetection()` modular wrapper function.
- a vector of cell IDs for cells of interest, typically the cells flagged by the `runSegErrorEvaluation()` function.
- a gene x cell-type matrix of log-like score of gene in each cell type.
- a `data.frame` for each transcript with columns for transcript_id, transcript score, spatial coordinates, cell_id.
- additional arguments describing input data structures.
- additional arguments for finer control on spatial modeling and error detection.
A `data.frame` with columns for transcript_id, target or gene name, cell_id, spatial coordinates, transcript score, SVM class (0 for below cutoff, 1 for above cutoff), decision values of the SVM model output, and new cell type for each transcript group within each cell.
`groupTranscripts_Delaunay()` groups the flagged transcripts within each cell based on the spatial connectivity of their transcript Delaunay network. It is a supporting function for the `runTranscriptErrorDetection()` modular wrapper function.
- a vector of transcript IDs of interest, typically the transcripts of the cells flagged by the `flag_bad_transcripts()` function.
- a transcript `data.frame` with transcript_id, target or gene name, spatial coordinates and cell_id.
- a configuration list controlling the spatial network generation. For more details, see the manual for the `createSpatialDelaunayNW_from_spatLocs()` function.
- the maximum distance allowed within a connected transcript group.
A `data.frame` for transcripts of interest only, containing columns for cell IDs, transcript IDs, spatial coordinates, and group ID for spatially connected transcripts.
`createSpatialDelaunayNW_from_spatLocs()` generates a Delaunay network based on the provided config and spatial locations. It is a supporting function for `groupTranscripts_Delaunay()`.
- a `data.frame` for the spatial location of each entry, cell or transcript.
- a configuration list controlling the spatial network generation. For more details, see the manual for `GiottoClass::createSpatialNetwork`.
A `delaunay_network_Obj`, a spatial network object created by `GiottoClass` functions. For more details, see the manual for `GiottoClass::createSpatialNetwork`.
`groupTranscripts_dbscan()` groups the flagged transcripts within each cell based on spatial clustering using `dbscan`. It is a supporting function for the `runTranscriptErrorDetection()` modular wrapper function.
- a vector of transcript IDs of interest, typically the transcripts of the cells flagged by the `flag_bad_transcripts()` function.
- a transcript `data.frame` with transcript_id, target or gene name, spatial coordinates and cell_id.
- the maximum distance allowed within a connected transcript group.
A `data.frame` for transcripts of interest only, containing columns for cell IDs, transcript IDs, spatial coordinates, and group ID for spatially connected transcripts.
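
To make "group ID for spatially connected transcripts" concrete, here is a minimal sketch in Python (standard library only), not the package's R code. It links transcripts of the same cell whenever they lie within a maximum distance of each other, which matches density-based clustering only in the degenerate case where every point counts as a core point; the function name, input layout and cutoff value are all hypothetical.

```python
from collections import defaultdict
from math import dist


def group_transcripts(transcripts, distance_cutoff):
    """Assign a group id to each transcript, separately within each cell.

    transcripts: list of (transcript_id, cell_id, (x, y)) tuples (hypothetical layout).
    Returns {transcript_id: group_id}; group ids restart at 1 in every cell.
    """
    by_cell = defaultdict(list)
    for tid, cell, xy in transcripts:
        by_cell[cell].append((tid, xy))

    groups = {}
    for items in by_cell.values():
        parent = list(range(len(items)))  # union-find forest over this cell's transcripts

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path halving
                i = parent[i]
            return i

        # Link every pair of transcripts closer than the cutoff.
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                if dist(items[i][1], items[j][1]) <= distance_cutoff:
                    parent[find(i)] = find(j)

        labels = {}
        for i, (tid, _) in enumerate(items):
            root = find(i)
            labels.setdefault(root, len(labels) + 1)
            groups[tid] = labels[root]
    return groups


demo = [
    ("t1", "c1", (0.0, 0.0)),
    ("t2", "c1", (1.0, 0.0)),
    ("t3", "c1", (10.0, 0.0)),
    ("t4", "c2", (0.5, 0.0)),
]
result = group_transcripts(demo, distance_cutoff=2.0)
print(result)
```

The pairwise loop is quadratic in the number of transcripts per cell, which is acceptable for a sketch; a spatial index would be the usual choice at scale.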
`get_neighborhood_content()` finds neighbor cells with transcripts that are direct neighbors of the chosen cells, checks the transcript score under the neighbor cell type, and returns neighborhood information. It is a supporting function for the `runSegRefinement()` modular wrapper function.
- a vector of transcript group IDs for cells of interest, typically the transcript groups in the cells flagged by the `runTranscriptErrorDetection()` function.
- a gene x cell-type matrix of log-like score of gene in each cell type.
- a named vector of transcript score baseline for all cell types.
- a `data.frame` for each transcript with columns for transcript_id, spatial coordinates, and group IDs for all transcript groups, including the original cell IDs for cells not being flagged.
- additional arguments describing input data structures.
- additional arguments for finer control on defining neighborhood.
A neighborhood information `data.frame` for transcript groups of interest, containing the following columns:
- CellId: original cell or transcript group ID of chosen cells.
- cell_type: original cell type of chosen cells.
- transcript_num: number of transcripts in chosen cells.
- self_celltype: cell type that gives the maximum score for the query cell only.
- score_under_self: score in query cell under its own maximum cell type.
- neighbor_CellId: cell ID of the neighbor cell whose cell type gives the maximum score in the query cell among all neighbors, not including the query cell itself.
- neighbor_celltype: cell type that gives the maximum score in the query cell among all non-self neighbor cells.
- score_under_neighbor: score in query cell under neighbor_celltype.
`decide_ReSegment_Operations()` evaluates neighborhood information against the score and transcript number cutoffs to decide the resegmentation operations. It uses either leiden clustering or geometry statistics to determine whether a merge event is allowed. It is a supporting function for the `runSegRefinement()` modular wrapper function.
- a neighborhood information `data.frame` for transcript groups of interest, typically the output of the `get_neighborhood_content()` function.
- a string indicating whether to use the "leidenCut" (in 2D or 3D) or "geometryDiff" (in 2D only) method to determine whether a cell pair merging event is allowed in space.
- a named vector of score baseline for all cell types, such that a per-cell transcript score higher than the baseline is required to call a cell type with high enough confidence, typically the output of the `runPreprocess()` function.
- a named vector of transcript number cutoff under each cell type, such that a transcript number higher than the cutoff is required to keep the query cell as it is, typically the output of the `runPreprocess()` function.
- a named vector of transcript number cutoff under each cell type, such that a transcript number lower than the cutoff is required to keep the query cell as it is when there is a neighbor cell of consistent cell type, typically the output of the `runPreprocess()` function.
- a cutoff on the spatial constraint for a valid merging event between two source transcript groups.
- additional arguments describing input data structures.
A list, with the following elements:
- cells_to_discard: a vector of cell ID that should be discarded during resegmentation.
- cells_to_update: a named vector of cell ID where the cell ID in name would be replaced with cell_ID in value.
- cells_to_keep: a vector of cell ID that should be kept as it is.
- reseg_full_converter: a single named vector of cell ID to update the original cell ID, with `NA` assigned for `cells_to_discard`.
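
The relationship between the four list elements can be sketched as follows. This is an illustration in Python rather than the package's R code, and it assumes the full converter is simply the union of the three decisions, with Python's `None` standing in for R's `NA`; the helper name and the example IDs are hypothetical.

```python
def build_full_converter(cells_to_discard, cells_to_update, cells_to_keep):
    """Combine the three decisions into one old-ID -> new-ID mapping.

    cells_to_update is an {old_id: new_id} dict (the "named vector");
    kept cells map to themselves and discarded cells map to None.
    """
    converter = {cell: cell for cell in cells_to_keep}
    converter.update(cells_to_update)
    converter.update({cell: None for cell in cells_to_discard})
    return converter


converter = build_full_converter(
    cells_to_discard=["g3"],        # transcript group trimmed to extracellular space
    cells_to_update={"g2": "c5"},   # transcript group merged into neighbor cell c5
    cells_to_keep=["c1", "c5"],     # cells left as they are
)
print(converter)
```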
`update_transDF_ResegActions()` updates the transcript `data.frame` based on the resegmentation actions and calculates the new cell type and mean per-cell spatial coordinates. It is a supporting function for the `runSegRefinement()` modular wrapper function.
- a gene x cell-type matrix of log-like score of gene in each cell type.
- a `data.frame` for each transcript to be updated, containing columns for transcript_id, target or gene name, spatial coordinates, original cell ID, group ID for all transcript groups and the cell type of maximum transcript score for each transcript group, typically the output of the `prepResegDF()` function.
- a single named vector of cell ID to update the original cell ID, typically the output of the `decide_ReSegment_Operations()` function.
- additional arguments describing input data structures and output data format.
A list, with the following elements:
- updated_transDF: the updated `transcript_df` with `updated_cellID` and `updated_celltype` columns based on `reseg_full_converter`.
- perCell_DT: a per-cell `data.table` with mean spatial coordinates and new cell type; returned when `return_perCellDF = TRUE`.
- perCell_expression: a gene x cell count sparse matrix for the updated transcript `data.frame`; returned when `return_perCellDF = TRUE`.
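
As a rough illustration of what applying a converter to a transcript table involves, the Python sketch below remaps group IDs to updated cell IDs, then derives per-cell mean coordinates and a gene x cell count table. It is not the package's implementation: the dict-based row layout is hypothetical, `None` plays the role of `NA`, and the cell-type reassignment step is deliberately left out.

```python
from collections import defaultdict


def apply_reseg_converter(transcripts, converter):
    """Remap transcripts to updated cell IDs and summarise per cell.

    transcripts: list of dicts with keys transcript_id, target, x, y, group_id.
    converter:   {group_id: new_cell_id or None}; None (or a missing key) means discard.
    """
    updated = []
    coords = defaultdict(lambda: [0, 0.0, 0.0])     # cell -> [n, sum_x, sum_y]
    counts = defaultdict(lambda: defaultdict(int))  # gene -> cell -> transcript count
    for row in transcripts:
        new_cell = converter.get(row["group_id"])
        updated.append({**row, "updated_cellID": new_cell})
        if new_cell is None:
            continue  # discarded transcripts do not contribute to any cell
        acc = coords[new_cell]
        acc[0] += 1
        acc[1] += row["x"]
        acc[2] += row["y"]
        counts[row["target"]][new_cell] += 1
    per_cell = {
        cell: {"x": sx / n, "y": sy / n, "n": n}
        for cell, (n, sx, sy) in coords.items()
    }
    return updated, per_cell, counts


transcripts = [
    {"transcript_id": "t1", "target": "CD3E", "x": 0.0, "y": 0.0, "group_id": "c1"},
    {"transcript_id": "t2", "target": "CD3E", "x": 2.0, "y": 0.0, "group_id": "c1"},
    {"transcript_id": "t3", "target": "MS4A1", "x": 4.0, "y": 2.0, "group_id": "g2"},
    {"transcript_id": "t4", "target": "MS4A1", "x": 6.0, "y": 2.0, "group_id": "c5"},
    {"transcript_id": "t5", "target": "CD3E", "x": 9.0, "y": 9.0, "group_id": "g3"},
]
converter = {"c1": "c1", "c5": "c5", "g2": "c5", "g3": None}
updated, per_cell, counts = apply_reseg_converter(transcripts, converter)
print(per_cell)
```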
`prepare_perFOV_transDF()` converts the per-FOV unique IDs and spatial coordinates of cells and transcripts to ones that are unique for the entire dataset. It also converts pixel coordinates to um. It is a supporting function for the `runPreprocess()` modular wrapper function and for the `fastReseg_flag_all_errors()` and `fastReseg_full_pipeline()` pipeline wrapper functions.
- a `data.frame` for per-FOV transcript-level information within which the coordinates and cell IDs are unique.
- a named vector of fov 2D coordinates.
- additional arguments describing input data structures.
- additional arguments for finer control on how to stitch per-FOV data into the entire dataset of multiple FOVs.
A list, with the following elements:
- intraC: a `data.frame` for intracellular transcripts, with `UMI_transID` and `UMI_cellID` as column names for the unique transcript_id and cell_id, and `target` as column name for the target gene name.
- extraC: a `data.frame` for extracellular transcripts, with the same structure as the `intraC` data frame in the returned list.
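
The conversion described above can be pictured with a small Python sketch. Everything specific in it is an assumption made for illustration rather than the package's behaviour: the ID format, the rule that global position equals pixel position times pixel size plus the FOV offset (real instruments may flip or swap axes), and the convention that cell ID 0 marks extracellular transcripts.

```python
def to_global(rows, slide, fov, fov_offset_um, pixel_size_um, zstep_um,
              extracellular_ids=(0,)):
    """Convert one FOV's transcripts to dataset-wide IDs and micron coordinates.

    rows: list of dicts with keys CellId, target, pixel_x, pixel_y, idx_z.
    fov_offset_um: (x, y) position of the FOV origin in microns (assumed additive).
    """
    intra, extra = [], []
    for i, row in enumerate(rows):
        out = {
            # Hypothetical ID scheme: prefix with slide and fov to avoid collisions.
            "UMI_transID": f"t_{slide}_{fov}_{i + 1}",
            "UMI_cellID": f"c_{slide}_{fov}_{row['CellId']}",
            "target": row["target"],
            "x": row["pixel_x"] * pixel_size_um + fov_offset_um[0],
            "y": row["pixel_y"] * pixel_size_um + fov_offset_um[1],
            "z": row["idx_z"] * zstep_um,
        }
        (extra if row["CellId"] in extracellular_ids else intra).append(out)
    return {"intraC": intra, "extraC": extra}


rows = [
    {"CellId": 1, "target": "CD3E", "pixel_x": 10, "pixel_y": 20, "idx_z": 2},
    {"CellId": 0, "target": "MS4A1", "pixel_x": 40, "pixel_y": 60, "idx_z": 0},
]
converted = to_global(rows, slide=1, fov=5, fov_offset_um=(1000.0, 2000.0),
                      pixel_size_um=0.5, zstep_um=0.5)
print(converted["intraC"][0])
```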
`get_baselineCT()` gets the cluster-specific quantile distribution of transcript number and per-cell per-molecule transcript score in the provided cell x gene expression matrix, based on the reference profiles and cell cluster assignment. The function also recommends the cutoffs for transcript score and transcript number to be used in the re-segmentation pipeline, based on the calculated quantile distribution. It is a supporting function for the `runPreprocess()` modular wrapper function.
- a matrix of genes X clusters reference profiles that could be used for internal cluster assignment.
- Counts matrix for entire data set, cells X genes.
- optional input of external cluster assignments for each cell, to skip internal cluster assignment based on the provided reference profiles.
A list, with the following elements:
- span_score: a matrix of average transcript tLLR score per molecule per cell, with each distinct cell type in rows and percentiles at (0%, 25%, 50%, 75%, 100%) in columns.
- span_transNum: a matrix of transcript number per cell, with each distinct cell type in rows and percentiles at (0%, 25%, 50%, 75%, 100%) in columns.
- score_baseline: a named vector of the 25% quantile of cluster-specific per-cell transcript score, to be used as score baseline such that a per-cell transcript score higher than the baseline is required to call a cell type with high enough confidence.
- lowerCutoff_transNum: a named vector of the 25% quantile of cluster-specific per molecule per cell transcript number, to be used as transcript number cutoff such that a transcript number higher than the cutoff is required to keep the query cell as it is.
- higherCutoff_transNum: a named vector of the median value of cluster-specific per molecule per cell transcript number, to be used as transcript number cutoff such that a transcript number lower than the cutoff is required to keep the query cell as it is when there is a neighbor cell of consistent cell type.
- clust_used: a named vector of cluster assignments for each cell used in baseline calculation, with the cell_ID in `counts` as name.
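
The way the recommended cutoffs fall out of the percentile tables can be sketched in Python as below. It starts from per-cell summaries that are assumed to be already computed (cluster, transcript number, mean per-molecule score) and uses linear interpolation between order statistics for the quantiles; whether the package uses the same interpolation rule is an assumption, and the function name and input layout are hypothetical.

```python
from statistics import quantiles


def baseline_from_cells(per_cell):
    """Derive percentile spans and cutoffs per cluster.

    per_cell: list of (cluster, transcript_number, mean_score_per_molecule).
    Each cluster needs at least two cells for the quantile calculation.
    """
    by_cluster = {}
    for cluster, n_trans, score in per_cell:
        entry = by_cluster.setdefault(cluster, {"num": [], "score": []})
        entry["num"].append(n_trans)
        entry["score"].append(score)

    def span(values):
        # Percentiles at 0%, 25%, 50%, 75%, 100%.
        q1, q2, q3 = quantiles(values, n=4, method="inclusive")
        return [min(values), q1, q2, q3, max(values)]

    out = {"span_score": {}, "span_transNum": {}, "score_baseline": {},
           "lowerCutoff_transNum": {}, "higherCutoff_transNum": {}}
    for cluster, entry in by_cluster.items():
        s, n = span(entry["score"]), span(entry["num"])
        out["span_score"][cluster] = s
        out["span_transNum"][cluster] = n
        out["score_baseline"][cluster] = s[1]         # 25% quantile of score
        out["lowerCutoff_transNum"][cluster] = n[1]   # 25% quantile of transcript number
        out["higherCutoff_transNum"][cluster] = n[2]  # median transcript number
    return out


cells = [
    ("T", 100, -1.0), ("T", 200, -0.5), ("T", 300, -0.4), ("T", 400, -0.2), ("T", 500, -0.1),
    ("B", 50, -2.0), ("B", 150, -1.0),
]
baseline = baseline_from_cells(cells)
print(baseline["score_baseline"], baseline["lowerCutoff_transNum"])
```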
`choose_distance_cutoff()` chooses an appropriate cellular distance cutoff and molecular distance cutoff based on the input transcript `data.frame` for downstream resegmentation; the cellular distance cutoff is defined as the search radius for direct neighbor cells, while the molecular distance cutoff is defined as the maximum distance between two neighbor transcripts from the same source cell. It is a supporting function for the `runPreprocess()` modular wrapper function.
- a `data.frame` for per-FOV transcript-level information within which the coordinates and cell IDs are unique.
- additional arguments describing input data structures.
- additional arguments for finer control on sub-sampling the input data for molecular distance cutoff estimation.
A list, with the following elements:
- cellular_distance_cutoff: maximum cell-to-cell distance in x, y between the center of query cells to the center of neighbor cells with direct contact, same unit as input spatial coordinate.
- perCell_coordDT: a `data.table` with cells in rows, and the spatial XY coordinates of the centroid and dimensions of the bounding box in columns.
- molecular_distance_cutoff: maximum molecule-to-molecule distance within a connected transcript group, same unit as the input spatial coordinates; returned if `run_molecularDist = TRUE`.
- distance_profile: a named vector for the quantile profile of the minimal molecular distance between transcripts belonging to different cells, at a step size of 10% quantile; returned if `run_molecularDist = TRUE`.
`scoreGenesInRef()` calculates the log-likelihood score of each gene based on reference expression profiles and returns the centered score matrix. It is a utility function used by other wrapper functions.
- a vector of gene names to score.
- a gene X cell_type expression matrix for reference profiles.
- flag to center the score matrix per gene before return.
- a gene X cell type matrix of loglik score for each gene under each cell type.
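
One plausible reading of this scoring is sketched below in Python. The details are assumptions, not confirmed behaviour of the package: each cell type's profile is normalised to sum to 1, the natural log is taken, and centering subtracts each gene's maximum across cell types so that the best-fitting cell type scores 0. The function name and example values are hypothetical.

```python
from math import log


def score_genes(genes, ref_profiles, center=True):
    """Log-likelihood style score for each gene under each cell type.

    ref_profiles: {gene: {cell_type: expression}}, all values strictly positive.
    """
    cell_types = list(next(iter(ref_profiles.values())))
    # Normalise each cell type's profile by its total expression.
    totals = {ct: sum(row[ct] for row in ref_profiles.values()) for ct in cell_types}
    scores = {}
    for gene in genes:
        loglik = {ct: log(ref_profiles[gene][ct] / totals[ct]) for ct in cell_types}
        if center:
            top = max(loglik.values())  # assumed centering: subtract the per-gene maximum
            loglik = {ct: value - top for ct, value in loglik.items()}
        scores[gene] = loglik
    return scores


ref = {
    "CD3E": {"T": 8.0, "B": 1.0},
    "MS4A1": {"T": 2.0, "B": 9.0},
}
scores = score_genes(["CD3E", "MS4A1"], ref)
print(scores)
```

With max-centering, every score is at most 0, so a strongly negative score marks a gene that is unlikely under that cell type relative to its best match.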
`getCellType_maxScore()` gets the cell type that gives the maximum transcript score. It is a utility function used by other wrapper functions.
- a gene x cell-type matrix of log-like score of gene in each cell type.
- a `data.frame` of transcript-level information with unique cell and transcript IDs.
- additional arguments describing input data structures.
a named vector with cell type in values and cell ID in names.
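
A minimal Python sketch of this idea follows, assuming the per-cell score under a cell type is the sum of its transcripts' gene scores under that type (the aggregation rule is an assumption; the function name, score values and input layout are hypothetical).

```python
from collections import defaultdict


def cell_type_max_score(score_matrix, transcripts):
    """Return {cell_id: cell_type} choosing the type with the highest total score.

    score_matrix: {gene: {cell_type: score}}
    transcripts:  list of (cell_id, gene) pairs, one per transcript.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for cell_id, gene in transcripts:
        for cell_type, score in score_matrix[gene].items():
            totals[cell_id][cell_type] += score
    return {cell: max(by_type, key=by_type.get) for cell, by_type in totals.items()}


score_matrix = {
    "CD3E": {"T": 0.0, "B": -3.0},
    "MS4A1": {"T": -4.0, "B": 0.0},
}
transcripts = [("c1", "CD3E"), ("c1", "CD3E"), ("c1", "MS4A1"), ("c2", "MS4A1")]
best = cell_type_max_score(score_matrix, transcripts)
print(best)
```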
`getScoreCellType_gene()` gets each transcript's score based on the score matrix and the chosen cell type. It is a utility function used by other wrapper functions.
- a gene x cell-type matrix of log-like score of gene in each cell type.
- a `data.frame` of transcript-level information with unique cell and transcript IDs, and cell type.
- additional arguments describing input data structures.
a named vector with score of given cell type in values and transcript_id in names
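
The lookup itself is simple, as the Python sketch below shows; it also averages the looked-up scores per cell, which corresponds to the "average per-molecule transcript score of each cell" idea used for the baseline data. The tuple layout and function names are hypothetical.

```python
from collections import defaultdict


def score_under_cell_type(score_matrix, transcripts):
    """Return {transcript_id: score of its gene under the given cell type}.

    transcripts: list of (transcript_id, cell_id, gene, cell_type) tuples.
    """
    return {tid: score_matrix[gene][cell_type]
            for tid, _cell, gene, cell_type in transcripts}


def mean_score_per_cell(score_matrix, transcripts):
    """Average the per-transcript scores within each cell."""
    per_transcript = score_under_cell_type(score_matrix, transcripts)
    sums, counts = defaultdict(float), defaultdict(int)
    for tid, cell, _gene, _cell_type in transcripts:
        sums[cell] += per_transcript[tid]
        counts[cell] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}


score_matrix = {
    "CD3E": {"T": 0.0, "B": -3.0},
    "MS4A1": {"T": -4.0, "B": 0.0},
}
transcripts = [
    ("t1", "c1", "CD3E", "T"),
    ("t2", "c1", "MS4A1", "T"),
    ("t3", "c2", "MS4A1", "B"),
]
tx_scores = score_under_cell_type(score_matrix, transcripts)
cell_means = mean_score_per_cell(score_matrix, transcripts)
print(tx_scores, cell_means)
```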
`estimate_MeanProfile()` estimates the mean profile of each cluster, given the input cell type assignments. It is a utility function used by other wrapper functions.
- Counts matrix for entire data set, cells X genes.
- a vector of cluster assignments for each cell.
- a vector of scaling factors for each cell.
- a numeric value for expected background in count matrix
A matrix of genes X clusters profiles observed in dataset.