
Counts accuracy #46

Open
arteymix opened this issue Aug 4, 2023 · 0 comments

arteymix commented Aug 4, 2023

I'm aggregating here all the issues with counts and workarounds we have to use to make them as accurate as possible.

We're generally very close to what the getDatasets endpoint returns.

To deal with over-counting, we cap the counts of selected options by the total number of datasets returned by getDatasets. This is not done in debug mode.
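The capping workaround can be sketched as follows; the function and argument names here are hypothetical, not taken from the actual codebase:

```python
def cap_counts(option_counts: dict[str, int], total: int) -> dict[str, int]:
    """Clamp each selected option's count to the total from getDatasets.

    An option can never match more datasets than exist overall, so any
    count above the total is necessarily an over-count.
    """
    return {option: min(count, total) for option, count in option_counts.items()}
```

For example, `cap_counts({"human": 120, "mouse": 80}, 100)` yields `{"human": 100, "mouse": 80}`.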

ACLs might not always be applied correctly

The queries that perform the counting must apply ACLs so that only the datasets the current user is entitled to see are counted.
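A minimal illustration of ACL-aware counting, assuming a simple group-based read model (the real ACL machinery is considerably more involved; the data shape here is made up):

```python
def count_visible(datasets: list[dict], user_groups: set[str]) -> int:
    # Count only the datasets whose ACL grants read access to at least one
    # of the current user's groups; counting without this filter would
    # include datasets the user cannot actually see.
    return sum(1 for d in datasets if d["acl"] & user_groups)
```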

Another thing we have to be careful with is the troubled status of platforms and datasets.

Platforms are filtered by both current and original platforms

BioAssays can have two platforms associated with them so that we can keep track of the original platform if it is replaced in the future. This is necessary for RNA-Seq datasets, as we start with a sequencing platform (e.g., Illumina HiSeq 2500) and later switch to a gene list platform (e.g., genes by NCBI IDs). When counting platforms, we combine the usage frequencies from the two separate queries. These counts are generally off by one or two.
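Combining the two per-association counts can be sketched like this (platform keys are illustrative, not real accessions); a dataset whose current and original platforms differ contributes to both queries, which is one source of the small discrepancies:

```python
from collections import Counter

def combine_platform_counts(by_current: dict[str, int],
                            by_original: dict[str, int]) -> dict[str, int]:
    # Sum the usage frequencies from the two separate queries: one over
    # the current platform association, one over the original.
    return dict(Counter(by_current) + Counter(by_original))
```

For example, `combine_platform_counts({"hiseq-2500": 3}, {"hiseq-2500": 1, "ncbi-genes": 2})` yields `{"hiseq-2500": 4, "ncbi-genes": 2}`.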

Technology Type

This mainly applies to counting how many RNA-Seq datasets we have. It is done by summing the counts of all the sequencing platforms. However, some RNA-Seq datasets may use a mixture of different platforms, resulting in overcounting.
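The over-count is easy to reproduce with a toy model in which each dataset records the set of platforms it uses (illustrative names only):

```python
def count_by_platform_sum(datasets: list[set[str]], sequencing: set[str]) -> int:
    # Current approach: sum the per-platform counts. A dataset that uses
    # two sequencing platforms is counted once per platform.
    return sum(sum(1 for ds in datasets if p in ds) for p in sequencing)

def count_distinct(datasets: list[set[str]], sequencing: set[str]) -> int:
    # Exact approach: count each dataset at most once.
    return sum(1 for ds in datasets if ds & sequencing)
```

With `datasets = [{"hiseq"}, {"hiseq", "nextseq"}]` and `sequencing = {"hiseq", "nextseq"}`, the sum of per-platform counts is 3 while the true number of RNA-Seq datasets is 2.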

Terms can appear in multiple categories

Because term restrictions are subclauses, they cannot contain a conjunction over both the category and the term. This means that filtering for "Alzheimer's Disease" will yield results from both the "Disease" and "Disease Model" categories. We mitigate this by adding a clause that ensures there is at least one annotation with the "Disease" category, but that will not work if a dataset contains an unrelated "Disease" annotation (e.g., "Disease: schizophrenia" alongside "Disease Model: Alzheimer's").
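The failure mode can be reproduced with a toy matcher; the annotation shape and names are hypothetical:

```python
def matches(annotations: list[dict], term: str, category: str) -> bool:
    # The term restriction and the category clause are evaluated as
    # independent subclauses, not as a single (category AND term) pair
    # on the same annotation -- hence the false positive below.
    has_term = any(a["term"] == term for a in annotations)
    has_category = any(a["category"] == category for a in annotations)
    return has_term and has_category
```

A dataset annotated with `[{"category": "Disease", "term": "schizophrenia"}, {"category": "Disease Model", "term": "Alzheimer's"}]` still matches `("Alzheimer's", "Disease")`, even though Alzheimer's only appears as a Disease Model.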

Implied terms are not included in the counts of a term

If an ontology term implies other terms (e.g., brain implies all of its regions as well as more specific brain terms), the dataset counts of the implied terms are not added to the term's own count. See PavlidisLab/Gemma#847
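For reference, the missing roll-up would look roughly like this (a naive sketch with made-up names; a real implementation would have to count distinct datasets, since a dataset annotated with both a term and a term it implies would be double-counted here):

```python
def rolled_up_count(term: str, counts: dict[str, int],
                    implied: dict[str, list[str]]) -> int:
    # Add the term's own dataset count to the counts of every term it
    # implies, transitively walking down the ontology.
    total = counts.get(term, 0)
    for sub in implied.get(term, ()):
        total += rolled_up_count(sub, counts, implied)
    return total
```

With `counts = {"brain": 5, "cortex": 2, "hippocampus": 1}` and `implied = {"brain": ["cortex", "hippocampus"]}`, `rolled_up_count("brain", ...)` yields 8 instead of the 5 reported today.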
