
Counts accuracy #46

Open
arteymix opened this issue Aug 4, 2023 · 0 comments

arteymix commented Aug 4, 2023

I'm aggregating here all the issues with counts and workarounds we have to use to make them as accurate as possible.

We're generally very close to what the getDatasets endpoint returns.

To deal with over-counting, we cap the counts of selected options by the total number of datasets returned by getDatasets. This is not done in debug mode.
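The capping workaround can be sketched as follows; the function and argument names here are hypothetical, not taken from the actual codebase:

```python
def cap_counts(option_counts: dict[str, int], total: int) -> dict[str, int]:
    """Clamp each selected option's count to the total from getDatasets.

    An option can never match more datasets than exist overall, so any
    count above the total is necessarily an over-count.
    """
    return {option: min(count, total) for option, count in option_counts.items()}
```

For example, `cap_counts({"human": 120, "mouse": 80}, 100)` yields `{"human": 100, "mouse": 80}`.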

ACLs might not always be applied correctly

The queries that perform the counting must apply ACLs so that only the datasets the current user is entitled to see are counted.
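A minimal illustration of ACL-aware counting, assuming a simple group-based read model (the real ACL machinery is considerably more involved; the data shape here is made up):

```python
def count_visible(datasets: list[dict], user_groups: set[str]) -> int:
    # Count only the datasets whose ACL grants read access to at least one
    # of the current user's groups; counting without this filter would
    # include datasets the user cannot actually see.
    return sum(1 for d in datasets if d["acl"] & user_groups)
```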

Another thing we have to be careful with is the troubled status of platforms and datasets.

Platforms are filtered by both current and original platforms

BioAssays can have two platforms associated with them so that we can keep track of the original platform if it is replaced in the future. This is necessary for RNA-Seq datasets, as we start with a sequencing platform (e.g., Illumina HiSeq 2500) and later switch to a gene list platform (e.g., genes by NCBI IDs). When counting platforms, we combine the usage frequencies from the two separate queries. These counts are generally off by one or two.
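Combining the two per-association counts can be sketched like this (platform keys are illustrative, not real accessions); a dataset whose current and original platforms differ contributes to both queries, which is one source of the small discrepancies:

```python
from collections import Counter

def combine_platform_counts(by_current: dict[str, int],
                            by_original: dict[str, int]) -> dict[str, int]:
    # Sum the usage frequencies from the two separate queries: one over
    # the current platform association, one over the original.
    return dict(Counter(by_current) + Counter(by_original))
```

For example, `combine_platform_counts({"hiseq-2500": 3}, {"hiseq-2500": 1, "ncbi-genes": 2})` yields `{"hiseq-2500": 4, "ncbi-genes": 2}`.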

Technology Type

This mainly applies to counting how many RNA-Seq datasets we have. It is done by summing the counts of all the sequencing platforms. However, some RNA-Seq datasets may use a mixture of different platforms, resulting in overcounting.
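The over-count is easy to reproduce with a toy model in which each dataset records the set of platforms it uses (illustrative names only):

```python
def count_by_platform_sum(datasets: list[set[str]], sequencing: set[str]) -> int:
    # Current approach: sum the per-platform counts. A dataset that uses
    # two sequencing platforms is counted once per platform.
    return sum(sum(1 for ds in datasets if p in ds) for p in sequencing)

def count_distinct(datasets: list[set[str]], sequencing: set[str]) -> int:
    # Exact approach: count each dataset at most once.
    return sum(1 for ds in datasets if ds & sequencing)
```

With `datasets = [{"hiseq"}, {"hiseq", "nextseq"}]` and `sequencing = {"hiseq", "nextseq"}`, the sum of per-platform counts is 3 while the true number of RNA-Seq datasets is 2.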

Terms can appear in multiple categories

Because term restrictions are subclauses, they cannot contain a conjunction over both the category and the term. This means that filtering for "Alzheimer's Disease" will yield results from both the "Disease" and "Disease Model" categories. We mitigate this by adding a clause that ensures there is at least one annotation with the "Disease" category, but that will not work if a dataset contains an unrelated "Disease" annotation (e.g., "Disease: schizophrenia" alongside "Disease Model: Alzheimer's").
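The failure mode can be reproduced with a toy matcher; the annotation shape and names are hypothetical:

```python
def matches(annotations: list[dict], term: str, category: str) -> bool:
    # The term restriction and the category clause are evaluated as
    # independent subclauses, not as a single (category AND term) pair
    # on the same annotation -- hence the false positive below.
    has_term = any(a["term"] == term for a in annotations)
    has_category = any(a["category"] == category for a in annotations)
    return has_term and has_category
```

A dataset annotated with `[{"category": "Disease", "term": "schizophrenia"}, {"category": "Disease Model", "term": "Alzheimer's"}]` still matches `("Alzheimer's", "Disease")`, even though Alzheimer's only appears as a Disease Model.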

Implied terms are not included in the counts of a term

If an ontology term implies other terms (e.g., brain implies all of its regions as well as more specific brain terms), the dataset counts of the implied terms are not added to the term's own count. See PavlidisLab/Gemma#847
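For reference, the missing roll-up would look roughly like this (a naive sketch with made-up names; a real implementation would have to count distinct datasets, since a dataset annotated with both a term and a term it implies would be double-counted here):

```python
def rolled_up_count(term: str, counts: dict[str, int],
                    implied: dict[str, list[str]]) -> int:
    # Add the term's own dataset count to the counts of every term it
    # implies, transitively walking down the ontology.
    total = counts.get(term, 0)
    for sub in implied.get(term, ()):
        total += rolled_up_count(sub, counts, implied)
    return total
```

With `counts = {"brain": 5, "cortex": 2, "hippocampus": 1}` and `implied = {"brain": ["cortex", "hippocampus"]}`, `rolled_up_count("brain", ...)` yields 8 instead of the 5 reported today.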
