I'm aggregating here all the issues with counts and workarounds we have to use to make them as accurate as possible.
We're generally very close to what the getDatasets endpoint returns.
To deal with over-counting, we cap the counts of selected options by the total number of datasets returned by getDatasets. This is not done in debug mode.
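A minimal sketch of the capping described above, not the actual Gemma code. The function name `cap_counts`, its parameters, and the option names are hypothetical; `total_datasets` stands for the number of datasets returned by getDatasets.

```python
def cap_counts(option_counts, selected, total_datasets, debug=False):
    """Cap the counts of selected options at the total number of datasets.

    All names here are illustrative assumptions, not Gemma identifiers.
    """
    if debug:
        # In debug mode the raw (possibly over-counted) values are kept.
        return dict(option_counts)
    return {
        option: min(count, total_datasets) if option in selected else count
        for option, count in option_counts.items()
    }


raw = {"RNA-Seq": 12, "microarray": 3}
capped = cap_counts(raw, selected={"RNA-Seq"}, total_datasets=10)
uncapped = cap_counts(raw, selected={"RNA-Seq"}, total_datasets=10, debug=True)
```

Only the selected option is capped here; the unselected one keeps its raw count, and debug mode returns the raw counts unchanged.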
ACLs might not always be applied correctly
The queries that perform the counting need to use ACLs in order to count what the current user is entitled to see.
We also have to be careful with the troubled status of platforms and datasets.
Platforms are filtered by both current and original platforms
BioAssays can have two platforms associated with them so that we can keep track of the original platform if it is replaced in the future. This is necessary for RNA-Seq datasets, as we start with a sequencing platform (e.g. Illumina HiSeq 2500) and later switch to a gene list platform (e.g. genes by NCBI IDs). When counting platforms, we combine the usage frequencies from two separate queries, one for current platforms and one for original platforms. The combined counts are generally off by one or two.
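A small sketch of how combining two per-platform frequency queries can drift from the exact count. The rows and the helper `datasets_per_platform` are made up for illustration. The cause shown, a dataset that uses the same platform as current on one BioAssay and as original on another, is an assumption about one possible source of the discrepancy.

```python
from collections import Counter

# Hypothetical rows: (dataset_id, current_platform, original_platform)
bioassays = [
    ("ds1", "gene-list", "HiSeq 2500"),
    ("ds2", "gene-list", "HiSeq 2500"),
    ("ds3", "HiSeq 2500", None),          # not yet switched
    ("ds3", "gene-list", "HiSeq 2500"),   # already switched
]


def datasets_per_platform(rows, column):
    """Count distinct datasets per platform for one platform column."""
    pairs = {(row[0], row[column]) for row in rows if row[column] is not None}
    return Counter(platform for _, platform in pairs)


# Two separate queries whose frequencies are added together.
combined = datasets_per_platform(bioassays, 1) + datasets_per_platform(bioassays, 2)

# Exact count: distinct datasets per platform across both columns at once.
exact_pairs = {
    (ds, platform)
    for ds, current, original in bioassays
    for platform in (current, original)
    if platform is not None
}
exact = Counter(platform for _, platform in exact_pairs)
```

In this toy data `combined["HiSeq 2500"]` is 4 while `exact["HiSeq 2500"]` is 3, because ds3 is counted once by each query.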
Technology Type
This mainly applies to counting how many RNA-Seq datasets we have. The count is obtained by summing the counts of all the sequencing platforms. However, some RNA-Seq datasets might use a mixture of different platforms and are then counted once per platform, resulting in overcounting.
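To illustrate the overcounting, here is a toy comparison between summing per-platform counts and counting distinct datasets. The platform names and dataset IDs are made up for the example.

```python
# Hypothetical mapping of sequencing platform -> datasets using it.
platform_datasets = {
    "HiSeq 2500": {"ds1", "ds2"},
    "NextSeq 500": {"ds2", "ds3"},  # ds2 mixes both platforms
}

# What summing the platform counts yields.
summed = sum(len(datasets) for datasets in platform_datasets.values())

# What a distinct count over datasets would yield.
distinct = len(set().union(*platform_datasets.values()))
```

Here `summed` is 4 and `distinct` is 3; the difference is exactly the one dataset that mixes platforms.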
Terms can appear in multiple categories
Because term restrictions are subclauses, they cannot contain a conjunction on both the category and the term. This means that filtering for "Alzheimer's Disease" will yield results from both the "Disease" and "Disease Model" categories. We mitigate this by adding a clause that ensures that there is at least one annotation with the "Disease" category, but that will not work if a dataset contains an unrelated annotation in the "Disease" category (e.g. "Disease: schizophrenia" together with "Disease Model: Alzheimer's").
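The failure case can be reproduced with a toy model in which each dataset is a set of (category, term) annotations. Both functions are hypothetical stand-ins for the filter clauses, not the real query code.

```python
TERM = "Alzheimer's Disease"

# Hypothetical datasets, each a set of (category, term) annotations.
datasets = {
    "ds_a": {("Disease", TERM)},
    "ds_b": {("Disease Model", TERM)},
    "ds_c": {("Disease", "schizophrenia"), ("Disease Model", TERM)},
}


def matches_mitigated(annotations, category, term):
    """Two independent subclauses: some annotation has the term,
    and some (possibly different) annotation has the category."""
    has_term = any(t == term for _, t in annotations)
    has_category = any(c == category for c, _ in annotations)
    return has_term and has_category


def matches_exact(annotations, category, term):
    """The intended semantics: one annotation carries both."""
    return (category, term) in annotations


mitigated = {ds for ds, a in datasets.items() if matches_mitigated(a, "Disease", TERM)}
exact = {ds for ds, a in datasets.items() if matches_exact(a, "Disease", TERM)}
```

The mitigation correctly excludes `ds_b` but still lets `ds_c` through, since its "Disease" annotation and its "Alzheimer's Disease" annotation are different annotations.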
Implied terms are not included in the counts of a term
If an ontology term implies other terms (e.g. brain implies all of its regions as well as specific brains), the dataset counts of the implied terms are not added to the count of the term itself. See PavlidisLab/Gemma#847
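For reference, a sketch of what including implied terms could look like, assuming a simple parent-to-children map. The ontology fragment, the dataset IDs, and the function `datasets_with_implied` are all hypothetical, and this is not what the counts currently do.

```python
# Hypothetical ontology fragment: term -> directly implied terms.
children = {
    "brain": ["hippocampus", "cerebellum"],
    "hippocampus": [],
    "cerebellum": [],
}

# Hypothetical datasets annotated directly with each term.
direct = {
    "brain": {"ds1"},
    "hippocampus": {"ds2", "ds3"},
    "cerebellum": {"ds3"},
}


def datasets_with_implied(term):
    """Union of datasets annotated with the term or any implied term."""
    result = set(direct.get(term, set()))
    for child in children.get(term, []):
        result |= datasets_with_implied(child)
    return result


direct_count = len(direct["brain"])
inferred_count = len(datasets_with_implied("brain"))
```

The sketch takes a union of dataset sets rather than summing per-term counts, so a dataset annotated with two implied terms (ds3 here) is only counted once: `direct_count` is 1 and `inferred_count` is 3.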