ChEBI subsets #105

Open

joeflack4
Collaborator

@joeflack4 joeflack4 commented Oct 28, 2024

So far, this creates a sub-hierarchy of ChEBI containing only the terms that are mapped to LOINC. The main future goal is to use it to create an alternative LOINC hierarchy.

Changes

ChEBI subsets

  • Add: makefile: To add goals for creating these outputs.
  • Update: .gitignore: To include folders for these inputs/outputs.

Results

Google Drive

@joeflack4 joeflack4 marked this pull request as draft October 28, 2024 22:38
@joeflack4 joeflack4 self-assigned this Oct 28, 2024

# input
input/*
!input/owl-files/
Collaborator Author

Where to put ChEBI inputs

My goal is to move all of the sources somewhere under input/. I've decided that will probably be input/sources/, or a similar name that denotes the inputs that actually go into generating our outputs.

As I'm not sure yet if ChEBI will be such an input, I'm putting it in input/analysis/.

I also plan to eventually move owl-files/ and data/ into input/, but won't do that in this PR.

Collaborator Author

@joeflack4 joeflack4 Oct 28, 2024

makefile

@ShahimEssaid The makefile is back. I find that this will likely be the best way to do this work. Let me know if you have any thoughts otherwise.

Added an all goal which just includes chebi-subsets.

chebi-subsets includes:

CHEBI_OUT_BOT=output/analysis/chebi-subset-BOT.owl
CHEBI_OUT_MIREOT=output/analysis/chebi-subset-MIREOT.owl

These goals require this as input:

CHEBI_MODULE=output/analysis/chebi_module.txt

...which queries:

PART_MAPPINGS=loinc_release/Loinc_2.78/AccessoryFiles/PartFile/PartRelatedCodeMapping.csv
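
The chain of goals described above can be pictured as a small dependency graph. The sketch below is illustrative Python, not the makefile itself. It uses only the goal and variable names quoted in this comment, and it assumes that each subset file depends directly on the module file; `build_order` is a hypothetical helper name.

```python
from graphlib import TopologicalSorter

# Each key depends on the targets in its set. The names come from the comment
# above; the exact edges are an assumption about how the makefile wires them.
DEPENDENCIES = {
    "all": {"chebi-subsets"},
    "chebi-subsets": {"CHEBI_OUT_BOT", "CHEBI_OUT_MIREOT"},
    "CHEBI_OUT_BOT": {"CHEBI_MODULE"},
    "CHEBI_OUT_MIREOT": {"CHEBI_MODULE"},
    "CHEBI_MODULE": {"PART_MAPPINGS"},
}


def build_order(dependencies):
    """Return one valid order in which the goals could be satisfied."""
    return list(TopologicalSorter(dependencies).static_order())


order = build_order(DEPENDENCIES)
print(" -> ".join(order))
```

Read from left to right, the printed order starts at the LOINC part mappings file and ends at the all goal, which is the same direction in which make resolves prerequisites.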

Results: Google Drive

Note there are more mappings to be found here: monarch-initiative/monarch-mapping-commons#35

- Add: makefile: To add goals for creating these outputs.
- Update: .gitignore: To include folders for these inputs/outputs.
@joeflack4 joeflack4 force-pushed the chebi-subset branch 2 times, most recently from 8307fa8 to 003d12a Compare October 29, 2024 03:54
@joeflack4 joeflack4 marked this pull request as ready for review October 29, 2024 03:54
@joeflack4 joeflack4 changed the title ChEBI subset ChEBI subsets Oct 29, 2024
makefile Outdated

# todo: bug fix for label comment: Always shows up as ' # ,'. Alternatively, I could just not include the label comment.
$(CHEBI_MODULE): $(PART_MAPPINGS) | output/analysis/
awk -F'",' '/ebi\.ac\.uk\/chebi/ { \
Collaborator Author

Extracting module via awk

@ShahimEssaid For extracting the unique list of mapped ChEBI terms from the LOINC mapping CSV, I wanted to avoid relying on Python, but I may change that for a few reasons:

  1. Windows doesn't ship with awk and similar Unix tools.
  2. I couldn't get it to display commented labels next to the terms.
  3. The awk command, like any CSV parsing done with default Unix commands, is non-trivial and hard to read and maintain.
  4. It has an off-by-one (row) error.
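
As a rough illustration of the Python alternative, here is a minimal sketch that uses only the standard library csv module rather than awk. It is not the code in this PR. The function name `chebi_module_lines` is hypothetical, and the column names `ExtCodeSystem`, `ExtCodeId`, and `ExtCodeDisplayName` are assumptions about the layout of PartRelatedCodeMapping.csv that should be checked against the actual release file. The `term # label` output format is also an assumption about what the downstream extraction step accepts.

```python
import csv
import io


def chebi_module_lines(
    csv_text,
    system_col="ExtCodeSystem",      # assumed column holding the code system URI
    id_col="ExtCodeId",              # assumed column holding the ChEBI term ID
    label_col="ExtCodeDisplayName",  # assumed column holding the term label
):
    """Return sorted, de-duplicated 'TERM # label' lines for ChEBI mappings."""
    # DictReader consumes the header row itself, which avoids off-by-one row
    # handling, and it parses quoted fields that contain commas.
    reader = csv.DictReader(io.StringIO(csv_text))
    labels_by_term = {}
    for row in reader:
        if "ebi.ac.uk/chebi" not in (row.get(system_col) or ""):
            continue
        term = (row.get(id_col) or "").strip()
        if term and term not in labels_by_term:
            labels_by_term[term] = (row.get(label_col) or "").strip()
    return [
        f"{term} # {label}" if label else term
        for term, label in sorted(labels_by_term.items())
    ]


# Tiny made-up sample, not real LOINC data.
sample = (
    "PartNumber,ExtCodeId,ExtCodeDisplayName,ExtCodeSystem\n"
    '"LP1","CHEBI:2","beta, form","https://www.ebi.ac.uk/chebi"\n'
    '"LP2","123","x","http://snomed.info/sct"\n'
    '"LP3","CHEBI:1","alpha","https://www.ebi.ac.uk/chebi"\n'
    '"LP4","CHEBI:2","beta, form","https://www.ebi.ac.uk/chebi"\n'
)
module_lines = chebi_module_lines(sample)
```

Because it needs only the standard library, a version like this would run the same way on Windows, and the label is written from the parsed field rather than from a split on `",`.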

If it's easy enough, could you write an SSSOM adapter and add it to sssom-py? Note that you will have other mappings to process, e.g. monarch-initiative/monarch-mapping-commons#35, since the LOINC part mappings are always incomplete, so it makes sense to try to convert everything to SSSOM first.

Collaborator Author

Hmm, I think the SSSOM part is a separate issue, but thanks for the reminder. I think for this proof of concept a very quick pandas func will suffice.

Are there some docs on what it means to create an "SSSOM adapter" for a given source?
I wonder if that may be non-ideal or even impossible in this case, because the source mapping CSV requires downloading the full LOINC release, which is also behind a license.

@joeflack4 joeflack4 mentioned this pull request Oct 30, 2024
2 tasks
Collaborator Author

@joeflack4 joeflack4 Oct 30, 2024

How will we use these outputs?

One/both of these?

Or do we have other goals in mind?

makefile Outdated
chebi-subsets: $(CHEBI_OUT_BOT) $(CHEBI_OUT_MIREOT)

input/analysis/chebi.owl.gz: | input/analysis/
wget -O $@ ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl.gz

@matentzn matentzn Oct 30, 2024

Ontology PURL

This is more of a general point, but always use PURLs for ontology download locations.

Collaborator Author

@matentzn This is a good idea. Don't we typically use Bioregistry as the central location for getting canonical PURLs?

However, I don't see an .owl URI there, only URIs for prefix maps.

There is a URI for it at Ontobee:

These PURLs have the disadvantage, though, of pointing to the .owl rather than the .owl.gz, which I think is the better option whenever it is available.


@matentzn matentzn Nov 12, 2024

Collaborator Author

@joeflack4 joeflack4 Nov 12, 2024

@matentzn OK that's good. I just pushed a commit and now it's using the PURL!


But what I'm sort of asking is how do I know what the PURL URI is for a given ontology URI? Is there somewhere that I can search?

I thought that place was Bioregistry, but that's not the case here.

I suppose I could just type out http://purl.obolibrary.org/obo/MAIN_ONTOLOGY_SPELLING.owl or http://purl.obolibrary.org/obo/MAIN_ONTOLOGY_SPELLING/MAIN_ONTOLOGY_SPELLING.FILE_EXTENSION and check to see if they exist, but that's not a good UX.
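
For what it's worth, that guess-and-check approach could at least be scripted. The sketch below is only an illustration of the two URL patterns mentioned above, not an official lookup service. The helper names `candidate_purls` and `purl_resolves` are hypothetical, and lowercasing the ontology ID is an assumption about how these PURLs are spelled.

```python
import urllib.error
import urllib.request

OBO_BASE = "http://purl.obolibrary.org/obo"


def candidate_purls(ontology_id, extensions=("owl",)):
    """Build candidate PURLs following the two patterns from the comment above."""
    oid = ontology_id.strip().lower()  # assumption: IDs are lowercase in PURLs
    urls = []
    for ext in extensions:
        urls.append(f"{OBO_BASE}/{oid}.{ext}")
        urls.append(f"{OBO_BASE}/{oid}/{oid}.{ext}")
    return urls


def purl_resolves(url, timeout=10.0):
    """Return True if a HEAD request to the URL succeeds (needs network access)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, TimeoutError, ValueError):
        return False


candidates = candidate_purls("ChEBI", extensions=("owl", "owl.gz"))
```

Calling `purl_resolves` on each candidate would show which ones exist, but that still amounts to probing rather than searching, so it does not fix the UX problem.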

- Delete: Alternative, commented out, variations of subsetting ChEBI.
@joeflack4 joeflack4 force-pushed the main branch 3 times, most recently from 6cdc025 to 957af41 Compare November 11, 2024 02:13
- Update: Download URI: Changed to PURL
@joeflack4 joeflack4 mentioned this pull request Dec 30, 2024
2 tasks
Projects
Status: 3. Review
2 participants