ChEBI subsets #105

Open: wants to merge 3 commits into base: main
Changes from 2 commits

5 changes: 5 additions & 0 deletions .gitignore
@@ -158,3 +158,8 @@ test/input/*
!test/input/.keep
test/output/*
!test/output/.keep

# input
input/*
!input/owl-files/
Collaborator Author

Where to put ChEBI inputs

My goal is to move all of the sources somewhere under input/, probably input/sources/ or a similarly named directory that denotes the inputs that actually go into generating our outputs.

As I'm not sure yet if ChEBI will be such an input, I'm putting it in input/analysis/.

I also plan to eventually move owl-files/ and data/ into input/, but won't do that in this PR.

!input/data/
65 changes: 65 additions & 0 deletions makefile
Collaborator Author

@joeflack4, Oct 28, 2024

makefile

@ShahimEssaid The makefile is back. I think this will likely be the best way to do this work. Let me know if you think otherwise.

Added an all goal which just includes chebi-subsets.

chebi-subsets includes:

CHEBI_OUT_BOT=output/analysis/chebi-subset-BOT.owl
CHEBI_OUT_MIREOT=output/analysis/chebi-subset-MIREOT.owl

These goals require this as input:

CHEBI_MODULE=output/analysis/chebi_module.txt

...which queries:

PART_MAPPINGS=loinc_release/Loinc_2.78/AccessoryFiles/PartFile/PartRelatedCodeMapping.csv

Results: Google Drive

Reply:

Note there are more mappings to be found here: monarch-initiative/monarch-mapping-commons#35

Collaborator Author

@joeflack4, Oct 30, 2024

How will we use these outputs?

One/both of these?

Or do we have other goals in mind?

@@ -0,0 +1,65 @@
.PHONY: all chebi-subsets

# All ------------------------------------------------------------------------------------------------------------------
all: chebi-subsets

# Analysis -------------------------------------------------------------------------------------------------------------
input/analysis/:
	mkdir -p $@

output/analysis/:
	mkdir -p $@

# - ChEBI subsets
PART_MAPPINGS=loinc_release/Loinc_2.78/AccessoryFiles/PartFile/PartRelatedCodeMapping.csv
CHEBI_OWL=input/analysis/chebi.owl
CHEBI_MODULE=output/analysis/chebi_module.txt
CHEBI_OUT_BOT=output/analysis/chebi-subset-BOT.owl
CHEBI_OUT_MIREOT=output/analysis/chebi-subset-MIREOT.owl

chebi-subsets: $(CHEBI_OUT_BOT) $(CHEBI_OUT_MIREOT)

input/analysis/chebi.owl.gz: | input/analysis/
	wget -O $@ ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl.gz
@matentzn, Oct 30, 2024

Ontology PURL

This is more of a general point, but always use PURLs for ontology download locations.

Collaborator Author

@matentzn This is a good idea. Don't we typically use Bioregistry as the central location for getting canonical PURLs?

However, I don't see an .owl URI there, only URIs for prefix maps.

There is a URI for it at Ontobee:

These PURLs have the disadvantage though of pointing to the .owl, not the .owl.gz, which I think is a better option whenever it is available.

@matentzn, Nov 12, 2024

Collaborator Author

@joeflack4, Nov 12, 2024

@matentzn OK that's good. I just pushed a commit and now it's using the PURL!


But what I'm sort of asking is how do I know what the PURL URI is for a given ontology URI? Is there somewhere that I can search?

I thought that place was BioRegistry, but that's not the case here.

I suppose I could just type out http://purl.obolibrary.org/obo/MAIN_ONTOLOGY_SPELLING.owl or http://purl.obolibrary.org/obo/MAIN_ONTOLOGY_SPELLING/MAIN_ONTOLOGY_SPELLING.FILE_EXTENSION and check to see if they exist, but that's not a good UX.
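
A minimal sketch of the manual check described above, in case it is useful for probing candidates. It builds the two URL patterns from this comment and tests each with a HEAD request. The function names `candidate_purls` and `resolves` are hypothetical, and lowercasing the ontology ID and trying both `owl` and `owl.gz` are assumptions, not confirmed OBO conventions for every ontology.

```python
from urllib.request import Request, urlopen

OBO_PURL_BASE = "http://purl.obolibrary.org/obo"


def candidate_purls(ontology_id, extensions=("owl", "owl.gz")):
    """Build both PURL patterns from the comment above for each extension."""
    oid = ontology_id.lower()  # assumption: PURL file names use the lowercase ID
    urls = []
    for ext in extensions:
        urls.append(f"{OBO_PURL_BASE}/{oid}.{ext}")
        urls.append(f"{OBO_PURL_BASE}/{oid}/{oid}.{ext}")
    return urls


def resolves(url, timeout=10.0):
    """Return True if a HEAD request succeeds; needs network access."""
    try:
        # urlopen raises on HTTP errors, so reaching the body means success.
        with urlopen(Request(url, method="HEAD"), timeout=timeout):
            return True
    except (OSError, ValueError):
        return False
```

Calling `candidate_purls("chebi")` yields four URLs to try, and `[u for u in candidate_purls("chebi") if resolves(u)]` keeps the ones that respond. This only automates the guessing; it does not replace a searchable registry of PURLs.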


input/analysis/chebi.owl: input/analysis/chebi.owl.gz
	gunzip -c $< > $@
	rm $<

# todo: bug fix for label comment: Always shows up as ' # ,'. Alternatively, I could just not include the label comment.
$(CHEBI_MODULE): $(PART_MAPPINGS) | output/analysis/
	awk -F'",' '/ebi\.ac\.uk\/chebi/ { \
Collaborator Author

Extracting module via awk

@ShahimEssaid For extracting the unique list of mapped ChEBI terms from the LOINC mapping CSV, I wanted to not rely on Python for this, but maybe I'll change that for a few reasons:

  1. Windows doesn't come with awk, etc.
  2. Couldn't get it to display commented labels next to the terms.
  3. The awk command, and parsing CSV with default unix commands, is still non-trivial. It's hard to read / maintain.
  4. Has an off-by-1 (row) error.
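
For comparison, a rough standard-library sketch of what a Python version of this step could look like. It is not the code in this PR. The function name `chebi_term_lines` is hypothetical, and the column names `ExtCodeId`, `ExtCodeDisplayName`, and `ExtCodeSystem` are assumptions about the header of the mapping CSV. The sample rows are made up for illustration.

```python
import csv
import io

CHEBI_IRI_PREFIX = "http://purl.obolibrary.org/obo/CHEBI_"


def chebi_term_lines(csv_file, id_col="ExtCodeId",
                     label_col="ExtCodeDisplayName",
                     system_col="ExtCodeSystem"):
    """Return unique 'IRI # label' lines for rows whose code system is ChEBI.

    The default column names are assumptions about the CSV header.
    """
    seen = set()
    lines = []
    for row in csv.DictReader(csv_file):  # header row is consumed as field names
        if "ebi.ac.uk/chebi" not in (row.get(system_col) or ""):
            continue
        code = (row.get(id_col) or "").strip()
        if not code.startswith("CHEBI:") or code in seen:
            continue
        seen.add(code)
        iri = CHEBI_IRI_PREFIX + code[len("CHEBI:"):]
        label = (row.get(label_col) or "").strip()
        lines.append(f"{iri} # {label}" if label else iri)
    return lines


# Made-up rows for illustration only; not taken from the LOINC release.
SAMPLE_CSV = (
    '"PartNumber","ExtCodeId","ExtCodeDisplayName","ExtCodeSystem"\n'
    '"LP-FAKE-1","CHEBI:00001","made-up label, with comma","https://www.ebi.ac.uk/chebi"\n'
    '"LP-FAKE-2","12345","not a ChEBI row","http://example.org/other"\n'
    '"LP-FAKE-1","CHEBI:00001","made-up label, with comma","https://www.ebi.ac.uk/chebi"\n'
)
lines = chebi_term_lines(io.StringIO(SAMPLE_CSV))
```

With the real file, the same function could be called on `open(path, newline="")` and the result written one line per term. Using `csv.DictReader` means quoted commas and the header row are handled by the parser, so the label is read from its own column.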

Reply:

If it's easy enough, write an SSSOM adapter and add it to sssom-py? Note that you will have other mappings to process, e.g. monarch-initiative/monarch-mapping-commons#35, since the LOINC part mappings are always incomplete, so it makes sense to try to convert everything to SSSOM first.

Collaborator Author

Hmm, I think the SSSOM part is a separate issue, but thanks for the reminder. I think for this proof of concept a very quick pandas func will suffice.

Are there docs on what it means to create an "SSSOM adapter" for a given source?
I wonder whether that may be non-ideal or impossible in this case, because the source mapping CSV requires downloading the full LOINC release, which is also behind a license.

		split($$0, parts, "\""); \
		for (i=1; i<=NF; i++) { \
			if (parts[i] ~ /CHEBI:/) { \
				id = parts[i]; \
				gsub(".*CHEBI:", "http://purl.obolibrary.org/obo/CHEBI_", id); \
				gsub(",.*", "", id); \
				print id " # " parts[i+1] \
			} \
		} \
	}' $< > $@

# BOT: use the SLME (Syntactic Locality Module Extractor) to extract a bottom module
# - Source: https://robot.obolibrary.org/extract
# - The BOT, or BOTTOM, -module contains mainly the terms in the seed, plus all their super-classes and the
# inter-relations between them. The module is called BOT (or BOTTOM) because it takes a view from the BOTTOM of the
# class-hierarchy upwards. Modules of this type are typically of a medium size and should be used if there is a need to
# include all super-classes in the module. This is the most widely used module type - when in doubt, use this one.
$(CHEBI_OUT_BOT): $(CHEBI_OWL) $(CHEBI_MODULE)
	robot extract --method BOT \
	--input $(CHEBI_OWL) \
	--term-file $(CHEBI_MODULE) \
	--output $@

# MIREOT: Minimum Information to Reference an External Ontology Term
# - Source: https://robot.obolibrary.org/extract
# - To specify upper and lower term files, use --upper-terms and --lower-terms. The upper terms are the upper boundaries
# of what will be extracted. If no upper term is specified, all terms up to the root (owl:Thing) will be returned. The
# lower term (or terms) is required; this is the limit to what will be extracted, e.g. no descendants of the lower term
# will be included in the result.
$(CHEBI_OUT_MIREOT): $(CHEBI_OWL) $(CHEBI_MODULE)
	robot extract --method MIREOT \
	--input $(CHEBI_OWL) \
	--lower-terms $(CHEBI_MODULE) \
	--output $@
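
The two extraction rules above differ only in the method name and in the flag that passes the seed terms. If these calls were ever driven from Python rather than make, the argument lists could be built as in the sketch below. The helper name `robot_extract_args` is hypothetical; the flags are the ones used in the rules above.

```python
# Flag used to pass the seed-term file for each method, as in the makefile rules.
TERM_FLAG_BY_METHOD = {"BOT": "--term-file", "MIREOT": "--lower-terms"}


def robot_extract_args(method, ontology_path, term_file, output_path):
    """Return the argument list for one `robot extract` call."""
    if method not in TERM_FLAG_BY_METHOD:
        raise ValueError(f"unsupported method: {method}")
    return [
        "robot", "extract",
        "--method", method,
        "--input", ontology_path,
        TERM_FLAG_BY_METHOD[method], term_file,
        "--output", output_path,
    ]
```

The returned list can be passed to `subprocess.run(args, check=True)` on a machine where `robot` is installed and on the PATH.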