Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--fast-run thoroughness #110

Merged
merged 7 commits into from
Dec 9, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 14 additions & 4 deletions .github/workflows/main.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -19,19 +19,29 @@ jobs:
matrix:
python-version: ["3.11"]
steps:
# Setup
# Common setup
- name: Check out repository
uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
- name: Install Poetry
- name: Install package manager
uses: snok/[email protected]
- name: Install dependencies & project
- name: Install Python dependencies
run: poetry install --no-interaction
# Download additional non-Python dependencies
- name: Download additional dependencies
run: wget https://github.com/ontodev/robot/releases/latest/download/robot.jar
# Download and extract build sources
- name: Download inputs / sources
run: |
curl -L -c cookies.txt 'https://drive.google.com/uc?export=download&id=1sXgC22eWhjbj7ueMNFFKf-jvANtAbrvE' -o confirmation.html
curl -L -b cookies.txt "https://drive.usercontent.google.com/download?id=1sXgC22eWhjbj7ueMNFFKf-jvANtAbrvE&export=download&confirm=t" -o build-sources.zip
unzip build-sources.zip
rm build-sources.zip cookies.txt confirmation.html
# Build
- name: Run test suite
- name: Build
run: poetry run comploinc --fast-run build
# Test
- name: Run test suite
Expand Down
17 changes: 14 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -75,15 +75,26 @@ See: `comploinc_config.yaml`

If following the setup exactly, this configuration will not need to be modified.

## Tests
## Developer docs
<details><summary>Details</summary>

### Tests: prerequisites
### Tests
#### Tests: prerequisites
1. [`robot`](https://robot.obolibrary.org/)
2. Files in `output/build-default/fast-run/`
- Can populate via `comploinc --fast-run build default`

### Tests: Running
#### Tests: Running
`python -m unittest discover`

### Standard operating procedures (SOPs)
#### Setting up new/updated inputs/sources
1. Create a new `YYYY-MM-DD_comploinc-build-sources.zip` in the [Google Drive folder](
https://drive.google.com/drive/folders/1Ae5NX959S_CV60nbf9N_37Ao-wJzM0fh). Ensure it has the correct structure (folder
names and files at the right paths).
2. Make the link public: In the Google Drive folder, right-click the file, select "Share", and click "Share." At the
bottom, under "General Access", click the left dropdown and select "Anyone with the link." Click "Copy link".
3. Update `DL_LINK_ID` in GitHub: Go to the [page](https://github.com/loinc/comp-loinc/settings/secrets/actions/DL_LINK_ID)
for updating it. Paste the link from the previous step into the box, and click "Update secret."

</details>
1 change: 1 addition & 0 deletions confirmation.html
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
<html lang="en" dir=ltr><meta charset=utf-8><meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width"><title>Error 400 (Bad Request)!!1</title><style nonce="7gt0noa29gJbTb3qtKYuog">*{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{color:#222;text-align:unset;margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px;}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}pre{white-space:pre-wrap;}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}</style><main id="af-error-container" role="main"><a href=//www.google.com><span id=logo aria-label=Google role=img></span></a><p><b>400.</b> <ins>That’s an error.</ins><p>The server cannot process the request because it is malformed. It should not be retried. <ins>That’s all we know.</ins></main>
1 change: 1 addition & 0 deletions confirmation2.html
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
<!DOCTYPE html><html><head><title>Google Drive - Virus scan warning</title><meta http-equiv="content-type" content="text/html; charset=utf-8"/><style nonce="LnQA9tD8jvxUHuxoFSb9nQ">.goog-link-button{position:relative;color:#15c;text-decoration:underline;cursor:pointer}.goog-link-button-disabled{color:#ccc;text-decoration:none;cursor:default}body{color:#222;font:normal 13px/1.4 arial,sans-serif;margin:0}.grecaptcha-badge{visibility:hidden}.uc-main{padding-top:50px;text-align:center}#uc-dl-icon{display:inline-block;margin-top:16px;padding-right:1em;vertical-align:top}#uc-text{display:inline-block;max-width:68ex;text-align:left}.uc-error-caption,.uc-warning-caption{color:#222;font-size:16px}#uc-download-link{text-decoration:none}.uc-name-size a{color:#15c;text-decoration:none}.uc-name-size a:visited{color:#61c;text-decoration:none}.uc-name-size a:active{color:#d14836;text-decoration:none}.uc-footer{color:#777;font-size:11px;padding-bottom:5ex;padding-top:5ex;text-align:center}.uc-footer a{color:#15c}.uc-footer a:visited{color:#61c}.uc-footer a:active{color:#d14836}.uc-footer-divider{color:#ccc;width:100%}.goog-inline-block{position:relative;display:-moz-inline-box;display:inline-block}* html .goog-inline-block{display:inline}*:first-child+html .goog-inline-block{display:inline}sentinel{}</style><link rel="icon" href="//ssl.gstatic.com/docs/doclist/images/drive_2022q3_32dp.png"/></head><body><div class="uc-main"><div id="uc-dl-icon" class="image-container"><div class="drive-sprite-aux-download-file"></div></div><div id="uc-text"><p class="uc-warning-caption">Google Drive can't scan this file for viruses.</p><p class="uc-warning-subcaption"><span class="uc-name-size"><a href="/open?id=1sXgC22eWhjbj7ueMNFFKf-jvANtAbrvE">2024-09-12_comploinc-build-sources.zip</a> (225M)</span> is too large for Google to scan for viruses. Would you still like to download this file?</p><form id="download-form" action="https://drive.usercontent.google.com/download" method="get"><input type="submit" id="uc-download-link" class="goog-inline-block jfk-button jfk-button-action" value="Download anyway"/><input type="hidden" name="id" value="1sXgC22eWhjbj7ueMNFFKf-jvANtAbrvE"><input type="hidden" name="export" value="download"><input type="hidden" name="confirm" value="t"><input type="hidden" name="uuid" value="85671034-4f6b-4e95-90b4-545aabbe6772"></form></div></div><div class="uc-footer"><hr class="uc-footer-divider"></div></body></html>
4 changes: 4 additions & 0 deletions cookies.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
# Netscape HTTP Cookie File
# https://curl.se/docs/http-cookies.html
# This file was generated by libcurl! Edit at your own risk.

4 changes: 4 additions & 0 deletions src/comp_loinc/config.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
"""Additional internal configuration"""
FAST_RUN_N_TERMS = 5000
FAST_RUN_N_PARTS = 100

29 changes: 26 additions & 3 deletions src/comp_loinc/loinc_builder_steps.py
Copy link
Collaborator Author

@joeflack4 joeflack4 Nov 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many methods to add --fast-run processing to?

The only ones Shahim set originally to --fast-run:

  • loinc_terms_all()
  • loinc_parts_all()

Joe added. This one was taking the most time:

  • SNOMED Builder: loinc_part_mapping_instances()

Joe has now added to all methods, but wondering:

  • 1. Is the extra performance that helpful?
  • 2. Does adding this processing to too many methods introduce any kind of inconsistencies between the files?
    • ...which may pose more problematic if they are combined?
    • If so, does it matter?

Compare: The performance gains between (A) only 3 methods, and (B) all methods (comment).

Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@
from linkml_runtime.linkml_model import ClassDefinition, SlotDefinition, Annotation

from comp_loinc import Runtime
from comp_loinc.config import FAST_RUN_N_PARTS, FAST_RUN_N_TERMS
from comp_loinc.datamodel import (
LoincPart,
LoincPartId,
Expand Down Expand Up @@ -121,7 +122,7 @@ def loinc_terms_all(
count = 0
for node in self.runtime.graph.get_nodes(LoincNodeType.LoincTerm):
count += 1
if self.configuration.fast_run and count > 10000:
if self.configuration.fast_run and count > FAST_RUN_N_TERMS:
break
if count % 1000 == 0:
logger.debug(f"Finished {count}")
Expand Down Expand Up @@ -161,8 +162,8 @@ def loinc_parts_all(

count = 0
for node in self.runtime.graph.get_nodes(LoincNodeType.LoincPart):
count = count + 1
if self.configuration.fast_run and count > 100:
count += 1
if self.configuration.fast_run and count > FAST_RUN_N_PARTS:
break
status = node.get_property(LoincPartProps.status)
if active_only and status != "ACTIVE":
Expand Down Expand Up @@ -197,8 +198,12 @@ def entity_labels(self):
loinc_tree_loader.load_method_tree()
loinc_tree_loader.load_document_tree()

count = 0
loinc_term: LoincTerm
for loinc_term in self.runtime.current_module.get_entities_of_type(LoincTerm):
count += 1
if self.configuration.fast_run and count > FAST_RUN_N_TERMS:
break
node = self.runtime.graph.get_node_by_code(
type_=LoincNodeType.LoincTerm, code=loinc_term.id
)
Expand Down Expand Up @@ -253,8 +258,12 @@ def entity_annotations(self):
loinc_tree_loader.load_method_tree()
loinc_tree_loader.load_document_tree()

count = 0
loinc_term: LoincTerm
for loinc_term in self.runtime.current_module.get_entities_of_type(LoincTerm):
count += 1
if self.configuration.fast_run and count > FAST_RUN_N_TERMS:
break
node = self.runtime.graph.get_node_by_code(
type_=LoincNodeType.LoincTerm, code=loinc_term.id
)
Expand Down Expand Up @@ -364,7 +373,11 @@ def loinc_part_hierarchy_all(self):
loinc_tree_loader.load_method_tree()
loinc_tree_loader.load_document_tree()

count = 0
for child_part_node in graph.get_nodes(type_=LoincNodeType.LoincPart):
count += 1
if self.configuration.fast_run and count > FAST_RUN_N_PARTS:
break
child_part_number = child_part_node.get_property(
type_=LoincPartProps.part_number
)
Expand Down Expand Up @@ -406,10 +419,14 @@ def loinc_term_primary_def(self):
loinc_loader = LoincLoader(graph=graph, configuration=self.configuration)
loinc_loader.load_accessory_files__part_file__loinc_part_link_primary_csv()

count = 0
loinc_term: LoincTerm
for loinc_term in self.runtime.current_module.get_entities_of_type(
entity_class=LoincTerm
):
count += 1
if self.configuration.fast_run and count > FAST_RUN_N_TERMS:
break
loinc_term_id = loinc_term.id
loinc_term_node = self.runtime.graph.get_node_by_code(
type_=LoincNodeType.LoincTerm, code=loinc_term_id
Expand Down Expand Up @@ -510,10 +527,14 @@ def loinc_term_supplementary_def(self):
loinc_loader = LoincLoader(graph=graph, configuration=self.configuration)
loinc_loader.load_accessory_files__part_file__loinc_part_link_supplementary_csv()

count = 0
loinc_term: LoincTerm
for loinc_term in self.runtime.current_module.get_entities_of_type(
entity_class=LoincTerm
):
count += 1
if self.configuration.fast_run and count > FAST_RUN_N_TERMS:
break
loinc_term_id = loinc_term.id
loinc_term_node = self.runtime.graph.get_node_by_code(
type_=LoincNodeType.LoincTerm, code=loinc_term_id
Expand Down Expand Up @@ -618,6 +639,8 @@ def loinc_term_class_roots(self):
entity_class=LoincTerm
):
count += 1
if self.configuration.fast_run and count > FAST_RUN_N_TERMS:
break
if count % 1000 == 0:
logger.debug(f"Finished {count}")
loinc_term_node = self.runtime.graph.get_node_by_code(
Expand Down
5 changes: 5 additions & 0 deletions src/comp_loinc/snomed_builder_steps.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@
import typing as t

from comp_loinc import Runtime
from comp_loinc.config import FAST_RUN_N_PARTS
from comp_loinc.datamodel import SnomedConcept
from loinclib import (
Configuration,
Expand Down Expand Up @@ -47,7 +48,11 @@ def loinc_part_mapping_instances(self):
loader = LoincSnomedLoader(config=self.configuration, graph=self.runtime.graph)
loader.load_part_mapping()

count = 0
for part in self.runtime.graph.get_nodes(type_=LoincNodeType.LoincPart):
count = count + 1
if self.configuration.fast_run and count > FAST_RUN_N_PARTS:
break
for edge in part.get_all_out_edges():
if edge.edge_type.type_ is SnomedEdges.maps_to:
concept_id = edge.to_node.get_property(
Expand Down
39 changes: 27 additions & 12 deletions test/test.py
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,10 @@
from utils import sparql_ask, sparql_select


N_CLASSES_EXPECTED = 50 # arbitrary
N_CHILDREN_EXPECTED = 10 # arbitrary


class CompLoincTest(unittest.TestCase):
"""CompLOINC tests"""

Expand All @@ -45,22 +49,22 @@ def _part_model_unique_prop_counts(df: pd.DataFrame) -> Set[int]:
prop_counts: Dict[str, int] = df.groupby('class').size().to_dict()
return set(prop_counts.values())

def _test_n_classes(self, onto_filename: str, n=50) -> pd.DataFrame:
def _test_n_classes(self, onto_filename: str, n: int = N_CLASSES_EXPECTED) -> pd.DataFrame:
"""Test that n classes are present"""
df: pd.DataFrame = sparql_select(onto_filename, 'all-classes.sparql')
self.assertGreater(len(df), n)
self.assertGreaterEqual(len(df), n)
return df

def _test_no_nulls(self, df: pd.DataFrame):
"""Ensure no NULLs in DataFrame"""
self.assertFalse(bool(df.isna().any().any()))

def _tests_common(self, onto_filename: str):
def _tests_common(self, onto_filename: str, n_classes_expected: int = N_CLASSES_EXPECTED):
"""Common tests to run on all artefacts"""
df: pd.DataFrame = self._test_n_classes(onto_filename)
df: pd.DataFrame = self._test_n_classes(onto_filename, n_classes_expected)
self._test_no_nulls(df)

def _test_child_counts(self, onto_filename: str, n=10) -> pd.DataFrame:
def _test_child_counts(self, onto_filename: str, n: int = N_CHILDREN_EXPECTED) -> pd.DataFrame:
"""Test variation in child counts

:param n: How many children to expect. Default=10 (arbitrary)
Expand Down Expand Up @@ -89,8 +93,15 @@ def test_terms_list(self):
# Test that all top level branches have (arbitrary number of) descendants
df2: pd.DataFrame = sparql_select(onto_filename, 'terms-tree-top-branch-descendants.sparql')
branch_desc_counts: Dict[str, int] = df2.groupby('branch').size().to_dict()
for v in branch_desc_counts.values():
self.assertGreater(v, 5)
# - Set based on total observed descendants using --fast-run on 2024/11/12
expected = {
'https://loinc.org/LTC___Claims_attachments': 4,
'https://loinc.org/LTC___Clinical': 1459,
'https://loinc.org/LTC___Laboratory': 3031,
'https://loinc.org/LTC___Surveys': 451
}
for branch, count in branch_desc_counts.items():
self.assertGreaterEqual(count, expected[branch])

# Test variation in child counts
self._test_child_counts(onto_filename)
Expand Down Expand Up @@ -162,24 +173,28 @@ def test_term_supplementary_model(self):
self._test_no_nulls(df)

def test_snomed_part_mappings(self):
"""Tests for SNOMED part mappings"""
"""Tests for SNOMED part mappings

todo: `n` here is quite low for several assertions. This is due to the limited output that --fast-run provides.
see: https://github.com/loinc/comp-loinc/issues/116
"""
onto_filename = 'snomed-parts.owl'
main_branch = 'https://loinc.org/138875005' # SCT SNOMED CT Concept (SNOMED RT+CTV3)
self._tests_common(onto_filename)
self._tests_common(onto_filename, 15)
df: pd.DataFrame = sparql_select(onto_filename, 'top-branches.sparql')

# Test for a number of top level branches
self.assertGreaterEqual(len(df), 5) # n=arbitrary
self.assertGreaterEqual(len(df), 1)

# Test main branch exists
self.assertIn(main_branch, df['class'].values)

# Test variation in child counts
df2: pd.DataFrame = self._test_child_counts(onto_filename)
df2: pd.DataFrame = self._test_child_counts(onto_filename, 3)

# Test main branch has several children
main_branch_children: int = list(df2[df2['class'] == main_branch]['directSubclassCount'])[0]
self.assertGreaterEqual(main_branch_children, 5) # n=arbitrary
self.assertGreaterEqual(main_branch_children, 1)

def test_loinc_snomed_ontology_equivalence(self):
"""Tests for equivalence between CompLOINC and LOINC(-SNOMED) Ontology"""
Expand Down
2 changes: 1 addition & 1 deletion test/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@
except ModuleNotFoundError:
from config import ROBOT_JAR_PATH, TEST_IN_DIR, TEST_OUT_DIR, TEST_SPARQL_QEURY_DIR

robot_cmd_options = ['robot', ROBOT_JAR_PATH]
robot_cmd_options = ['robot', f'java -jar {ROBOT_JAR_PATH}']


def _get_paths(onto_filename: str, sparql_filename: str) -> tuple[str, str, str]:
Expand Down
Loading