This page provides a brief introduction and getting started guide for clinical data at UCLH.
UCLH clinical data is, by default, provided for research in a format called the OMOP Common Data Model (CDM), sometimes just OMOP for short.
The beauty of the CDM is that it allows for data from different locations to be combined. As OMOP is used by a large and growing number of researchers globally this opens up potential for collaboration and to contribute data to network studies. In addition there are analytical tools that run directly from CDM data including ones provided by OHDSI, the organisation that manages OMOP, and the Darwin EU project.
OMOP consists of two parts :
- a data structure defining required tables and columns.
- a vocabulary offering standardised IDs and names for nearly everything that can happen in a hospital. OMOP includes many other vocabularies (e.g. SNOMED, LOINC etc.) by assigning a unique OMOP concept id to each of their IDs.
The first part could be thought of as a filing cabinet with sections where data can go, the second part as a dictionary, allowing data elements to be looked up and standardised. This 2024 paper describes the OMOP vocabs.
UCLH clinical data are provided as a series of tables in the OMOP format. These can be provided in either parquet or csv format, or already uploaded into a database.
Here you can find details of the tables and columns making up an OMOP instance.
Here we will work through a reduced OMOP example to introduce you to how you can use the data.
We will start by considering four of the OMOP tables and selected fields/columns in each :
Person
Measurement
Observation
Drug_exposure
Person
has one row per patient and a column called person_id
that can link rows in most of the other tables to an individual. It also stores attributes of the individual including birth date, gender and ethnicity.
Measurement
has one row per measurement conducted on the patient and has columns storing the ID of the patient, the identity of the measurement, when it was conducted and the value recorded.
OMOP Table | Selected Table Columns |
---|---|
Person | person_id year_of_birth gender_concept_id gender_source_value |
Measurement | person_id measurement_id measurement_concept_id measurement_date value_as_number value_as_concept_id measurement_source_value |
Any column named *concept_id
contains OMOP concept IDs, integer values that are defined in the OMOP vocabulary where a corresponding name is stored for each ID.
There are different ways of looking up the concept names from IDs. One way is to use the omopcept R package. omopcept
provides a function omop_join_name_all() that will add concept_name
columns for all OMOP concept_id
columns in a table or list of tables. For example for the slimmed down Person
table described above it would add a gender_concept_name
column. These are the concepts for male and female.
gender_concept_id |
gender_concept_name |
---|---|
8532 | FEMALE |
8507 | MALE |
OMOP concepts can also be looked up in Athena an online tool provided by OHDSI, but this manual process would take a long time for more than a few concepts and is less reproducible.
concept_id
is the unique OMOP ID concept_code
is the ID in one of the source vocabularies e.g. SNOMED or LOINC
The OMOP vocabularies have a concept_code
field that contains the identifier in the source vocabulary, but most often you will want to use concept_id which is the unique OMOP ID.
You may notice that there are columns named *_source_value
in both the slimmed Person
and Measurement
tables. These store the values as recorded in the source data before it was mapped to OMOP (the values stored in EPIC in the case of UCLH). You may be tempted to use source value columns in your analyses but this is not recommended. The benefit of using OMOP is that you can combine your data/analysis with other sites because of the standardisation. If you use *_source_value
columns in your analysis you lose the benefit of standardisation. Using source values makes it unlikely that you'll be able to use data from another site in your analysis.
To look at patient attributes associated with measurements (or other omop tables) you can join the person
table onto the measurement
table using person_id
.
In R, code like this could be used to do the join :
library(dplyr)
mp <- Measurement |>
left_join(Person, by="person_id")
Be slightly careful that this creates a table that has multiple rows per patient ID.
Measurements are stored in a question-answer format. The question is represented by measurement_concept_id
& measurement_concept_name
. Answers are represented in value_as_number
for numeric values & value_as_concept_id
for values that can be represented by another concept_id
.
The Drug_exposure
and Observation
tables can be treated similarly to the Measurement
table. These are some of the most useful fields.
OMOP Table | Selected Table Columns |
---|---|
Person | person_id year_of_birth gender_concept_id gender_source_value |
Drug_exposure | person_id drug_exposure_id drug_concept_id drug_exposure_start_date drug_exposure_end_date quantity drug_source_value |
Observation | person_id observation_id observation_concept_id observation_date value_as_number value_as_concept_id value_as_string observation_source_value |
Note that Observation
has an additional column value_as_string
that is not present in Measurement
.
For any clinical event OHDSI defines a single Standard concept_id
. Whilst clinical events may be represented by a range of vocabularies (e.g. SNOMED, LOINC, ICD10) only one will be Standard
. For example the Standard vocabulary for conditions is SNOMED and for drugs is RxNorm or RxNorm Extension. Non-standard concepts can be included in source*
fields but should not be present in *concept_id
fields.
This has been a brief introduction to OMOP at UCLH. YOu can explore the links below to learn more. Also we will will be providing more detailed documentation soon.
-
The Book Of OHDSI A useful comprehensive community resource describing all things OHDSI & OMOP. A little dated now (from 2021).
-
Athena - online OMOP concept lookup provided by OHDSI
-
OMOP CDM - Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) is an open community data standard, designed to standardize the structure and content of observational data and to enable efficient analyses. UCLH using OMOP since 2022 to provide data for research
-
OHDSI - Pronounced Odyssey. Observational Health Data Sciences and Informatics program. Maintains the OMOP Common Data Model
-
omopcept - an R package for querying and visualising omop concepts (with fewer cons!). Developed by Andy South at UCLH.
-
SNOMED CT - A structured clinical vocabulary. All NHS healthcare providers in England must use SNOMED CT for capturing clinical terms within electronic patient record systems. OMOP has a representation of SNOMED concepts.
-
OHDSI community forums where you can browse & ask community questions