Pipeline to provide evidence strings for Open Targets from PharmGKB
The pipeline only requires Python 3.8+.
Clone the repository (or download a tagged release)
and run python setup.py install
.
(For EVA users, you have to manually run the deployment script for now, pending automated deployment.)
For EVA, you should log on to Codon SLURM cluster and become
the EVA production user,
then refer to the private repository for values.
# The directory where subdirectories for each batch will be created
export BATCH_ROOT_BASE=
# Code location where repository is cloned
export CODE_ROOT=
# Path to GRCh38 RefSeq FASTA file
export FASTA_PATH=
# Year and month for the upcoming Open Targets release.
# For example, if you're processing data for “20.02” release, this variable will be set to `2020-02`.
export OT_RELEASE=YYYY-MM
# Create directory structure for holding all files for the current batch.
export BATCH_ROOT=${BATCH_ROOT_BASE}/batch-${OT_RELEASE}
export DATA_DIR=${BATCH_ROOT}/data
mkdir -p ${BATCH_ROOT} ${DATA_DIR}
cd ${BATCH_ROOT}
# Download data
wget https://api.pharmgkb.org/v1/download/file/data/clinicalAnnotations.zip
wget https://api.pharmgkb.org/v1/download/file/data/variants.zip
wget https://api.pharmgkb.org/v1/download/file/data/relationships.zip
unzip -j clinicalAnnotations.zip "*.tsv" -d $DATA_DIR
unzip -j clinicalAnnotations.zip "CREATED*.txt" -d $DATA_DIR
unzip -j variants.zip "*.tsv" -d $DATA_DIR
unzip -j relationships.zip "*.tsv" -d $DATA_DIR
rm clinicalAnnotations.zip variants.zip relationships.zip
# Set the created date
export CREATED_DATE=`ls $DATA_DIR/CREATED*.txt | sed 's/.*\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\).*/\1/'`
generate_evidence.py --data-dir $DATA_DIR --fasta $FASTA_PATH --created-date $CREATED_DATE --output-path evidence.json
# One-liner for EVA on SLURM
sbatch -t 02:00:00 --mem=8G -J pharmgkb-evidence -o pharmgkb-evidence.out -e pharmgkb-evidence.err \
--wrap="${CODE_ROOT}/env/bin/python ${CODE_ROOT}/bin/generate_evidence.py --data-dir $DATA_DIR --fasta $FASTA_PATH --created-date $CREATED_DATE --output-path evidence.json"
Update the metrics spreadsheet based on the output of the pipeline.
The evidence string file (evidence.json
) must be uploaded to the Open Targets Google Cloud Storage to the pharmacogenomics
folder and be named in the format cttv012-[yyyy]-[mm]-[dd].json.gz
(e.g. cttv012-2020-10-21.json.gz
).
Once the upload is complete, send an email to Open Targets (data [at] opentargets.org) containing the following information:
- The number of submitted evidence strings
- The PharmGKB release date
- The Ensembl release
- The EFO version used for mapping
- The
opentargets-pharmgkb
pipeline version - The Open Targets JSON schema version
Unless otherwise mentioned, data is taken directly from PharmGKB.
Field | Description | Example |
---|---|---|
datasourceId | Identifier for data source | "pharmgkb" |
datasourceVersion | Date when data dump was generated, formatted YYYY-MM-DD | "2023-08-05" |
datatypeId | Type of data corresponding to this evidence string (currently only clinical annotation) | "clinical_annotation" |
studyId | Clinical Annotation ID | "1449309937" |
evidenceLevel | Level of evidence (see here) | "1A" |
literature | List of PMIDs associated with this clinical annotation | ["11389482", "27857962"] |
genotypeId | VCF-style (chr_pos_ref_allele1,allele2 ) identifier of genotype; computed as described below |
"19_38499645_GGAG_G,GGAG" |
variantRsId | RS ID of variant | "rs121918596" |
variantFunctionalConsequenceId | Sequence Ontology term, from VEP | "SO_0001822" |
targetFromSourceId | Ensembl stable gene ID, from VEP (rsIDs) or PGKB mapped through BioMart (named alleles) | "ENSG00000196218" |
genotype | Genotype or allele string | SNP "TA" , indel "del/GAG" , repeat "(CA)16/(CA)17" , named allele "*6" |
genotypeAnnotationText | Full annotation string for genotype or allele | "Patients with the rs121918596 del/GAG genotype may develop malignant hyperthermia when treated with volatile anesthetics [...]" |
directionality | Allele function annotation (see Table 2 here) | "Decreased function" |
haplotypeId | Name of haplotype; can be an allele or a genotype | "CYP2B6*6" or "GSTT1 non-null/non-null" |
haplotypeFromSourceId | Internal PGKB identifier for the haplotype | "PA165818762" |
drugs | List of drugs (see below) | [{"drugFromSource": "ivacaftor"}, {"drugFromSource": "lumacaftor"}] |
pgxCategory | Pharmacogenomics phenotype category | "toxicity" |
phenotypeText | Phenotype name | "Malignant Hyperthermia" |
phenotypeFromSourceId | EFO ID of phenotype, mapped through ZOOMA / OXO | "Orphanet_423" |
Below is an example of a complete clinical annotation evidence string:
{
"datasourceId": "pharmgkb",
"datasourceVersion": "2023-08-05",
"datatypeId": "clinical_annotation",
"studyId": "1449309937",
"evidenceLevel": "1A",
"literature": [
"11389482",
"27857962"
],
"genotypeId": "19_38499645_GGAG_G,GGAG",
"variantRsId": "rs121918596",
"variantFunctionalConsequenceId": "SO_0001822",
"targetFromSourceId": "ENSG00000196218",
"genotype": "del/GAG",
"genotypeAnnotationText": "Patients with the rs121918596 del/GAG genotype may develop malignant hyperthermia when treated with volatile anesthetics (desflurane, enflurane, halothane, isoflurane, methoxyflurane, sevoflurane) and/or succinylcholine as compared to patients with the GAG/GAG genotype. Other genetic or clinical factors may also influence the risk for malignant hyperthermia.",
"drugs": [
{"drugFromSource": "succinylcholine"}
],
"pgxCategory": "toxicity",
"phenotypeText": "Malignant Hyperthermia",
"phenotypeFromSourceId": "Orphanet_423"
}
Other examples can be found in the tests, though keep in mind these may not represent real data.
graph TD
J[PharmGKB]
H[FASTA files]
E[Clinical alleles table]
A[Variant table]
D[Generate 'chr_pos_ref_allele1,allele2' identifier]
S[NCBI Genome Assembly]
J --> A
J --> E
S --> H
A --> |locations 'Chr+Pos'| D
H --> |Reference + context| D
E --> |Alternate alleles| D
The drugs
property is a list of structs with 2 keys:
drugFromSource
: name of the drug from PGKBdrugId
: CHEMBL ID, left empty in this pipeline but populated by Open Targets
Lists of drugs are kept together (rather than exploded into separate evidence strings) when they're known to be annotated as a drug combination.
Currently this is only when they're /
-separated and associated with a single PGKB chemical ID, as in ivacaftor / lumacaftor.