Generate chr_pos_ref_alt variant identifiers using reference assembly #5

apriltuesday · 2023-04-24T14:06:50Z

Necessary to get context bases for deletions, also removes need for API calls to Ensembl.

apriltuesday · 2023-08-09T08:28:13Z

Current idea is to use PGKB's variant table which provides locations (RefSeq chromosome + position) for all rsIDs in PGKB data. For example these are strings like NC_000001.11:46399999 or sometimes NC_000019.10:38499646_38499648.

Then we should be able to get the reference and context bases from the FASTA file, and alternate allele(s) from the genotypes information in the clinical alleles table.

Quick check that parsing variant locations will be straightforward:

variants = read_tsv_to_df('variants.tsv')

# 6643 total
len(variants)

# 6645 non-empty
variants['Location'].count()

# only the 8 empty locations
variants[~variants['Location'].str.match('NC_[0-9.]+:[0-9]+(_[0-9]+)?', na=False)]

# 38 with a range
variants[variants['Location'].str.count('_') > 1].dropna()

apriltuesday · 2023-08-09T12:46:44Z

BTW, of the 8 variants missing locations, several are deprecated or merged, but a couple are valid current RS that we could get locations for from Ensembl or NCBI.

apriltuesday · 2023-08-10T15:01:41Z

Should also confirm what we should do for variants like rs10420097 (dbSNP report, PGKB annotation). Should this one have two identifiers corresponding to the two alternate alleles that have annotations, i.e. 19_57633193_A_C and 19_57633193_A_G?

And then do we explode on the identifier? And what should we do about the genotypes?

apriltuesday · 2023-08-16T09:26:20Z

Note on ranges in locations:

Logically I thought this should mean the indicated range is deleted from the reference, but I guess this was my mistake for trying to apply logic to variant descriptions... Some of them do line up in this way, for some the lengths match up but I don't see how they align to the actual reference, and for others I'm just clueless. Here are some examples:

Variant	Location	Genotype/Allele
rs36120609	NC_000001.11:109737631_109737635	TC/TC
rs36120609	NC_000001.11:109737631_109737635	TC/TCCTC
rs36120609	NC_000001.11:109737631_109737635	TCCTC/TCCTC
rs201279313	NC_000004.12:127735888_127735890	TTA/TTA
rs201279313	NC_000004.12:127735888_127735890	TTA/del
rs201279313	NC_000004.12:127735888_127735890	del/del
rs35068180	NC_000011.10:102845217_102845220	A/del
rs35068180	NC_000011.10:102845217_102845220	AA
rs35068180	NC_000011.10:102845217_102845220	del/del
rs746071566	NC_000013.11:48037783_48037801	GGAGTC/GGAGTC
rs746071566	NC_000013.11:48037783_48037801	GGAGTC/del
rs746071566	NC_000013.11:48037783_48037801	del/del

Thoughts on this @tcezard @M-casado? I've dumped all the variants with range locations in a spreadsheet here in case it's useful.

tcezard · 2023-08-17T11:59:40Z

I've had a look at some of your example to try to see if there are enough of a pattern to find the algorithm. I'm posting the example for now:
rs36120609

REF: 31  35 
      |   |
      TCCTC

G1:TC/TC
      TC---
      TC---
G2:TC/TCCTC
      TC---
      TCCTC
G3:TCCTC/TCCTC
      TCCTC
      TCCTC

rs201279313 is set the start to 127735888 in PharmGKB but dbsnp displays it two bases earlier

REF: 86  90
      |   |
      TTATT

G1:TTA/TTA
      TTATT
      TTATT
G2:TTA/del
      TTATT
      ---TT
G3:del/del
      ---TT
      ---TT

rs35068180 is set the end to 102845220 in PharmGKB but dbsnp displays it finishing on base after also in the second genotype the alleles are not separated by slash

REF: 17  21
      |   |
      AAAAA
G1:A/del
      AAAAA
      AAAA-
G2:AA
      AAAAA
      AAAAA
G3:del/del
      AAAA-
      AAAA-

rs11322783

REF: 1415
      || 
      TT
G1:G/TT
      G-
      TT
G2:GG
      G-
      G-
G3:TT/TT
      TT
      TT

apriltuesday · 2023-08-17T14:32:42Z

Discussed a bit offline... I'll email PGKB to ask about how they represent the variant location. For now I can implement a (relatively) simple algorithm assuming the location specifies the reference allele, and skip & report if it's not found in the genotypes. This works for the first & last of Tim's examples, but not the other two.

M-casado · 2023-08-18T10:33:40Z

Took me some time to get a bit familiar with these files and formats, but here's my take on the issue.
As a summary, what you plan to do is the following, right?

graph TD
    J[PharmGKB]
    H[FASTA files]
    E[Clinival alleles table]
    A[Variant table]    
    D[Generate 'chr_pos_ref_alt' identifier]
    S[NCBI Genome Assembly]
    J --> A
    J --> E
    S --> H
    A --> |locations 'Chr+Pos'| D
    H --> |Reference + context| D
    E --> |Alternate alleles| D

Re. #5 (comment)

I parsed as well the variants.tsv file, and I think the numbers increased a bit since your last parsing (or the version you got), I think in the count there were some mixed numbers (6645 non-empty and 6643 total) 😅:

import pandas as pd
df = pd.read_table('variants.tsv')
total_rows = len(df)
print("TOTAL:\t", total_rows)
non_empty_counts = df.count()
columns_with_missing_values = non_empty_counts[non_empty_counts < total_rows]
print(columns_with_missing_values)

TOTAL:	 6797
Gene IDs        6151
Gene Symbols    6151
Location        6788

The only change is that there's one more row with missing location.

Re. #5 (comment)

Correct me if I'm wrong, but given the way that we treat evidence strings normally (i.e. separating by variant), I think exploding by the chr_pos_ref_alt identifier would make sense. Especially if we have different clinical annotations for each, although I doubt it. For the example you gave (rs10420097), the clinical count seems to be 1:

Variant ID	Variant Name	Gene IDs	Gene Symbols	Location	Variant Annotation count	Clinical Annotation count	Level 1/2 Clinical Annotation count	Guideline Annotation count	Label Annotation count	Synonyms
PA166176938	rs10420097	PA37582	ZNF211	NC_000019.10:57633193	1	1	0	0	0	NC_000019.10:g.57633193A>C, NC_000019.10:g.57633193A>C, ...

With regards to the rest of the evidence strings, we don't explode by genotype, do we?

Re. #5 (comment)

I checked a few examples like Tim and so far:

rs11322783 - Interpretable range

REF: 14 15
      |  | 
      T  T
G1:G/TT
      G-
      TT
G2:GG
      G-
      G-
G3:TT/TT
      TT
      TT

rs746071566 - Interpretable range

chr13:48037783-48037801 (GRCh38.p14)
REF: 782 783                801
       |   |                |
       G   GAGTCGGAGTCGGAGTCG
A1: delGAGTCG
       G   GAGTCGGAGTCG------
A2: dupGAGTCG
     782 783                 801   807
       |   |                 |     |
       G   GGAGTCGGAGTCGGAGTCGGAGTCG
A3: dup(GAGTCG)₂
     782 783                 801   807   813
       |   |                 |     |     |
       G   GGAGTCGGAGTCGGAGTCGGAGTCGGAGTCG

rs35068180 - Interpretable range

chr11:102845217-102845221
REF: ..217 ..221
         |     |
         A AAA A
G1:delA
     ..217 ..221
         |     |
         A AAA -
G2:dupA
     ..217   ..222
         |       |
         A AAA A A
G3:dupAA
     ..217    ..223
         |        |
         A AAA A AA

Regarding what @tcezard said:

rs201279313 is set the start to 127735888 in PharmGKB but dbsnp displays it two bases earlier

The reason in this case is that the sequence is palindromic, and both PGKB and dbSNP are correct (i.e. resulting sequence change is TTATT --> TT):

chr4:127735886-127735890
REF: 86    90
      |     |
      T TAT T
      
- dbSNP: delATT
     86    90
      |     |
      T T-- -
      
- PGKB: delTTA
     86    90
      |     |
      - --T T

It has to do with where they took these variants from. In the variant info for this rsID (PA166156703), you can see that the source is PGKB or dbSNP:

rs35068180 is set the end to 102845220 in PharmGKB but dbsnp displays it finishing on base after also in the second genotype the alleles are not separated by slash

I haven't checked it in detail, but seeing how it is all Adenines, I assume it's a similar case.

tcezard · 2023-08-18T11:25:48Z

Great find @M-casado. That makes sense.
rs201279313 is defined as TTA > - in GRCh37 by PharmGKB and this is the value that is being provided in clinical ann alleles.tsv table
It is then combine with the position of the variants.tsv table which provide the GRCh38 position.

That's why there is a disconnect between the position and the alleles found.
I'm wondering if there is a way to detect if the allele has been define in a different assembly.

apriltuesday · 2023-08-18T11:28:47Z

Thanks Marcos, taking your points in order:

The diagram is beautiful 🎨 and exactly the approach I was taking. I might steal it for the readme actually...
You're totally right that the data I'm using is out of date, it's actually from September 2022 (which gives you an idea of how much the data is changing in a year...)
We are currently exploding the evidence by genotype, there are not multiple clinical annotation IDs per genotype but there are different phenotype strings (i.e. these) that contain relevant information (which we're not parsing, but OT might). Tim has also pointed out that the variant identifier OT uses is allele-centric but the various annotations in PGKB are genotype-centric. I think I'll create a separate issue for this because it needs some discussion.
I agree that the ranges are generally interpretable (though I think in your examples you're checking the locations/alleles from dbSNP, which I feel are more consistent than in PharmGKB). The question is whether we can algorithmically translate them into consistent chr_pos_ref_alt identifiers.

M-casado · 2023-08-18T11:59:19Z

Glad to know it helped.

@apriltuesday, feel free to steal the diagram :)
If we are indeed exploding by genotype, I may need more details to understand how, because I always took the uniqueness tuple of an evidence string that I wrote here years ago, and I don't see how it contains any information about the pair of alleles, rather than a single allele.
True, I was using the rsIDs, but also I think I checked them from within the spreadsheet you provided. Perhaps I should double check. And I guess we cannot disregard PGMKB as the source because even if we took the coordinates and context from NCBI, the link between them and the annotation wouldn't be straightforward to make (?)

apriltuesday · 2023-08-21T08:28:01Z

On exploding by genotype, here are some examples, there are also examples in the test output but I realise that's really hard to read in github...

You're absolutely correct that the uniqueness checks will need to be different from PharmGKB as opposed to ClinVar, and we should both document those and implement the duplication checks in this repo.

apriltuesday · 2023-08-22T10:30:30Z

@M-casado @tcezard Did a complete run using the code in the PR, here are all the variants where the reference (determined by location) was not found among the alleles annotated by PGKB:

Variant	Location	Reference	Alleles
rs10170310	NC_000002.12:138521352	G	C,A
rs1048943	NC_000015.10:74720644	T	A,G
rs111618861	NC_000008.11:56131824_56131828	CAAAAA	C,CA
rs11280056	NC_000018.10:673444	T	TTAAAGTTA,TTA
rs144854329	NC_000006.12:43768679_43768698	TGGTCCCACTCTTCCCACAGG	T,TGGTCCCACTCTTCCCACA
rs17885382	NC_000006.12:32584318	C	T,A
rs201279313	NC_000004.12:127735888_127735890	TATT	T,TTTA
rs2032582	NC_000007.14:87531302	A	T,C
rs2228171	NC_000002.12:169196995	C	A,G
rs2298383	NC_000022.11:24429543	C	A,G
rs28364032	NC_000017.11:45834976	G	T,C
rs35068180	NC_000011.10:102845217_102845220	GAAAA	GA,G
rs57064725	NC_000015.10:78540694	C	A,G
rs61767072	NC_000004.12:3767577_3767588	GGGCCGGGGGCGG	GGGGGCGGGGCCG,G
rs61824877	NC_000001.11:200273504	G	T,A
rs700518	NC_000015.10:51236915	T	C,A
rs71486745	NC_000010.11:94936018_94936021	CTGTG	C,CGT
rs72549309	NC_000001.11:97740411_97740418	GATGAATGA	G,GATGA
rs746071566	NC_000013.11:48037783_48037801	AGGAGTCGGAGTCGGAGTCG	A,AGGAGTC

I just checked the first one rs10170310 and that one's all correct, the annotation really does not contain the reference allele.

Also summary stats from the full run if you're curious:

Total clinical annotations: 5073
        With RS: 4477 (88.25%)
                1. Exploded by allele: 13497 (3.0x)
                2. Exploded by drug: 18830 (1.4x)
                3. Exploded by phenotype: 23086 (1.2x)
Total evidence strings: 25401
        With CHEBI: 21154 (83.28%)
        With EFO phenotype: 8381 (32.99%)
        With functional consequence: 23271 (91.61%)
        With VEP gene: 23271 (91.61%)
Gene comparisons per annotation
        With PGKB genes: 4220 (83.19%)
        With VEP genes: 4093 (80.68%)
        PGKB genes != VEP genes: 774 (15.26%)

M-casado · 2023-08-23T17:37:09Z

Great work, @apriltuesday. The explosion stats are very handy to know what increments the evidence strings.

Re. the variants, what was the problem of not having the reference annotated in PGKB? I can't remember right now if that was a bottleneck at some step. Wouldn't we simply not have an evidence string for the referenced, but be still able to annotate perfectly the alternative alleles?

apriltuesday · 2023-08-24T07:27:36Z

It's because we can't tell if reference not being in PGKB means we're misinterpreting the coordinates / there's an error in the data (as in rs201279313), or the reference truly is not being annotated (e.g. rs10170310).

I think at this point we need to figure out which are definitely in the first category, so we can send PGKB an email with some examples and ask for clarification. I've added them in a tab to the spreadsheet so we can chip away at them together, there's not too many so it's hopefully not a huge task.

apriltuesday · 2023-08-24T08:19:33Z

Whoops, probably shouldn't have closed this with the PR...

M-casado · 2023-08-24T08:31:17Z

I assume that rs201279313 and similar cases won't have as much impact, given their number and the fact that the outcome of the variant seems to be the same. We do need to handle them carefully, like you said, to get their reference if we are trusting the coordinates.

apriltuesday · 2023-08-25T15:20:14Z

I've made a first pass at this here and added my comments, if you want to take a look and add your thoughts (and check my work...)

apriltuesday · 2024-04-22T12:38:44Z

Closing this, as after #37 the only variants without identifiers are repeats (#17) and mitochondrial (#38).

apriltuesday mentioned this issue Aug 15, 2023

Generate chr_pos_ref_alt variant identifiers using reference assembly #15

Merged

apriltuesday mentioned this issue Aug 21, 2023

Handle multiple alt alleles #18

Closed

apriltuesday closed this as completed in #15 Aug 24, 2023

apriltuesday reopened this Aug 24, 2023

apriltuesday mentioned this issue Mar 15, 2024

Compare genes from VEP and genes from PGKB #21

Closed

apriltuesday closed this as completed Apr 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Generate chr_pos_ref_alt variant identifiers using reference assembly #5

Generate chr_pos_ref_alt variant identifiers using reference assembly #5

apriltuesday commented Apr 24, 2023

apriltuesday commented Aug 9, 2023

apriltuesday commented Aug 9, 2023 •

edited

Loading

apriltuesday commented Aug 10, 2023 •

edited

Loading

apriltuesday commented Aug 16, 2023

tcezard commented Aug 17, 2023

apriltuesday commented Aug 17, 2023

M-casado commented Aug 18, 2023 •

edited

Loading

tcezard commented Aug 18, 2023

apriltuesday commented Aug 18, 2023

M-casado commented Aug 18, 2023

apriltuesday commented Aug 21, 2023

apriltuesday commented Aug 22, 2023

M-casado commented Aug 23, 2023

apriltuesday commented Aug 24, 2023

apriltuesday commented Aug 24, 2023

M-casado commented Aug 24, 2023

apriltuesday commented Aug 25, 2023

apriltuesday commented Apr 22, 2024

Generate chr_pos_ref_alt variant identifiers using reference assembly #5

Generate chr_pos_ref_alt variant identifiers using reference assembly #5

Comments

apriltuesday commented Apr 24, 2023

apriltuesday commented Aug 9, 2023

apriltuesday commented Aug 9, 2023 • edited Loading

apriltuesday commented Aug 10, 2023 • edited Loading

apriltuesday commented Aug 16, 2023

tcezard commented Aug 17, 2023

apriltuesday commented Aug 17, 2023

M-casado commented Aug 18, 2023 • edited Loading

Re. #5 (comment)

Re. #5 (comment)

Re. #5 (comment)

tcezard commented Aug 18, 2023

apriltuesday commented Aug 18, 2023

M-casado commented Aug 18, 2023

apriltuesday commented Aug 21, 2023

apriltuesday commented Aug 22, 2023

M-casado commented Aug 23, 2023

apriltuesday commented Aug 24, 2023

apriltuesday commented Aug 24, 2023

M-casado commented Aug 24, 2023

apriltuesday commented Aug 25, 2023

apriltuesday commented Apr 22, 2024

apriltuesday commented Aug 9, 2023 •

edited

Loading

apriltuesday commented Aug 10, 2023 •

edited

Loading

M-casado commented Aug 18, 2023 •

edited

Loading