Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generate chr_pos_ref_alt variant identifiers using reference assembly #5

Closed
apriltuesday opened this issue Apr 24, 2023 · 18 comments · Fixed by #15
Closed

Generate chr_pos_ref_alt variant identifiers using reference assembly #5

apriltuesday opened this issue Apr 24, 2023 · 18 comments · Fixed by #15

Comments

@apriltuesday
Copy link
Collaborator

Necessary to get context bases for deletions, also removes need for API calls to Ensembl.

@apriltuesday
Copy link
Collaborator Author

Current idea is to use PGKB's variant table which provides locations (RefSeq chromosome + position) for all rsIDs in PGKB data. For example these are strings like NC_000001.11:46399999 or sometimes NC_000019.10:38499646_38499648.

Then we should be able to get the reference and context bases from the FASTA file, and alternate allele(s) from the genotypes information in the clinical alleles table.

Quick check that parsing variant locations will be straightforward:

variants = read_tsv_to_df('variants.tsv')

# 6643 total
len(variants)

# 6645 non-empty
variants['Location'].count()

# only the 8 empty locations
variants[~variants['Location'].str.match('NC_[0-9.]+:[0-9]+(_[0-9]+)?', na=False)]

# 38 with a range
variants[variants['Location'].str.count('_') > 1].dropna()

@apriltuesday
Copy link
Collaborator Author

apriltuesday commented Aug 9, 2023

BTW, of the 8 variants missing locations, several are deprecated or merged, but a couple are valid current RS that we could get locations for from Ensembl or NCBI.

rs1799735
rs2266637
rs29000568
rs35320474
rs36056065
rs3834935
rs4630
rs58425034

@apriltuesday
Copy link
Collaborator Author

apriltuesday commented Aug 10, 2023

Should also confirm what we should do for variants like rs10420097 (dbSNP report, PGKB annotation). Should this one have two identifiers corresponding to the two alternate alleles that have annotations, i.e. 19_57633193_A_C and 19_57633193_A_G?

And then do we explode on the identifier? And what should we do about the genotypes?

@apriltuesday
Copy link
Collaborator Author

Note on ranges in locations:

Logically I thought this should mean the indicated range is deleted from the reference, but I guess this was my mistake for trying to apply logic to variant descriptions... Some of them do line up in this way, for some the lengths match up but I don't see how they align to the actual reference, and for others I'm just clueless. Here are some examples:

Variant Location Genotype/Allele
rs36120609 NC_000001.11:109737631_109737635 TC/TC
rs36120609 NC_000001.11:109737631_109737635 TC/TCCTC
rs36120609 NC_000001.11:109737631_109737635 TCCTC/TCCTC
rs201279313 NC_000004.12:127735888_127735890 TTA/TTA
rs201279313 NC_000004.12:127735888_127735890 TTA/del
rs201279313 NC_000004.12:127735888_127735890 del/del
rs35068180 NC_000011.10:102845217_102845220 A/del
rs35068180 NC_000011.10:102845217_102845220 AA
rs35068180 NC_000011.10:102845217_102845220 del/del
rs746071566 NC_000013.11:48037783_48037801 GGAGTC/GGAGTC
rs746071566 NC_000013.11:48037783_48037801 GGAGTC/del
rs746071566 NC_000013.11:48037783_48037801 del/del

Thoughts on this @tcezard @M-casado? I've dumped all the variants with range locations in a spreadsheet here in case it's useful.

@tcezard
Copy link
Member

tcezard commented Aug 17, 2023

I've had a look at some of your example to try to see if there are enough of a pattern to find the algorithm. I'm posting the example for now:
rs36120609

REF: 31  35 
      |   |
      TCCTC

G1:TC/TC
      TC---
      TC---
G2:TC/TCCTC
      TC---
      TCCTC
G3:TCCTC/TCCTC
      TCCTC
      TCCTC

rs201279313 is set the start to 127735888 in PharmGKB but dbsnp displays it two bases earlier

REF: 86  90
      |   |
      TTATT

G1:TTA/TTA
      TTATT
      TTATT
G2:TTA/del
      TTATT
      ---TT
G3:del/del
      ---TT
      ---TT

rs35068180 is set the end to 102845220 in PharmGKB but dbsnp displays it finishing on base after also in the second genotype the alleles are not separated by slash

REF: 17  21
      |   |
      AAAAA
G1:A/del
      AAAAA
      AAAA-
G2:AA
      AAAAA
      AAAAA
G3:del/del
      AAAA-
      AAAA-

rs11322783

REF: 1415
      || 
      TT
G1:G/TT
      G-
      TT
G2:GG
      G-
      G-
G3:TT/TT
      TT
      TT

@apriltuesday
Copy link
Collaborator Author

Discussed a bit offline... I'll email PGKB to ask about how they represent the variant location. For now I can implement a (relatively) simple algorithm assuming the location specifies the reference allele, and skip & report if it's not found in the genotypes. This works for the first & last of Tim's examples, but not the other two.

@M-casado
Copy link
Collaborator

M-casado commented Aug 18, 2023

Took me some time to get a bit familiar with these files and formats, but here's my take on the issue.
As a summary, what you plan to do is the following, right?

graph TD
    J[PharmGKB]
    H[FASTA files]
    E[Clinival alleles table]
    A[Variant table]    
    D[Generate 'chr_pos_ref_alt' identifier]
    S[NCBI Genome Assembly]
    J --> A
    J --> E
    S --> H
    A --> |locations 'Chr+Pos'| D
    H --> |Reference + context| D
    E --> |Alternate alleles| D
Loading

Re. #5 (comment)

I parsed as well the variants.tsv file, and I think the numbers increased a bit since your last parsing (or the version you got), I think in the count there were some mixed numbers (6645 non-empty and 6643 total) 😅:

import pandas as pd
df = pd.read_table('variants.tsv')
total_rows = len(df)
print("TOTAL:\t", total_rows)
non_empty_counts = df.count()
columns_with_missing_values = non_empty_counts[non_empty_counts < total_rows]
print(columns_with_missing_values)
TOTAL:	 6797
Gene IDs        6151
Gene Symbols    6151
Location        6788

The only change is that there's one more row with missing location.

Re. #5 (comment)

Correct me if I'm wrong, but given the way that we treat evidence strings normally (i.e. separating by variant), I think exploding by the chr_pos_ref_alt identifier would make sense. Especially if we have different clinical annotations for each, although I doubt it. For the example you gave (rs10420097), the clinical count seems to be 1:

Variant ID Variant Name Gene IDs Gene Symbols Location Variant Annotation count Clinical Annotation count Level 1/2 Clinical Annotation count Guideline Annotation count Label Annotation count Synonyms
PA166176938 rs10420097 PA37582 ZNF211 NC_000019.10:57633193 1 1 0 0 0 NC_000019.10:g.57633193A>C, NC_000019.10:g.57633193A>C, ...

With regards to the rest of the evidence strings, we don't explode by genotype, do we?

Re. #5 (comment)

I checked a few examples like Tim and so far:

REF: 14 15
      |  | 
      T  T
G1:G/TT
      G-
      TT
G2:GG
      G-
      G-
G3:TT/TT
      TT
      TT
chr13:48037783-48037801 (GRCh38.p14)
REF: 782 783                801
       |   |                |
       G   GAGTCGGAGTCGGAGTCG
A1: delGAGTCG
       G   GAGTCGGAGTCG------
A2: dupGAGTCG
     782 783                 801   807
       |   |                 |     |
       G   GGAGTCGGAGTCGGAGTCGGAGTCG
A3: dup(GAGTCG)₂
     782 783                 801   807   813
       |   |                 |     |     |
       G   GGAGTCGGAGTCGGAGTCGGAGTCGGAGTCG
chr11:102845217-102845221
REF: ..217 ..221
         |     |
         A AAA A
G1:delA
     ..217 ..221
         |     |
         A AAA -
G2:dupA
     ..217   ..222
         |       |
         A AAA A A
G3:dupAA
     ..217    ..223
         |        |
         A AAA A AA

Regarding what @tcezard said:

rs201279313 is set the start to 127735888 in PharmGKB but dbsnp displays it two bases earlier

The reason in this case is that the sequence is palindromic, and both PGKB and dbSNP are correct (i.e. resulting sequence change is TTATT --> TT):

chr4:127735886-127735890
REF: 86    90
      |     |
      T TAT T
      
- dbSNP: delATT
     86    90
      |     |
      T T-- -
      
- PGKB: delTTA
     86    90
      |     |
      - --T T

It has to do with where they took these variants from. In the variant info for this rsID (PA166156703), you can see that the source is PGKB or dbSNP:
image

rs35068180 is set the end to 102845220 in PharmGKB but dbsnp displays it finishing on base after also in the second genotype the alleles are not separated by slash

I haven't checked it in detail, but seeing how it is all Adenines, I assume it's a similar case.

@tcezard
Copy link
Member

tcezard commented Aug 18, 2023

Great find @M-casado. That makes sense.
rs201279313 is defined as TTA > - in GRCh37 by PharmGKB and this is the value that is being provided in clinical ann alleles.tsv table
It is then combine with the position of the variants.tsv table which provide the GRCh38 position.

That's why there is a disconnect between the position and the alleles found.
I'm wondering if there is a way to detect if the allele has been define in a different assembly.

@apriltuesday
Copy link
Collaborator Author

Thanks Marcos, taking your points in order:

  1. The diagram is beautiful 🎨 and exactly the approach I was taking. I might steal it for the readme actually...
  2. You're totally right that the data I'm using is out of date, it's actually from September 2022 (which gives you an idea of how much the data is changing in a year...)
  3. We are currently exploding the evidence by genotype, there are not multiple clinical annotation IDs per genotype but there are different phenotype strings (i.e. these) that contain relevant information (which we're not parsing, but OT might). Tim has also pointed out that the variant identifier OT uses is allele-centric but the various annotations in PGKB are genotype-centric. I think I'll create a separate issue for this because it needs some discussion.
  4. I agree that the ranges are generally interpretable (though I think in your examples you're checking the locations/alleles from dbSNP, which I feel are more consistent than in PharmGKB). The question is whether we can algorithmically translate them into consistent chr_pos_ref_alt identifiers.

@M-casado
Copy link
Collaborator

Glad to know it helped.

  1. @apriltuesday, feel free to steal the diagram :)
  2. If we are indeed exploding by genotype, I may need more details to understand how, because I always took the uniqueness tuple of an evidence string that I wrote here years ago, and I don't see how it contains any information about the pair of alleles, rather than a single allele.
  3. True, I was using the rsIDs, but also I think I checked them from within the spreadsheet you provided. Perhaps I should double check. And I guess we cannot disregard PGMKB as the source because even if we took the coordinates and context from NCBI, the link between them and the annotation wouldn't be straightforward to make (?)

@apriltuesday
Copy link
Collaborator Author

On exploding by genotype, here are some examples, there are also examples in the test output but I realise that's really hard to read in github...

You're absolutely correct that the uniqueness checks will need to be different from PharmGKB as opposed to ClinVar, and we should both document those and implement the duplication checks in this repo.

@apriltuesday
Copy link
Collaborator Author

@M-casado @tcezard Did a complete run using the code in the PR, here are all the variants where the reference (determined by location) was not found among the alleles annotated by PGKB:

Variant Location Reference Alleles
rs10170310 NC_000002.12:138521352 G C,A
rs1048943 NC_000015.10:74720644 T A,G
rs111618861 NC_000008.11:56131824_56131828 CAAAAA C,CA
rs11280056 NC_000018.10:673444 T TTAAAGTTA,TTA
rs144854329 NC_000006.12:43768679_43768698 TGGTCCCACTCTTCCCACAGG T,TGGTCCCACTCTTCCCACA
rs17885382 NC_000006.12:32584318 C T,A
rs201279313 NC_000004.12:127735888_127735890 TATT T,TTTA
rs2032582 NC_000007.14:87531302 A T,C
rs2228171 NC_000002.12:169196995 C A,G
rs2298383 NC_000022.11:24429543 C A,G
rs28364032 NC_000017.11:45834976 G T,C
rs35068180 NC_000011.10:102845217_102845220 GAAAA GA,G
rs57064725 NC_000015.10:78540694 C A,G
rs61767072 NC_000004.12:3767577_3767588 GGGCCGGGGGCGG GGGGGCGGGGCCG,G
rs61824877 NC_000001.11:200273504 G T,A
rs700518 NC_000015.10:51236915 T C,A
rs71486745 NC_000010.11:94936018_94936021 CTGTG C,CGT
rs72549309 NC_000001.11:97740411_97740418 GATGAATGA G,GATGA
rs746071566 NC_000013.11:48037783_48037801 AGGAGTCGGAGTCGGAGTCG A,AGGAGTC

I just checked the first one rs10170310 and that one's all correct, the annotation really does not contain the reference allele.

Also summary stats from the full run if you're curious:

Total clinical annotations: 5073
        With RS: 4477 (88.25%)
                1. Exploded by allele: 13497 (3.0x)
                2. Exploded by drug: 18830 (1.4x)
                3. Exploded by phenotype: 23086 (1.2x)
Total evidence strings: 25401
        With CHEBI: 21154 (83.28%)
        With EFO phenotype: 8381 (32.99%)
        With functional consequence: 23271 (91.61%)
        With VEP gene: 23271 (91.61%)
Gene comparisons per annotation
        With PGKB genes: 4220 (83.19%)
        With VEP genes: 4093 (80.68%)
        PGKB genes != VEP genes: 774 (15.26%)

@M-casado
Copy link
Collaborator

Great work, @apriltuesday. The explosion stats are very handy to know what increments the evidence strings.

Re. the variants, what was the problem of not having the reference annotated in PGKB? I can't remember right now if that was a bottleneck at some step. Wouldn't we simply not have an evidence string for the referenced, but be still able to annotate perfectly the alternative alleles?

@apriltuesday
Copy link
Collaborator Author

It's because we can't tell if reference not being in PGKB means we're misinterpreting the coordinates / there's an error in the data (as in rs201279313), or the reference truly is not being annotated (e.g. rs10170310).

I think at this point we need to figure out which are definitely in the first category, so we can send PGKB an email with some examples and ask for clarification. I've added them in a tab to the spreadsheet so we can chip away at them together, there's not too many so it's hopefully not a huge task.

@apriltuesday
Copy link
Collaborator Author

Whoops, probably shouldn't have closed this with the PR...

@apriltuesday apriltuesday reopened this Aug 24, 2023
@M-casado
Copy link
Collaborator

I assume that rs201279313 and similar cases won't have as much impact, given their number and the fact that the outcome of the variant seems to be the same. We do need to handle them carefully, like you said, to get their reference if we are trusting the coordinates.

@apriltuesday
Copy link
Collaborator Author

I've made a first pass at this here and added my comments, if you want to take a look and add your thoughts (and check my work...)

@apriltuesday
Copy link
Collaborator Author

Closing this, as after #37 the only variants without identifiers are repeats (#17) and mitochondrial (#38).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants