Generate chr_pos_ref_alt variant identifiers using reference assembly #5
Comments
Current idea is to use PGKB's variant table which provides locations (RefSeq chromosome + position) for all rsIDs in PGKB data. For example these are strings like Then we should be able to get the reference and context bases from the FASTA file, and alternate allele(s) from the genotypes information in the clinical alleles table. Quick check that parsing variant locations will be straightforward:
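For illustration, a minimal sketch of what parsing these locations might look like. The string format here is an assumption (a RefSeq accession, a colon, then either a single position or a `start_end` range); the actual PGKB strings are not reproduced in this thread, and `Location` and `parse_location` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# ASSUMED format: "<RefSeq accession>:<start>" or "<RefSeq accession>:<start>_<end>".
# Adjust the pattern once the real PGKB location strings have been checked.
LOCATION_RE = re.compile(
    r"^(?P<chrom>NC_\d+\.\d+):(?P<start>\d+)(?:_(?P<end>\d+))?$"
)


class Location(NamedTuple):
    chrom: str  # RefSeq chromosome accession
    start: int  # 1-based start position
    end: int    # 1-based end position (equal to start for single positions)


def parse_location(location: str) -> Optional[Location]:
    """Parse a location string, returning None if it does not match."""
    match = LOCATION_RE.match(location.strip())
    if match is None:
        return None
    start = int(match.group("start"))
    end = int(match.group("end")) if match.group("end") else start
    return Location(match.group("chrom"), start, end)
```

Returning `None` instead of raising makes it easy to count and report the rows that have missing or unparseable locations.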
|
BTW, of the 8 variants missing locations, several are deprecated or merged, but a couple are valid, current rsIDs for which we could get locations from Ensembl or NCBI.
|
Should also confirm what we should do for variants like And then do we explode on the identifier? And what should we do about the genotypes? |
Note on ranges in locations: Logically I thought this should mean the indicated range is deleted from the reference, but I guess this was my mistake for trying to apply logic to variant descriptions... Some of them do line up in this way, for some the lengths match up but I don't see how they align to the actual reference, and for others I'm just clueless. Here are some examples:
Thoughts on this @tcezard @M-casado? I've dumped all the variants with range locations in a spreadsheet here in case it's useful. |
I've had a look at some of your examples to see whether there is enough of a pattern to work out the algorithm. I'm posting the examples for now:
For rs201279313, PharmGKB sets the start to 127735888, but dbSNP displays it two bases earlier.
For rs35068180, PharmGKB sets the end to 102845220, but dbSNP displays it finishing one base later. Also, in the second genotype the alleles are not separated by a slash.
|
Discussed a bit offline... I'll email PGKB to ask about how they represent the variant location. For now I can implement a (relatively) simple algorithm assuming the location specifies the reference allele, and skip & report if it's not found in the genotypes. This works for the first & last of Tim's examples, but not the other two. |
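A rough sketch of the "assume the location gives the reference allele, skip and report otherwise" idea. It is not the code in the PR. It assumes genotypes are slash-separated allele strings such as `A/G`, and `make_identifiers` is a hypothetical name.

```python
from typing import List, Optional, Set


def parse_genotype_alleles(genotypes: List[str]) -> Set[str]:
    """Collect the distinct alleles from slash-separated genotypes (ASSUMED format)."""
    alleles: Set[str] = set()
    for genotype in genotypes:
        alleles.update(a.strip().upper() for a in genotype.split("/") if a.strip())
    return alleles


def make_identifiers(
    chrom: str, pos: int, ref: str, genotypes: List[str]
) -> Optional[List[str]]:
    """Build chr_pos_ref_alt identifiers, or return None if ref is not annotated.

    `ref` is the allele read from the reference assembly at the variant location.
    A None result means the variant should be skipped and reported.
    """
    ref = ref.upper()
    alleles = parse_genotype_alleles(genotypes)
    if ref not in alleles:
        return None
    # Every annotated allele other than the reference is treated as an alternate.
    return [f"{chrom}_{pos}_{ref}_{alt}" for alt in sorted(alleles - {ref})]
```

For example, `make_identifiers("10", 100, "A", ["A/A", "A/G", "G/G"])` gives a single identifier for the `G` allele, while a reference of `C` against the same genotypes returns `None` and would go into the report.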
Took me some time to get a bit familiar with these files and formats, but here's my take on the issue.
graph TD
    J[PharmGKB]
    H[FASTA files]
    E[Clinical alleles table]
    A[Variant table]
    D[Generate 'chr_pos_ref_alt' identifier]
    S[NCBI Genome Assembly]
    J --> A
    J --> E
    S --> H
    A --> |locations 'Chr+Pos'| D
    H --> |Reference + context| D
    E --> |Alternate alleles| D
Re. #5 (comment): I also parsed the
The only change is that there's one more row with missing location.
Re. #5 (comment): Correct me if I'm wrong, but given the way that we treat evidence strings normally (i.e. separating by variant), I think exploding by the
With regards to the rest of the evidence strings, we don't explode by genotype, do we?
Re. #5 (comment): I checked a few examples like Tim and so far:
Regarding what @tcezard said:
The reason in this case is that the sequence is palindromic, and both PGKB and dbSNP are correct (i.e. resulting sequence change is
It has to do with where they took these variants from. In the variant info for this rsID (PA166156703), you can see that the source is PGKB or dbSNP:
I haven't checked it in detail, but seeing how it is all Adenines, I assume it's a similar case. |
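To make the "both positions are correct" point concrete, here is a toy sketch. The sequence is made up and is not the real context of rs201279313 or rs35068180. It only shows that, inside a repetitive stretch, deleting a range at shifted coordinates can produce the same resulting sequence even though the "reference allele" read at each position differs.

```python
def delete_range(sequence: str, start: int, end: int) -> str:
    """Delete the 1-based, inclusive range [start, end] from the sequence."""
    return sequence[: start - 1] + sequence[end:]


def allele_at(sequence: str, start: int, end: int) -> str:
    """Return the bases in the 1-based, inclusive range [start, end]."""
    return sequence[start - 1 : end]


# Toy sequence (hypothetical) containing an AT repeat.
toy = "GCATATATGC"

# Two-base deletions at three different placements within the repeat.
placements = [(3, 4), (4, 5), (5, 6)]
results = {delete_range(toy, s, e) for s, e in placements}
deleted = [allele_at(toy, s, e) for s, e in placements]
```

Here `results` contains a single sequence while `deleted` does not contain a single allele, so comparing the allele at the stated location against the annotated genotypes can fail even when the variant described is the same.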
Great find @M-casado. That makes sense. That's why there is a disconnect between the position and the alleles found. |
Thanks Marcos, taking your points in order:
|
Glad to know it helped.
|
On exploding by genotype, here are some examples, there are also examples in the test output but I realise that's really hard to read in GitHub... You're absolutely correct that the uniqueness checks will need to be different for PharmGKB as opposed to ClinVar, and we should both document those and implement the duplication checks in this repo. |
@M-casado @tcezard Did a complete run using the code in the PR, here are all the variants where the reference (determined by location) was not found among the alleles annotated by PGKB:
I just checked the first one rs10170310 and that one's all correct, the annotation really does not contain the reference allele. Also summary stats from the full run if you're curious:
|
Great work, @apriltuesday. The explosion stats are very handy for seeing what increases the number of evidence strings. Re. the variants, what was the problem with not having the reference annotated in PGKB? I can't remember right now if that was a bottleneck at some step. Wouldn't we simply not have an evidence string for the reference, but still be able to annotate the alternative alleles perfectly? |
It's because we can't tell if reference not being in PGKB means we're misinterpreting the coordinates / there's an error in the data (as in rs201279313), or the reference truly is not being annotated (e.g. rs10170310). I think at this point we need to figure out which are definitely in the first category, so we can send PGKB an email with some examples and ask for clarification. I've added them in a tab to the spreadsheet so we can chip away at them together, there's not too many so it's hopefully not a huge task. |
Whoops, probably shouldn't have closed this with the PR... |
I assume that rs201279313 and similar cases won't have as much impact, given their number and the fact that the outcome of the variant seems to be the same. We do need to handle them carefully, like you said, to get their reference if we are trusting the coordinates. |
I've made a first pass at this here and added my comments, if you want to take a look and add your thoughts (and check my work...) |
Necessary to get context bases for deletions, also removes need for API calls to Ensembl.
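A minimal sketch of why the reference sequence is needed for deletions, assuming a VCF-style convention where the base before the deleted range is prepended as context. The function name is hypothetical, and the sequence is passed in as a plain string; in practice it would be fetched from the FASTA file.

```python
from typing import Tuple


def deletion_with_context(sequence: str, start: int, end: int) -> Tuple[int, str, str]:
    """Return (pos, ref, alt) for deleting the 1-based inclusive range [start, end].

    The base immediately before the deletion is used as the context base, so
    `pos` is start - 1, `ref` is context + deleted bases, and `alt` is the
    context base alone.
    """
    if start < 2:
        raise ValueError("no preceding base available to use as context")
    if end < start or end > len(sequence):
        raise ValueError("range falls outside the sequence")
    context = sequence[start - 2]
    deleted = sequence[start - 1 : end]
    return start - 1, context + deleted, context


# Toy example (hypothetical sequence): delete bases 3-4 on chromosome "1".
pos, ref, alt = deletion_with_context("GCATATATGC", 3, 4)
identifier = f"1_{pos}_{ref}_{alt}"
```

Without the reference there is no way to know the context base, which is why the identifier cannot be built from the PGKB tables alone.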