JBrowse2 tracks deconvolution #129

nekrut · 2024-10-10T15:10:44Z

VEuPathDB set up its JBrowse instances so that configs and tracks are generated dynamically (not typical): a REST service sits in front of the database, writes the config files, and extracts data on the fly for display in JBrowse
For sequence and genes, .gff and .fasta; not accessed
There are many, many tracks for the Plasmodium genome
PlasmoDB flat files are massive
There is a GitHub repo from which we could extract information about the other data tracks and database features; we can reverse engineer from there

nekrut · 2024-10-10T15:11:20Z

Since VEuPathDB is accessible now, can we do this via some kind of crawling, @scottcain?

nekrut · 2024-10-10T16:14:04Z

Contingent on our ability to query the db

maximilianh · 2024-10-10T16:52:26Z

Can't this be done via the jbrowser API endpoint, the one that jbrowse uses? Veupathdb staff told us how to do this...


nekrut · 2024-10-10T16:54:51Z

Can't this be done via the jbrowser API endpoint, the one that jbrowse uses? Veupathdb staff told us how to do this...

We just discussed this with @scottcain

maximilianh · 2024-10-10T16:58:50Z

Was the conclusion that it's possible using the API? (sorry I had a meeting at exactly the same time and duration as the BRC meeting today)

scottcain · 2024-10-10T19:32:09Z

@maximilianh it depends on what we want to do. Since these tracks' data are now static from VEuPathDB (that is, they aren't being actively curated, which would be a reason you might want to run the tracks from a database query), using the database to serve up JBrowse data is (in my opinion) a little obnoxious, when the alternative is something like tabix indexed GFF, BigBed or BigWig (along with associated metadata in all cases). Shifting to static files sitting in a bucket or web server somewhere would make implementing JBrowse 2 a lot easier (and presumably other genome browsers 😉), and is better for the community, since it makes getting a whole dataset for a given track a lot easier.

The next question is, could we write a spider that crawls all of the JBrowse instances at VEuPathDB and extracts all of the data? Well, yes, I suppose we could, but it would be kind of an unfriendly thing to do. Rather, I think a reasonable approach is to wait for a new hire in Sergei's group who knows this database pretty well and can (presumably) help us extract the track data and metadata from our instance of the database.
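
For the static-file alternative described in this comment, a JBrowse 2 track entry could look roughly like the output of the sketch below. It builds one track backed by a bgzipped, tabix-indexed GFF3 file and one backed by a BigBed file. The bucket URL, track IDs, and file names are hypothetical placeholders, not anything that exists for this project; the adapter names are the ones I understand JBrowse 2 to use, so check them against the JBrowse 2 config docs before relying on this.

```python
import json


def uri(url):
    """JBrowse 2 location object pointing at a remote file."""
    return {"uri": url, "locationType": "UriLocation"}


def gff3_tabix_track(track_id, name, assembly, gff_gz_url):
    """Feature track backed by a bgzipped GFF3 with a .tbi index beside it."""
    return {
        "type": "FeatureTrack",
        "trackId": track_id,
        "name": name,
        "assemblyNames": [assembly],
        "adapter": {
            "type": "Gff3TabixAdapter",
            "gffGzLocation": uri(gff_gz_url),
            "index": {"location": uri(gff_gz_url + ".tbi"), "indexType": "TBI"},
        },
    }


def bigbed_track(track_id, name, assembly, bb_url):
    """Feature track backed by a single BigBed file."""
    return {
        "type": "FeatureTrack",
        "trackId": track_id,
        "name": name,
        "assemblyNames": [assembly],
        "adapter": {"type": "BigBedAdapter", "bigBedLocation": uri(bb_url)},
    }


# Hypothetical bucket layout; nothing is actually hosted at this address.
BASE = "https://example.org/brc-tracks/pvivP01"
tracks = [
    gff3_tabix_track("genes", "Gene models", "pvivP01", f"{BASE}/genes.gff.gz"),
    bigbed_track("peaks", "Example peaks", "pvivP01", f"{BASE}/peaks.bb"),
]
print(json.dumps({"tracks": tracks}, indent=2))
```

The point of the sketch is that each track reduces to a URL plus a little metadata, which is what makes whole-dataset downloads easy for users.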

maximilianh · 2024-10-10T20:31:12Z

Yes, I meant a spider. I don't know how unfriendly that would be, depends on how fast the transfer is... but OK, if someone starts with Sergei, then I guess there is an easier way. And yes, I convert everything to bigBed.


maximilianh · 2024-10-10T20:31:24Z

Thanks for the quick reply!

d-callan · 2024-11-13T20:13:59Z

so that was fun lol.. first thoughts as i read this:

im not completely sure a spider is too unfriendly, and might be handy
wrt using the db vs flat files, the db predated jbrowse. it already existed w the data needed and flat files didnt. it also meant one source of truth for all parts of the site rather than keeping the db and flat files in sync. if they manage sustainability, they also want to move to jbrowse2 and all flat files if they have the time and resources
id be generally curious what all tracks you wanted to reproduce, and their priority.
id also be curious where else you thought you might want to use some of these data. the db is a pretty purpose built thing. its job is to run a veupath genomics site. as much as there is a lot of great stuff in there, theres also some notable stuff missing, bc what wasnt intended to be used in a search or a plot wasnt loaded. that might impact your general strategy for trying to digest veupath data, which might impact plans for this task? dunno. to give an ex though, for rnaseq experiments tpm was loaded but not counts. if you wanted to try to incorporate these data in a different context in a way that would allow something like deseq to be used on them, thatd be a problem.
related to previous i suppose, i wonder what if anything besides the db you were given?
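
To illustrate the TPM-versus-counts point above: TPM is normalized so that each sample sums to one million, so the sequencing depth is lost and raw counts cannot be reconstructed from TPM alone. The small sketch below uses made-up numbers (three hypothetical genes, two hypothetical depths) to show two different count vectors collapsing to the same TPM values; count-based tools such as DESeq2 expect the raw counts.

```python
import math


def tpm(counts, lengths_kb):
    """Transcripts per million: length-normalize, then scale to sum to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths_kb)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]


# Made-up example: three genes, the same library at 1x and 10x depth.
lengths_kb = [1.0, 2.0, 0.5]
shallow_counts = [10, 40, 5]
deep_counts = [100, 400, 50]

tpm_shallow = tpm(shallow_counts, lengths_kb)
tpm_deep = tpm(deep_counts, lengths_kb)

# Both depths give identical TPM, so depth cannot be recovered from TPM.
same = all(math.isclose(a, b) for a, b in zip(tpm_shallow, tpm_deep))
print(tpm_shallow, tpm_deep, same)
```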

scottcain · 2024-11-13T22:25:16Z

@d-callan these are solid questions. Here are my thoughts:

im not completely sure a spider is too unfriendly, and might be handy

Perhaps, but I still think it would (likely) be cleaner to extract, format and config data coming directly from the database. And of course, such a spider doesn't actually exist except in my head.

wrt using the db vs flat files, the db predated jbrowse. it already existed w the data needed and flat files didnt. it also meant one source of truth for all parts of the site rather than keeping the db and flat files in sync. if they manage sustainability, they also want to move to jbrowse2 and all flat files if they have the time and resources

Also fair--I can understand that as a design decision; on the other hand, I very much like having JBrowse backed by flat files, since it makes it much easier for users to get a genome's worth of a particular data set.

id be generally curious what all tracks you wanted to reproduce, and their priority.

My philosophy can generally be stated as "if the data exists, let's get it in JBrowse." This can lead to a discoverability issue for assemblies with lots of data, but that's not a terrible problem to have, and tools are being developed to address it.

id also be curious where else you thought you might want to use some of these data. the db is a pretty purpose built thing. its job is to run a veupath genomics site. as much as there is a lot of great stuff in there, theres also some notable stuff missing, bc what wasnt intended to be used in a search or a plot wasnt loaded. that might impact your general strategy for trying to digest veupath data, which might impact plans for this task? dunno. to give an ex though, for rnaseq experiments tpm was loaded but not counts. if you wanted to try to incorporate these data in a different context in a way that would allow something like deseq to be used on them, thatd be a problem.

It is unclear to me if there is any real use for any of the other data in the database--when we got it, we were thinking VEuPathDB was going away for good. With it still existing, we can make better use of the website than trying to build something ourselves (in my opinion, that is).

related to previous i suppose, i wonder what if anything besides the db you were given?

I am reasonably sure that we got a bunch of binary files like bigwigs and bams, but I don't know that for sure (I hope we did!). @jdavcs might know more.

scottcain · 2024-11-13T22:51:11Z

Although, it just occurred to me that, since the JBrowse instance at VEuPathDB continues to exist, we could potentially proxy (to get around CORS issues) the paths where the data extraction tools are and just use their data directly. I will try experimenting with that today.
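
A minimal sketch of what such a proxy could look like, using only the Python standard library. It forwards GET requests to an upstream host and adds an `Access-Control-Allow-Origin` header to the response. The upstream host is an assumption taken from the PlasmoDB URL that appears elsewhere in this thread, and `CorsProxy`, `upstream_url`, and `serve` are hypothetical names. It does not handle OPTIONS preflight requests, network failures, or caching, so treat it as an experiment rather than something to deploy.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Assumed upstream; swap in whichever VEuPathDB site is being proxied.
UPSTREAM = "https://plasmodb.org"


def upstream_url(path, upstream=UPSTREAM):
    """Map a local request path onto the upstream host."""
    return upstream.rstrip("/") + "/" + path.lstrip("/")


class CorsProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        req = Request(upstream_url(self.path))
        # Pass Range through so partial reads of binary files keep working.
        if "Range" in self.headers:
            req.add_header("Range", self.headers["Range"])
        try:
            with urlopen(req, timeout=30) as resp:
                status, headers, body = resp.status, resp.headers, resp.read()
        except HTTPError as err:
            # Relay upstream error responses instead of crashing the handler.
            status, headers, body = err.code, err.headers, err.read()
        self.send_response(status)
        for name in ("Content-Type", "Content-Range"):
            if headers.get(name):
                self.send_header(name, headers[name])
        self.send_header("Content-Length", str(len(body)))
        # The header that lets a browser on another origin read the response.
        self.send_header("Access-Control-Allow-Origin", "*")
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Run the proxy on localhost until interrupted."""
    ThreadingHTTPServer(("127.0.0.1", port), CorsProxy).serve_forever()
```

Calling `serve()` and pointing a JBrowse config at `http://127.0.0.1:8080/...` instead of the upstream host would be one way to try this locally.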

jdavcs · 2024-11-14T16:37:08Z

@scottcain @d-callan We have roughly 11TB of flat files representing 13 DBs. I don't know what exactly we have, but here's a tiny sample:

Pyoeliiyoelii17X
├── bam
├── bigwig
├── blast
├── blat
├── config
├── fasta
├── gff
├── highSpeedSnpSearch
├── longReadRNASeq
├── motif
└── nrProteinsToGenomeAlign

scottcain · 2024-11-14T18:16:14Z

More thoughts for @d-callan : what would be really cool would be a tool that, given the assembly we want it for, would extract the information needed from the database for what tracks are available and construct a config. There are a few details hidden in that one sentence, like

when the track is derived from a binary file like bigwig or bam, figuring out where that is in the file system John referred to above would be helpful (I'm guessing not too difficult)
when the track is derived from data in the database, it will have to be spit out into a flatfile that makes sense, and then probably be post processed (bgzipped and tabix indexed gff, for example)
generating JBrowse 2 configs would be helpful, rather than JBrowse 1 (not hard, just different)
generating JB2 configs also assumes that the assembly's fasta file is indexed in some way (like bgzip and faidx); that isn't your problem but something that will need to be addressed.
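
A rough sketch of the file-backed half of that tool, assuming a directory laid out like the sample John posted (with `bigwig` and `bam` subdirectories). It walks those two directories and emits a JBrowse 2 config with an assembly backed by a bgzipped, faidx-indexed FASTA. The file extensions (`.bw`, `.bam`), the base URL, and the FASTA URL are all assumptions, and `build_config` is a hypothetical name; the database-derived tracks are not covered here.

```python
from pathlib import Path


def uri(url):
    """JBrowse 2 location object pointing at a remote file."""
    return {"uri": url, "locationType": "UriLocation"}


def assembly_config(name, fasta_gz_url):
    """Assembly backed by a bgzipped FASTA with .fai and .gzi indexes."""
    return {
        "name": name,
        "sequence": {
            "type": "ReferenceSequenceTrack",
            "trackId": f"{name}-refseq",
            "adapter": {
                "type": "BgzipFastaAdapter",
                "fastaLocation": uri(fasta_gz_url),
                "faiLocation": uri(fasta_gz_url + ".fai"),
                "gziLocation": uri(fasta_gz_url + ".gzi"),
            },
        },
    }


def track_for(path, root, assembly, base_url):
    """Return a track config for a known binary file type, else None."""
    rel = path.relative_to(root).as_posix()
    url = f"{base_url}/{rel}"
    common = {
        "trackId": rel.replace("/", "_"),
        "name": path.stem,
        "assemblyNames": [assembly],
    }
    if path.suffix == ".bw":  # assumed BigWig extension
        adapter = {"type": "BigWigAdapter", "bigWigLocation": uri(url)}
        return {**common, "type": "QuantitativeTrack", "adapter": adapter}
    if path.suffix == ".bam":  # assumes a .bai index next to each BAM
        adapter = {
            "type": "BamAdapter",
            "bamLocation": uri(url),
            "index": {"location": uri(url + ".bai"), "indexType": "BAI"},
        }
        return {**common, "type": "AlignmentsTrack", "adapter": adapter}
    return None


def build_config(root, assembly, base_url, fasta_gz_url):
    """Scan root/bigwig and root/bam and assemble a JBrowse 2 config dict."""
    root = Path(root)
    tracks = []
    for sub in ("bigwig", "bam"):
        if not (root / sub).is_dir():
            continue
        for path in sorted((root / sub).rglob("*")):
            track = track_for(path, root, assembly, base_url) if path.is_file() else None
            if track:
                tracks.append(track)
    return {"assemblies": [assembly_config(assembly, fasta_gz_url)], "tracks": tracks}
```

Dumping the result with `json.dumps(build_config(...), indent=2)` would give a starting `config.json`; display names would still need to come from the database metadata.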

d-callan · 2024-11-20T15:50:17Z

so if were serious about this.. id say that tool already exists. its at/ sort of is veupath. which means for us that tool is a spider.. no?

ex: https://plasmodb.org/plasmo/service/jbrowse/tracks/pvivP01/trackList.json and follow it round.
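
"Following it round" could start with something like the sketch below, which summarizes the tracks listed in a JBrowse 1 style `trackList.json` and resolves each track's data location relative to the trackList URL. The keys it reads (`label`, `key`, `storeClass`, `urlTemplate`, `baseUrl`) are standard JBrowse 1 config keys, but whether the VEuPathDB service uses exactly these is an assumption, and the sample document here is invented for illustration rather than copied from PlasmoDB.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen


def load_track_list(url, timeout=60):
    """Fetch and parse a trackList.json (not called in this sketch)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize_tracks(track_list, track_list_url):
    """One row per track: label, display name, store class, resolved data URL."""
    rows = []
    for track in track_list.get("tracks", []):
        template = track.get("urlTemplate") or track.get("baseUrl")
        rows.append({
            "label": track.get("label"),
            "name": track.get("key"),
            "store": track.get("storeClass"),
            # Relative templates are resolved against the trackList location.
            "data_url": urljoin(track_list_url, template) if template else None,
        })
    return rows


TRACKLIST_URL = "https://plasmodb.org/plasmo/service/jbrowse/tracks/pvivP01/trackList.json"

# Invented sample in the general JBrowse 1 shape, for illustration only.
sample = {
    "tracks": [
        {"label": "gene", "key": "Genes",
         "storeClass": "JBrowse/Store/SeqFeature/REST",
         "baseUrl": "/a/hypothetical/rest/store"},
        {"label": "cov", "key": "Coverage",
         "storeClass": "JBrowse/Store/SeqFeature/BigWig",
         "urlTemplate": "bigwig/hypothetical.bw"},
    ]
}
rows = summarize_tracks(sample, TRACKLIST_URL)
for row in rows:
    print(row)
```

Tracks whose store is a static file (BigWig, BAM) can be downloaded directly; tracks served by a REST store would need per-reference-sequence feature requests and conversion to a flat file.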

maximilianh · 2024-11-20T16:07:06Z

I cannot imagine that downloading the annotation would ever be a big problem. The genome, all the annotations, are tiny, for a normal server. We're talking a few kb per track at most. I wouldn't download with 100 parallel threads, just one thread is enough. And yes, jbrowse itself also grabs an entire genome's worth of annotations when you zoom out.
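
A sketch of the single-threaded approach described above: fetch URLs one at a time with a pause between requests and a User-Agent that identifies the crawler. The delay, the User-Agent string, and the function names are arbitrary assumptions; the `fetch` and `sleep` parameters exist only so the loop can be exercised without touching the network.

```python
import time
from urllib.request import Request, urlopen

# Hypothetical identifier so the server operators can see who is crawling.
USER_AGENT = "brc-track-export-sketch/0.1"


def fetch(url, timeout=60):
    """Download one URL and return the response body as bytes."""
    req = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read()


def crawl(urls, delay_s=1.0, fetch=fetch, sleep=time.sleep):
    """Fetch URLs strictly one after another, pausing between requests."""
    results = {}
    for i, url in enumerate(urls):
        if i:  # no pause before the first request
            sleep(delay_s)
        results[url] = fetch(url)
    return results
```

With one request in flight at a time, the load on the server is comparable to a single user browsing.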


d-callan · 2024-12-09T16:00:50Z

so some questions relating to this:

1. How can we provide context for all these data? the dataset and sample display names from veupath are unlikely to be inherently meaningful to users.
2. For the experimental data, do we think ppl would want to play w it in galaxy? say do something like find differentially expressed genes in an rnaseq experiment, say? and if so, how can we facilitate that?

maximilianh · 2024-12-09T17:11:00Z

1. well, if we have no other meta data, what can we do?
2. the data will be shown on the UCSC browser, and to export to galaxy, the ucsc data download button has a "send to Galaxy" option.


d-callan · 2024-12-09T17:21:18Z

Idk tbh, links to veupath dataset record pages? Or if we tried to get metadata what could we do w it? What would be worth trying to get?
This is a thing I'd like to play w..

nekrut added this to BRC development tasks Sep 19, 2024

nekrut assigned scottcain Oct 10, 2024

nekrut converted this from a draft issue Oct 10, 2024
