Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing decoration items with UCSC genome browser #48

Open
mrvollger opened this issue Jul 19, 2024 · 36 comments
Open

Missing decoration items with UCSC genome browser #48

mrvollger opened this issue Jul 19, 2024 · 36 comments

Comments

@mrvollger
Copy link

mrvollger commented Jul 19, 2024

Hello,

I have found a discrepancy in the number of fields and some other metadata when using bigtools vs ucsc and it is having an impact on how things appear in the browser.

Basically, when I test with bigbedinfo I see differences in field number and other metadata columns (the correct number of fields is 17):

# bigtools
version: 4
fieldCount: 3
isCompressed: yes
isSwapped: 0
itemCount: 1100122467552
primaryDataSize: 367,623,108
primaryIndexSize: 20,200
zoomLevels: 6
chromCount: 2
basesCovered: 97,494,196
meanDepth: 44.449296
minDepth: 1.000000
maxDepth: 3195.000000
std of depth: 69.740202
version: 4

# UCSC
fieldCount: 17
isCompressed: yes
isSwapped: 0
itemCount: 1100122467552
primaryDataSize: 386,514,961
primaryIndexSize: 46,952
zoomLevels: 8
chromCount: 2
basesCovered: 101,269,035
meanDepth: 82.861921
minDepth: 1.000000
maxDepth: 4056.000000
std of depth: 96.352168

I have uploaded all the data in my test to:
https://s3-us-west-2.amazonaws.com/stergachis-public1/index.html?prefix=Mitchell/temp/bigtools-test/

And here is the script I am running:

zcat my.bed.gz | bigtools bedtobigbed -a decoration.as -s start - hg38.fai bigtools.bb
bedToBigBed -as=decoration.as -type=bed12+ my.bed.gz hg38.fai ucsc.bb

bigbedinfo bigtools.bb
bigbedinfo ucsc.bb

Any help with an obvious error on my part would be great!

Cheers,
Mitchell

@jackh726
Copy link
Owner

working on this, should have a solution soon

@mrvollger
Copy link
Author

Sorry to be a bother but have you had a chance to look at this?

@jackh726
Copy link
Owner

Sorry for the delay - just published bigtools 0.5.1 which should fix the fieldCount and definedFieldCount when an autosql is passed.

@mrvollger
Copy link
Author

Awesome, it's working for me. I'll approve the bioconda PR once it passes tests :)

@mrvollger
Copy link
Author

Hi @jackh726,

Found an issue with the new release that I think is requiring the autosql argument to be passed everytime.

I have a standard three column bedfile and when I try:

zcat results/ONT_heuristic/coverage/unreliable-coverage-regions.bed.gz | bigtools bedtobigbed -s start  - ~/assemblies/hg38.analysisSet.fa.fai tmp.bb

I get

Error: Os { code: 2, kind: NotFound, message: "No such file or directory" }

But if I add the SQL argument, it works:

zcat results/ONT_heuristic/coverage/unreliable-coverage-regions.bed.gz | bigtools bedtobigbed -s start -a tmp.as - ~/assemblies/hg38.analysisSet.fa.fai tmp.bb

And here tmp.as is just the three-column definition:

table bed3 "bed3"
(
string  chrom;  "Reference sequence chromosome or scaffold"
uint    chromStart; "Start position in chromosome"
uint    chromEnd;   "End position in chromosome"
)

I am also noticing that I have to filter out lines starting with the BED comment character (#), whereas I don't think I needed to before, but I am not sure about that.

For now, I can add the as files for all my commands, but I wanted to let you know.

Cheers,
Mitchell

@mrvollger mrvollger reopened this Jul 30, 2024
@jackh726
Copy link
Owner

v0.5.2 should at least not error when not passing an autosql (this only happened when passing the bed via stdin)

I still need to add the following:

  1. autosql parsing when using stdin
  2. Filtering of commented lines

@mrvollger
Copy link
Author

I have not had time to put together a test case to share yet, but now it seems to me that the --autosql argument is ignored when using stdin with 0.5.2 whereas it worked with 0.5.1. But I am realizing maybe that is what you meant with:

autosql parsing when using stdin

?

@jackh726
Copy link
Owner

Ah no, that should work but I made a mistake. Will update with fix either later today or tomorrow.

@mrvollger
Copy link
Author

Great thank you!

Sorry for all the questions/comments, but I think I am also noticing that setting the zoom level on the command line is not respected when I check after the file is created with bigbedinfo.

@jackh726
Copy link
Owner

jackh726 commented Aug 14, 2024

No worries - the extra use/testing is helpful. I'll take a look at that too

@mrvollger
Copy link
Author

Awesome thanks!

I was playing with the zoom level because I thought it might fix an issue I am seeing where some data will disappear as I zoom in when using decorator tracks.

e.g.
image
to
image

@jackh726
Copy link
Owner

Ah that's weird, if you can get a reproduction, I can look into it.

@mrvollger
Copy link
Author

mrvollger commented Aug 14, 2024

Thanks, I made a test case here that reproduces the issue in this region of hg38 chrX:47191183-47195924, and uploaded it here:
https://s3-us-west-2.amazonaws.com/stergachis-public1/index.html?prefix=Mitchell/tests/bigtools-test/
And within that dir is a "hub.txt" that you can load on UCSC to see the issue:
image
When making the small test case, I did find that depending on the size of the dataset, I sometimes would or wouldn't see the issue pop up. This was the smallest file I could get where the issue appeared.

(and my commands are in run.sh)

@mrvollger
Copy link
Author

Just wondering if you got a chance to look at this? Thanks in advance!

@jackh726
Copy link
Owner

Sorry, this fell off my radar. I'm looking at this now - will let you know.

@jackh726
Copy link
Owner

So, I ran out of time looking at this for now, but here are my general thoughts:

It doesn't seem like this is an issue with the data that is written. I.e. it doesn't look like there's a bug that is causing some data to not be written to the final bigBed.

Rather, I think this is more likely a quark fundamentally attributed to zooms. My hunch is that because you're writing the bigBed from stdin, the heuristics for calculating/including zoom levels just has a poor interaction here. And really, I think it all comes down to the decoration bigBed. I think likely the calculated zoom sizes are just too big.

The easiest check here is just write the declaration bigBed with nzooms=0, which should force the genome browser to use the primary data. I have wanted to add an option to be a little bit more explicit about what zooms to write, so this might be the push to do that. I expect you would then want to manually set the initial zoom to be small.

@jackh726
Copy link
Owner

I'll try to get back to this maybe later today or this week and test out my theory about writing the decoration bigBed with no zooms (if you don't get to it).

@mrvollger
Copy link
Author

mrvollger commented Sep 10, 2024

Thanks for starting to look into this!

To help out a little I just tested creating from stdin and from a pre-made bed file, and both with a variety of zoom levels (including 0), and unfortunately none of the combinations made these items visible on UCSC.

I extended the script and trackhub so that it also contains a ucsc test just to confirm that it works with the OG tooling:
image

So I don't think it is something I can fix by forcing the zoom level.

(reminder: I am using bigtools 0.5.1 since 0.5.2 since the --autosql argument seems to be ignored in this case with 0.5.2 #48 (comment)).

@jackh726
Copy link
Owner

Thanks, this is helpful. I'll keep looking into it, but I'm really not sure!

@mrvollger
Copy link
Author

Just being a squeaky wheel. Thanks in advance!

@jackh726
Copy link
Owner

Sorry! Gotten back to this a bit. Still looking, but it's definitely a zoom-related issue with the decoration track.

Using the ucsc read bigBed with the bigtools decoration bigBed results in the same pattern that you see above (no zooms). Manually specifying 4 zooms (haven't yet added the ability to specify zoom sizes, so the default sizes are 40,160,640,2560) actually results in no decorated items showing. Next step here is for me to manually specify the zooms and see if that fixes it.

As a side point, I'm noting that the basesCovered summary item (and the others) are not the same across ucsc and bigtools, which is probably unrelated, but interesting. itemCount is the same (and calling bigBedToBed on both results in the same output).

@mrvollger
Copy link
Author

Just making a note that I am still interested in this.

I'd very much like to move this into our production pipeline since it is a fantastic speed up.

Thanks again for the tool!

@jackh726
Copy link
Owner

Gah! Sorry! I keep forgetting to get back to this... really sorry about the slow progress here.

@jackh726
Copy link
Owner

So, have been investigating this a bit further. And, well, I'm perplexed.

To recap: previously, making a hub with the ucsc data file and bigtools decoration file did not work, but ucsc decoration file and bigtools data file did work.

I added the functionality to manually specify zooms (instead of bigtools calculating them, which is a bit different than ucsc), tried that and it didn't work. So, these none of 0 zooms, 4 zooms (different), or 4 matched zooms work.

As a side point, I'm noting that the basesCovered summary item (and the others) are not the same across ucsc and bigtools, which is probably unrelated, but interesting. itemCount is the same (and calling bigBedToBed on both results in the same output).

I then looked into this, in case it's related. Which ended up being it's own rabbit hole. Essentially, rather than focusing on basesCovered, I looked at maxDepth. I made a smaller file (dec.cut1-3.maxdebug.bed.txt) that differed between ucsc and bigtools (ucsc says maxDepth as 200, bigtools says 185). I did a separate pileup with bedtools (bedtools genomecov -bg -g hg38.fai -i dec.cut1-3.maxdebug.bed > dec.cut1-3.maxdebug.pileup.bed) and that shows the max depth as 185...so it seems like a probably bug in the ucsc bedToBigBed, or I'm just missing something.

Along the way, I tried making a bigBed with a single zoom of 1 bp, and realized there was a bug in bigtools (essentially, when there are overlapping regions, I was not counting past the first region). I fixed this, but it still doesn't fix the decoration problem (which, is not that surprising given that no zooms also doesn't work).

So...still working on it.

(I have a testing branch here with ongoing fixes: https://github.com/jackh726/bigtools/tree/dec-testing2)

@jackh726
Copy link
Owner

Okay next weird thing - subsetting the input bed file where the fourth column == "m6A" seems to result in identical outputs...

@mrvollger
Copy link
Author

Thanks for starting to investigate!

I will say the second thing doesn't surprise me that much. For example, depending on the size of the region I was making for the test case different fibers in different regions disappeared.

Would it be helpful to ask UCSC anything? I know a person or two on the browser team.

@jackh726
Copy link
Owner

Yeah, at this point its probably worth asking the ucsc folks.

I've tried "minimizing" the decoration file in different ways prior to running it through bigtools (subsetting to only m6A, removing FIRE lines, only including regions overlapping the target window) and each time i can't reproduce the problem in the genome browser. Subsetting the data file to one entry that is missing the decorations and one that isn't does reproduce the problem though.

I might have some time to poke at this a bit more today. I'm really just perplexed because it doesn't seem like a file format issue (given that just subsetting the initial bed file has yet to reproduce the issue) and some of the decorations are correct even in the reproduction...

@mrvollger
Copy link
Author

I reached out the to creator of the decorator tracks and linked the issue and asked for any advise he might have. I do know he's has since started working on his masters degree so not sure he'll have time to look.

Did I understand you correctly in that you were able to get a single read version that recreates the issue? If so I'd add it to the track hub if you could share the file.

@mrvollger
Copy link
Author

I did have one idea. There is an option in ucsc bedtobigbed that allows a 1bp overlap between blocks in bed12 files which is something I take advantage of with the m6a track specifically. Is that something that might hit some corner case in bigtools?

@jackh726
Copy link
Owner

My test hub is here: https://users.abdenlab.org/hueyj/bigtools-test/hub.txt

re. 1bp overlap (#49?) I think that makes no difference - ucsc does some checks that needs that flag, and bigtools doesn't do those

@jackh726
Copy link
Owner

For my hub (as it is right now, might be down - we're having server troubles):

  • the "bigtools-test" track is only two regions in the "read" bed file, one missing decorations in the repro, one fine
  • the "bigtools-test-sub" track is the same as the above, but the decoration bed file was subsetted to only include entries matching the above two regions
  • can ignore the "bigtools-test-4zooms" and "bigtools-test-matchedzooms" tracks
  • the "ucsc-test" track is the same two regions in the "read" bed file, with the full decoration bed file passed through ucsc bedToBigBed
  • the "bigtools-test-matchedzooms-m6A" and "ucsc-test-m6A" tracks are like "bigtools-test" and "ucsc-test", but the decoration files are filtered to only have '$4 == "m6A"'

@jackh726
Copy link
Owner

I was starting to dig through the kent genome browser code to see exactly how the decorator tracks are loaded/processed to see if there's something special happening that relies on exact bigBed layout or something, then we started having server trouble.

@jackh726
Copy link
Owner

jackh726 commented Nov 4, 2024

Btw, I just published version 0.5.3, which should fix #48 (comment) and a few other things I found when testing this. It doesn't fix the main issue though, which I unfortunately still haven't found the cause of.

@mrvollger
Copy link
Author

Great, I will update to that version for the other fixes!

One more corner case I have thought of for this data. These bed12 files do sometimes have a zero-length "block/exon". Do you think that could be causing an issue in bigtools?

@mrvollger
Copy link
Author

Nevermind, on the zero length. That is not something I am doing here, got confused with another tool I am working on

@jackh726 jackh726 changed the title Missing fields when using -a with bedtobigbed Missing decoration items with UCSC genome browser Nov 30, 2024
@jackh726
Copy link
Owner

Went ahead and renamed the issue to reflect the current state of things - figured there's enough background that even it's better to just keep it here than split it out (even though the original issue is solved)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants