[BUG] earthaccess.download gets different data than curl from an OPeNDAP service #887

Closed
1 task done
itcarroll opened this issue Dec 4, 2024 · 8 comments · Fixed by #920
Labels
type: bug Something isn't working

Comments

@itcarroll
Collaborator

Is this issue already tracked somewhere, or is this a new report?

  • I've reviewed existing issues and couldn't find a duplicate for this problem.

Current Behavior

Reporting an issue noted by @tsnow03 on the CryoCloud slack.

Giving earthaccess.download a URL for the LAADS OPeNDAP service (in this case, one that returns a NetCDF4 formatted version of the archival HDF-EOS file) returns a gzipped file. Using curl on the same URL returns an uncompressed file. If it is intended that earthaccess.download get a compressed file, then some notification should be given. If not ...

Expected Behavior

I expect earthaccess.download to download a file identical to what curl downloads for a given URL.

Steps To Reproduce

Show that earthaccess writes a compressed file:

import earthaccess

url = "https://ladsweb.modaps.eosdis.nasa.gov/opendap/RemoteResources/laads/allData/61/MOD07_L2/2000/055/MOD07_L2.A2000055.0000.061.2017202185924.hdf.nc4"
earthaccess.download(url, "data")

with open("data/MOD07_L2.A2000055.0000.061.2017202185924.hdf.nc4", "rb") as f:
    print(f.read(4))
b'\x1f\x8b\x08\x00'

That looks to me like a gzipped file, and passing the file through gunzip does allow it to be opened with netCDF4.
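The magic-byte check and the gunzip step above can be scripted. The following is a minimal sketch using only the standard library; `is_gzipped` and `gunzip_in_place` are hypothetical helper names introduced here for illustration and are not part of earthaccess.

```python
import gzip
import os
import shutil
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream
HDF5_MAGIC = b"\x89HDF"   # first four bytes of an HDF5 (NetCDF4) file


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def gunzip_in_place(path, chunk_size=1024 * 1024):
    """Replace a gzipped file with its decompressed contents.

    Hypothetical helper: decompresses into a temporary file in the same
    directory, reading `chunk_size` bytes at a time, then swaps it in.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as dst, gzip.open(path, "rb") as src:
            shutil.copyfileobj(src, dst, chunk_size)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temporary file and leave the original untouched.
        os.unlink(tmp_path)
        raise
```

Calling `is_gzipped(path)` on the file written by earthaccess.download and, if it returns True, `gunzip_in_place(path)` would serve as a workaround until the download itself is fixed.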

On the other hand, curl writes an uncompressed HDF5 file.

!curl -O "https://ladsweb.modaps.eosdis.nasa.gov/opendap/RemoteResources/laads/allData/61/MOD07_L2/2000/055/MOD07_L2.A2000055.0000.061.2017202185924.hdf.nc4"

with open("MOD07_L2.A2000055.0000.061.2017202185924.hdf.nc4", "rb") as f:
    print(f.read(4))
b'\x89HDF'

Environment

- OS: CryoCloud (ubuntu jammy)
- Python: 3.11.9

Additional Context

No response

@mfisher87 mfisher87 added the type: bug Something isn't working label Dec 4, 2024
@mfisher87
Collaborator

This is certainly weird. It must be OPeNDAP conditionally compressing the data based on headers or something like that? *waves hands*

I don't think this is intended.

@itcarroll
Collaborator Author

I had originally suspected that something was up with the OPeNDAP service, but the response from LAADS user services indicates that they do not compress on their end: https://forum.earthdata.nasa.gov/viewtopic.php?t=6247

@mfisher87
Collaborator

I think we need to debug and compare the HTTP requests! Then we can probably force this to occur with curl as well through trial and error.
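One way such a comparison can turn up a difference is the Accept-Encoding request header: curl sends none unless asked, while requests advertises gzip by default. The sketch below illustrates the mechanism with a local, hypothetical server that compresses only when the client advertises gzip. It is an assumption about how a server or proxy might behave, not a description of the LAADS service; `ConditionalGzipHandler`, `fetch_first_bytes`, and `PAYLOAD` are names made up for the demo.

```python
import gzip
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

PAYLOAD = b"\x89HDF" + b"\x00" * 1024  # stand-in for an HDF5 file


class ConditionalGzipHandler(BaseHTTPRequestHandler):
    """Hypothetical server: compresses only when the client advertises gzip."""

    def do_GET(self):
        accepts_gzip = "gzip" in self.headers.get("Accept-Encoding", "")
        body = gzip.compress(PAYLOAD) if accepts_gzip else PAYLOAD
        self.send_response(200)
        if accepts_gzip:
            self.send_header("Content-Encoding", "gzip")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_first_bytes(url, headers=None, n=4):
    """Return the first `n` body bytes exactly as sent over the wire."""
    request = urllib.request.Request(url, headers=headers or {})
    # An empty ProxyHandler ignores any proxy settings in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request) as response:
        return response.read(n)  # urllib does not decode Content-Encoding


server = HTTPServer(("127.0.0.1", 0), ConditionalGzipHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/file.nc4"

plain = fetch_first_bytes(url)                                    # no gzip advertised
compressed = fetch_first_bytes(url, {"Accept-Encoding": "gzip"})  # gzip advertised

server.shutdown()
server.server_close()

print(plain)           # starts with the HDF5 signature
print(compressed[:2])  # starts with the gzip magic number
```

If the real service behaves like this, running curl with `-H "Accept-Encoding: gzip"` on the URL from the report should reproduce the gzipped file.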

@maxrjones

maxrjones commented Dec 12, 2024

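requests sends an Accept-Encoding header that includes gzip by default, so the server may respond with gzip-compressed content (Content-Encoding: gzip). requests normally decodes such content automatically. However, earthaccess bypasses this automatic decoding by using Response.raw in: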

with open(path, "wb") as f:
    # This is to cap memory usage for large files at 1MB per write to disk per thread
    # https://docs.python-requests.org/en/latest/user/quickstart/#raw-response-content
    shutil.copyfileobj(r.raw, f, length=1024 * 1024)

If you want to copy the automatically decoded content to the file instead, you should use Response.content instead of Response.raw but that'd impact the memory usage cap strategy.

@maxrjones

The note in this section of the docs actually provides a better explanation of what's causing this issue - https://docs.python-requests.org/en/latest/user/quickstart/#raw-response-content

@itcarroll
Collaborator Author

itcarroll commented Dec 12, 2024

Thanks @maxrjones!

The practice recommended in those docs is to use Response.iter_content instead of raw, which also lets us set a chunk size.

Any opposition to testing that switch?
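For illustration, the switch might look something like the sketch below. It is not necessarily what the eventual fix does. `save_response` is a hypothetical name, and `response` is assumed to be a `requests.Response` obtained with `stream=True`.

```python
def save_response(response, path, chunk_size=1024 * 1024):
    """Stream a response body to `path`, decoding any Content-Encoding.

    Assumes `response` is a `requests.Response` created with `stream=True`.
    Unlike copying from `response.raw`, `iter_content` yields bytes with
    gzip/deflate transfer encodings already removed. `chunk_size` controls
    how much is read at a time, which keeps memory use bounded; the exact
    size of each decoded chunk may vary.
    """
    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            f.write(chunk)
    return path
```

With requests, this would be called as `with session.get(url, stream=True) as r: save_response(r, path)`, where `session` stands for whatever authenticated `requests.Session` is in use. The 1 MB default mirrors the per-write cap in the existing code.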

@mfisher87
Collaborator

Good find @maxrjones , thank you!

> The practice recommended in those docs is to use Response.iter_content instead of raw, which also lets us set a chunk size.
>
> Any opposition to testing that switch?

🚀 🚀 🚀

@mfisher87
Collaborator

Thanks @itcarroll for the implementation!

Projects
Status: Done
3 participants