Memory spike when loading .mib data lazily #266
Comments
@emichr can you try the same thing with chunks of (16, 16, 256, 256)?
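A minimal sketch of what trying that chunking could look like, using the same `hs.load` call that appears in the report below; `"data.mib"` is a placeholder for the actual dataset path:

```python
import hyperspy.api as hs

# Same lazy load as in the report, but with the suggested (16, 16, 256, 256)
# chunking; "data.mib" is a placeholder for the actual dataset path.
s = hs.load("data.mib", lazy=True, chunks=(16, 16, 256, 256))
print(s)  # lazy signal backed by a dask array with the requested chunks
```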
@CSSFrancis The peak got a little lower, but it is still significant (and so is the load time). I just tested on a laptop and there are no issues there, so it might be an issue with the cluster (or with the environment, of course). The environment .yml file from the HPC looks like this:

It might be an issue better discussed with the particular HPC infrastructure, but I guess other users might run into this issue down the line.
@emichr One thing you might try is to first load the data without using a distributed scheduler and then save it as a zarr:

```python
import zarr
import hyperspy.api as hs

s = hs.load("../../Downloads/FeAl_stripes.hspy", lazy=True)
s_zip2 = zarr.ZipStore("FeAl_stripes.zspy")
s.save(s_zip2)
```

Or you can always just zip the data afterwards, as per the note in the documentation: https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.ZipStore
Testing this on my own computer, I see exactly the same memory issue as @emichr. Doing this uses a lot of memory:

```python
from rsciio.quantumdetector._api import file_reader

data = file_reader("005_test.mib", lazy=True)  # data is a dask array
```

Doing this is fast, and uses almost no memory:

```python
from rsciio.quantumdetector._api import load_mib_data

data = load_mib_data("005_test.mib", lazy=True)  # data is a numpy memmap
```

Trying to make it into a dask array is slow and uses a lot of memory:

```python
import dask.array as da

dask_array = da.from_array(data)
```

I tested this on dask version
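To make the comparison concrete, here is a small sketch of how one might watch the RSS around each step. It assumes `psutil` is installed and reuses the file name from the comment above:

```python
import os

import dask.array as da
import psutil  # assumed to be available; only used to read the current RSS

from rsciio.quantumdetector._api import load_mib_data

def rss_gb():
    # Resident set size of this process, in GB.
    return psutil.Process(os.getpid()).memory_info().rss / 1e9

print(f"start:               {rss_gb():.2f} GB")
data = load_mib_data("005_test.mib", lazy=True)   # numpy memmap, cheap
print(f"after load_mib_data: {rss_gb():.2f} GB")
dask_array = da.from_array(data)                  # the step that showed the spike
print(f"after da.from_array: {rss_gb():.2f} GB")
```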
Huh... That shouldn't be the case. I wonder if this is an upstream bug in dask?
Seems to be due to a change in dask. Ergo, it seems like this was introduced in a recent dask release.
@ericpre I assume that I am, as I haven't changed any of the defaults in hyperspy or pyxem and just use those packages "out of the box", so to speak. Not too sure how I could check this more thoroughly though; I'm not that familiar with dask, to be frank.
I'm making a minimal working example to post on the dask GitHub, as they might know a bit more about how to resolve this.
I made an issue about this on the dask GitHub: dask/dask#11152
I tested this on a Windows and a Linux computer: same issue on both.
@sivborg found the change in dask responsible. So for now, a temporary fix is to downgrade dask to an earlier release.
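After downgrading, a quick way to confirm which dask version is actually active in the environment (a sketch; the known-good version number is the one given in the comment above and is not repeated here):

```python
import dask

# Print the installed dask version to verify the downgrade took effect.
print(dask.__version__)
```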
Even if this seems to be irrelevant here, for completeness, I will elaborate on my previous comment: since you mentioned that this was happening on a cluster, I assume that you were using the distributed scheduler - otherwise, it will not scale! This is briefly mentioned in the user guide, but there are more details in the dask documentation.
Just adding on to that: if you wanted to load .mib files using memmap and the distributed backend, you would have to adjust how the data is loaded. This part of the dask documentation describes how to do that: https://docs.dask.org/en/latest/array-creation.html?highlight=memmap#memory-mapping
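A rough sketch of the pattern described on that page: each task opens its own memmap, so nothing large has to be pickled and sent to distributed workers. The file name, shapes, dtype and offsets below are illustrative only and ignore the .mib frame headers and structured dtype, so this is not a working .mib reader:

```python
import dask
import dask.array as da
import numpy as np

filename = "005_test.mib"      # placeholder path
frame_shape = (256, 256)       # illustrative frame size
dtype = np.dtype(np.uint16)
n_frames = 800 * 320           # illustrative navigation size, flattened
frames_per_chunk = 4096

@dask.delayed
def load_chunk(start, n):
    # Each task creates its own memmap, which works with the distributed
    # scheduler because no memmap object is sent between processes.
    offset = start * int(np.prod(frame_shape)) * dtype.itemsize
    mm = np.memmap(filename, dtype=dtype, mode="r",
                   shape=(n,) + frame_shape, offset=offset)
    return np.asarray(mm)  # materialise only this chunk when computed

chunks = []
for start in range(0, n_frames, frames_per_chunk):
    n = min(frames_per_chunk, n_frames - start)
    chunks.append(da.from_delayed(load_chunk(start, n),
                                  shape=(n,) + frame_shape, dtype=dtype))
data = da.concatenate(chunks, axis=0)
```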
Yesterday, I had a quick go at implementing #162 for the mib reader - it was in the back of my mind that this is something that needs to be done. Anyway, I got stuck on the structured dtype used in the mib reader, because it messed up the chunks and shape...
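To illustrate why a structured dtype is awkward here: the frame dimensions are hidden inside the dtype, so the array has the "wrong" shape and an itemsize that includes the header. The field names and header size below are made up, not the real mib layout:

```python
import numpy as np

# Made-up structured dtype with a per-frame header, loosely in the spirit of
# a raw frame-based format; the real mib fields and header size differ.
frame_dtype = np.dtype([("header", np.uint8, (384,)),
                        ("data", np.uint16, (256, 256))])

buf = np.zeros(4, dtype=frame_dtype)  # stand-in for a memmap of 4 frames
print(buf.shape)          # (4,)           -> frame shape is hidden in the dtype
print(buf["data"].shape)  # (4, 256, 256)  -> only visible after selecting the field
```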
This seems to have been fixed in the current development version of dask. Not sure when the next dask release is due, though.
Dask usually releases on a Friday every two weeks. The next one should be on the 14th of June.
I checked quickly and this is fixed using the main branch of dask.
Describe the bug

When loading large .mib data with `hs.load("data.mib", lazy=True)`, the RSS memory spikes to more than the size of the dataset. For instance, when loading a 31.3 GB dataset without specifying any chunking, the RSS memory spikes at about 95 GB. With different chunking the memory spike changes, but it is still a problem (e.g. RSS spikes at 73 GB with a (64, 64, 64, 64) chunking of a (800, 320|256, 256) uint16 dataset). This figure shows the RSS memory usage as a function of time when running `hs.load(..., lazy=True, chunks=(64, 64, 64, 64))`.

To Reproduce
Steps to reproduce the behavior:
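A minimal sketch of the call described above, assuming a large .mib file is available; `"data.mib"` stands in for the ~31 GB dataset:

```python
import hyperspy.api as hs

# The call from the description; watch RSS (e.g. with top/htop or psutil)
# while this runs. "data.mib" is a placeholder for the ~31 GB dataset.
s = hs.load("data.mib", lazy=True, chunks=(64, 64, 64, 64))
```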
Expected behavior
The RSS memory requirement shouldn't exceed the dataset size on disk, and should be much lower when loaded lazily.
Python environment:
Additional context
The problem was encountered on an HPC cluster, but I assume the problem will persist on other machines.