
hdf5 files are opened too often in DataContainer when loading SPHInX jobs #1574

Open

freyso opened this issue Aug 29, 2024 · 7 comments

@freyso
Contributor

freyso commented Aug 29, 2024

Test example:

import pyiron
import pyiron_base.storage.hdfio
import h5io_browser.base

# --- replace the module-level _open_hdf in pyiron_base.storage.hdfio by a wrapper that counts calls
count_open_calls = 0

def _open_hdf(filename, mode="r", swmr=False):
    global count_open_calls
    count_open_calls += 1
    return h5io_browser.base._open_hdf(filename, mode, swmr)

pyiron_base.storage.hdfio._open_hdf = _open_hdf

pr = pyiron.Project("test")
bulk_str = pr.create.structure.bulk("TiC", crystalstructure="rocksalt", a=4.33)
dft = pr.create.job.Sphinx("bulk-dummy")
dft.structure = bulk_str
dft.input["Xcorr"] = "PBE"
dft.set_encut(450)
dft.set_kpoints(3 * [4])
print(count_open_calls)
dft.calc_static()
print(count_open_calls)

dft.run(delete_existing_job=True)
print(count_open_calls)

pr.load("bulk-dummy")
print(count_open_calls)

I get 269 _open_hdf calls for the run, and 802 calls for the load.

From discussions with @jan-janssen and @pmrv, I understand that the key challenges here are:

  • one usually does not want to unconditionally load the entire hdf5 content
  • one does not want to leave the hdf5 file open for too long, because of concurrent access, e.g. from running jobs
  • at the bottom of the call stack, one cannot tell how long an hdf5 file should be kept open

Possible solution

  • create an hdf5 "open files" cache (a dict mapping file name to open h5 object) that is checked before opening an h5 file inside pyiron_base.storage.hdfio
  • add internal functions to add entries to and remove them from the cache
  • wrap those in a new context manager (say, hdf_leave_open) to be used around hdf5-intense operations

Use scenario:

from pyiron_base.storage.hdfio import hdf_leave_open

with hdf_leave_open(h5filename):
    # the file is kept open in the cache within this block
    data_container.recursive_load(h5filename)
# the h5 file is closed here

In this way, high-level code can indicate when it enters and leaves hdf5-intense code, while low-level code is augmented by a single check for an existing open file handle before opening the file. Performance-wise, an open call is much more expensive than a dictionary lookup, so there is no measurable price to pay when the cache is not used.

Using a context manager should make this rather robust against unexpected errors. The new context manager should check at context enter whether the cache already holds the file, and if so, do nothing except set a "noop" flag for the context exit. This way, nested contexts for the same file work correctly even when programmers do not realize that other parts of the code also issue caching instructions. If pyiron needs to be thread-safe, a locking mechanism for the cache access is needed as well.
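
For concreteness, here is a minimal sketch of what this could look like. All names (_OPEN_HDF_CACHE, _CACHE_LOCK, hdf_leave_open) and the direct use of h5py.File are illustrative assumptions, not the actual pyiron_base or h5io_browser code:

import threading
from contextlib import contextmanager

import h5py

_OPEN_HDF_CACHE = {}            # file name -> open h5py.File
_CACHE_LOCK = threading.Lock()  # only needed if pyiron must be thread-safe

@contextmanager
def hdf_leave_open(filename, mode="r"):
    """Keep filename open in the cache for the duration of the block."""
    with _CACHE_LOCK:
        noop = filename in _OPEN_HDF_CACHE  # nested context for the same file
        if not noop:
            _OPEN_HDF_CACHE[filename] = h5py.File(filename, mode)
    try:
        yield _OPEN_HDF_CACHE[filename]
    finally:
        if not noop:  # only the outermost context closes the file
            with _CACHE_LOCK:
                _OPEN_HDF_CACHE.pop(filename).close()

def _open_hdf(filename, mode="r", swmr=False):
    """Low-level open: reuse a cached handle when one exists."""
    # the dict lookup is much cheaper than an actual open() call;
    # this sketch deliberately ignores mode mismatches between nested calls
    handle = _OPEN_HDF_CACHE.get(filename)
    if handle is not None:
        return handle
    return h5py.File(filename, mode, swmr=swmr)

A nested with hdf_leave_open(...) on the same file then becomes a no-op, and only the outermost context closes the file.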

@pmrv
Contributor

pmrv commented Aug 29, 2024

Yep, this would be the solution I'd like too. Last time I tried to prototype it, I got bogged down in the details, but in the end I think it would have to be something like this.

@samwaseda
Member

> Last time I tried to prototype it, I got bogged down in the details, but in the end I think it would have to be something like this.

Same for me…

@pmrv
Contributor

pmrv commented Aug 29, 2024

So I did a quick survey and I think it can be implemented in h5io_browser's _open_hdf, but there are two things about the context manager approach that make this slightly more tricky:

  1. What happens if the same file gets opened with different modes in nested calls to the context manager?
  2. The returned file handle is used outside of h5io_browser as a context manager, which closes the underlying file. So if there are multiple such usages (as there are bound to be), the first one would close the file, making it unusable for the second.

The first is probably the smaller problem, but I'd have to check where we could hook in to circumvent the second.
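
One hypothetical hook for the second problem: hand out cached handles wrapped in a proxy whose close()/__exit__ do nothing, so that only the cache (i.e. the outermost hdf_leave_open) ever closes the shared file. _UnclosableFile is an invented name for illustration, not existing h5io_browser code:

class _UnclosableFile:
    """Proxy around a cached h5py.File that ignores close requests."""

    def __init__(self, h5file):
        self._h5file = h5file

    def __getattr__(self, name):
        # delegate everything else to the real file object
        return getattr(self._h5file, name)

    def __getitem__(self, key):
        return self._h5file[key]

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False  # deliberately do not close the cached file

    def close(self):
        pass  # the cache owns the handle; only hdf_leave_open closes it

Dunder methods such as __getitem__ have to be forwarded explicitly because Python looks them up on the class, not the instance.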

@freyso
Contributor Author

freyso commented Aug 30, 2024

To avoid concurrent efforts: I will make a first draft of this; we can discuss it on Monday.

@pmrv
Contributor

pmrv commented Aug 30, 2024

> To avoid concurrent efforts: I will make a first draft of this; we can discuss it on Monday.

Sorry, I couldn't sleep yesterday and I have something almost there now. :')

@pmrv
Contributor

pmrv commented Aug 30, 2024

The main changes to h5io are here. I won't be unhappy if you want to take a look and take it from there, but it would then need strategic placement of the context manager around pyiron_base and maybe even the Sphinx class. When I use the context manager in Sphinx's to_hdf/from_hdf/collect_output/set_input_read_only, I can save most of the file opens that you reported.
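
For illustration, such strategic placement could look roughly like the following, assuming the hypothetical hdf_leave_open sketched above; the import path and the project_hdf5.file_name attribute are assumptions about where the proposed API would live:

from pyiron_base.storage.hdfio import hdf_leave_open  # proposed API, not merged

def collect_output_cached(job):
    """Run a job's collect_output with its hdf5 file kept open once."""
    with hdf_leave_open(job.project_hdf5.file_name):
        # every nested _open_hdf call inside collect_output now reuses
        # the cached handle instead of reopening the file
        job.collect_output()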

@samwaseda
Member

Ok, I’m gonna try to work on it by Monday, just in order to make it more complicated.
