Prepare for version 2.0 (#406)
crusaderky authored Dec 5, 2024
1 parent 2f979bb commit 7587b7e
Showing 2 changed files with 41 additions and 8 deletions.
33 changes: 32 additions & 1 deletion docs/changelog.md
@@ -1,7 +1,38 @@
Versioned HDF5 Change Log
=========================

## 1.8.1 (2024-11-21)
## 2.0.0 (2024-12-05)

### Major Changes

- `stage_dataset` has been reimplemented from scratch. The new engine is
expected to be much faster in most cases. [Read more here](staged_changes).
- `__getitem__` on staged datasets used to never cache data when reading from
unmodified datasets (before the first call to `__setitem__` or `resize()`), but
cached the whole loaded area on modified datasets (i.e. after the user had
changed even a single point anywhere within the same staged version).

This has now been changed to always use the libhdf5 cache. Because this cache is very
small by default, users on slow disk backends may observe a slowdown in
read-update-write use cases that don't overwrite whole chunks, e.g. `ds[::2] += 1`.
They should experiment with sizing the libhdf5 cache so that it's larger than the
work area, e.g.:

```python
import h5py
from versioned_hdf5 import VersionedHDF5File

# rdcc_nbytes: chunk cache size in bytes; rdcc_nslots: number of cache hash slots
with h5py.File(path, "r+", rdcc_nbytes=2**30, rdcc_nslots=100_000) as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version("r123") as sv:
        sv["some_ds"][::2] += 1
```

(This recommendation applies to plain h5py datasets too.)

Note that this change exclusively impacts `stage_dataset`; `current_version`,
`get_version_by_name`, and `get_version_by_timestamp` are not impacted and
continue not to cache anything regardless of libhdf5 cache size.
- Added support for Ellipsis (...) in indexing.
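
  A minimal sketch of the new Ellipsis support (the file path, dataset name, and
  version label below are illustrative):

```python
import numpy as np
import h5py
from versioned_hdf5 import VersionedHDF5File

# Assumes "example.h5" already contains a 3-dimensional dataset "some_ds".
with h5py.File("example.h5", "r+") as f:
    vf = VersionedHDF5File(f)
    with vf.stage_version("r124") as sv:
        ds = sv["some_ds"]
        # Ellipsis expands to as many full slices as needed, so for a
        # 3-dimensional dataset these two selections are equivalent:
        assert np.array_equal(ds[..., 0], ds[:, :, 0])
```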

## 1.8.2 (2024-11-21)

### Major Changes

16 changes: 9 additions & 7 deletions docs/staged_changes.rst
@@ -101,7 +101,7 @@ The destination array (always an actual ``numpy.ndarray``) can be either:
Plans
-----
To encapsulate the complex decision-making logic of the ``StagedChangesArray`` methods,
the actual methods of the class are designed as a fairly dumb wrappers which
the actual methods of the class are designed as fairly dumb wrappers which

1. create a ``*Plan`` class with all the information needed to execute the operation
(``GetItemPlan`` for ``__getitem__()``, ``SetItemPlan`` for ``__setitem__()``, etc.);
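
For illustration, a toy sketch of this plan/execute split follows; apart from
``GetItemPlan``, every name and attribute in it is invented for the example and does
not mirror the real classes:

```python
from dataclasses import dataclass
from typing import Any

import numpy as np


@dataclass
class GetItemPlan:
    """Toy stand-in: everything needed to execute __getitem__, decided up front."""
    idx: Any             # the user-supplied index
    output_shape: tuple  # shape of the array __getitem__ will return


class ToyStagedArray:
    """Toy stand-in illustrating how a dumb wrapper method delegates to its plan."""
    def __init__(self, data: np.ndarray):
        self._data = data

    def _plan_getitem(self, idx) -> GetItemPlan:
        # Pure decision-making: probe the output shape on a throwaway array;
        # the real data is never read here.
        output_shape = np.empty(self._data.shape, dtype=bool)[idx].shape
        return GetItemPlan(idx=idx, output_shape=output_shape)

    def __getitem__(self, idx):
        # 1. Build the plan with all the information needed for the operation...
        plan = self._plan_getitem(idx)
        # ...2. then execute it; data is only copied in this second step.
        out = np.empty(plan.output_shape, dtype=self._data.dtype)
        out[...] = self._data[plan.idx]
        return out


arr = ToyStagedArray(np.arange(12).reshape(3, 4))
print(arr[1:, ::2])  # rows 1-2, columns 0 and 2 -> [[4 6], [8 10]]
```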
@@ -176,16 +176,16 @@ All plans share a similar workflow:
5. Sort by ``slab_indices`` and partition along them. This is to break the rest of the
algorithm into separate calls to ``read_many_slices()``, one per pair of source and
destination slabs. Note that a transfer operation is always from N slabs to 1 slab
or to the ``__getitem__`` return value, of from 1 slab or the ``__setitem__`` value
parameter to N slabs, and that ``slab_indices`` can mean either source or destination
or to the ``__getitem__`` return value, or from 1 slab or the ``__setitem__`` value
parameter to N slabs, and that the slab index can mean either source or destination
depending on context.

6. For each *(chunk index, slab index, slab offset)* triplet from the above lists, query
the ``IndexChunkMapper`` objects again, independently for each axis, to convert the global
n-dimensional index of points that was originally provided by the user to a local
index that only impacts the chunk. For each axis, this will return:

- exactly one 1-dimensional slice pair, in case of basic indices (scalars of slices);
- exactly one 1-dimensional slice pair, in case of basic indices (scalars or slices);
- one or more 1-dimensional slice pairs, in case of advanced indices (arrays of
indices or arrays of bools).
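
For the basic-index case (a slice with a positive step), a simplified stand-in for
such a per-axis conversion could look like the helper below; it is invented for
illustration and is far less general than the real ``IndexChunkMapper``:

```python
def chunk_local_slices(global_slice: slice, chunk_idx: int, chunk_size: int, axis_len: int):
    """Convert a global slice along one axis into a (chunk-local, output) slice pair
    for one chunk, or None if the selection does not touch the chunk."""
    start, stop, step = global_slice.indices(axis_len)  # assumes step > 0
    chunk_start = chunk_idx * chunk_size
    chunk_stop = min(chunk_start + chunk_size, axis_len)
    # First selected point that falls inside this chunk
    first = start if start >= chunk_start else start + -(-(chunk_start - start) // step) * step
    last_excl = min(stop, chunk_stop)
    if first >= last_excl:
        return None
    n_before = (first - start) // step          # selected points in earlier chunks
    n_inside = -(-(last_excl - first) // step)  # selected points in this chunk
    src = slice(first - chunk_start, last_excl - chunk_start, step)  # within the chunk
    dst = slice(n_before, n_before + n_inside)                       # within the output
    return src, dst


# The selection [::2] on an axis of 10 points split into chunks of 4:
# chunk 1 holds global points 4..7; the selection picks local points 0 and 2,
# which land at positions 2 and 3 of the output.
print(chunk_local_slices(slice(None, None, 2), chunk_idx=1, chunk_size=4, axis_len=10))
# -> (slice(0, 4, 2), slice(2, 4))
```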

@@ -216,7 +216,7 @@ and then transfers from each slab into it.
**There is no cache on read**: calling the same index twice will result in two separate
reads to the base slabs, which typically translates to two calls to
``h5py.Dataset.__getitem__`` and two disk accesses. However, note that the HDF5 C
library does perform *some* caching of its own.
library features its own caching, configurable via ``rdcc_nbytes`` and ``rdcc_nslots``.

For this reason, this method never modifies the state of the ``StagedChangesArray``.

@@ -274,8 +274,10 @@ The ``SetItemPlan`` thus runs the general algorithm twice:
1. With a mask that picks the chunks that lie either on full or base slabs, intersected
with the mask of partially selected chunks. These chunks are moved to the staged
slabs.
2. Without any mask, as now all chunks lie on staged slabs. These chunks are copied from
the ``__setitem__`` value parameter.
2. Without any mask, as now all chunks either lie on staged slabs or are wholly selected
by the update; in the latter case ``__setitem__`` creates a new slab with ``numpy.empty``
and appends it to ``StagedChangesArray.slabs``.
The updated surfaces are then copied from the ``__setitem__`` value parameter.
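
A conceptual sketch of these two passes (the boolean chunk masks and the helper
below are invented for illustration and are not the library's API):

```python
import numpy as np


def toy_setitem_passes(on_base_or_full, selected, partially_selected):
    """Return which chunks end up on staged slabs after a __setitem__."""
    staged = ~on_base_or_full

    # Pass 1: chunks still on base/full slabs that are only partially overwritten
    # must first be copied onto staged slabs, because their old contents survive.
    staged |= on_base_or_full & partially_selected

    # Wholly selected chunks never need their old contents: a fresh staged slab
    # created with numpy.empty is appended and will be completely overwritten.
    staged |= selected & ~partially_selected

    # Pass 2: every selected chunk now lies on a staged slab, so the updated
    # surfaces can be copied straight from the __setitem__ value parameter.
    assert (staged | ~selected).all()
    return staged


print(toy_setitem_passes(
    on_base_or_full=np.array([True, False, True, True]),
    selected=np.array([True, True, True, False]),
    partially_selected=np.array([True, False, False, False]),
))  # -> [ True  True  True False]
```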


``resize()`` algorithm
