
Optimize _recreate_raw_dataset to use less memory; add versions method #300

Merged

Conversation

peytondmurray (Collaborator)

This PR optimizes a part of `_recreate_raw_dataset` that can currently allocate significant memory for datasets with large numbers of versions. I also saw a large speedup in this section of the code post-refactor.

A `VersionedHDF5File.versions` method was added (a `VersionedHDF5File` should know what versions it has). It returns the same thing as `versioned_hdf5.versions.all_versions` but is much less verbose and more convenient.
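The convenience accessor can be sketched roughly as below. This is an illustrative stand-in, not the PR's actual code: the toy `VersionedFile` class and its internal `_versions` attribute are hypothetical, standing in for `VersionedHDF5File`, whose version names live in the backing HDF5 file.

```python
class VersionedFile:
    """Toy stand-in for VersionedHDF5File (hypothetical sketch).

    In the real library the version names come from the versioned
    HDF5 file itself; here they are just passed in for illustration.
    """

    def __init__(self, version_names):
        self._versions = list(version_names)

    @property
    def versions(self):
        # Same information as iterating versioned_hdf5.versions.all_versions(f),
        # but available directly on the file object, one attribute away.
        return list(self._versions)


vf = VersionedFile(["r0", "r1", "r2"])
print(vf.versions)  # ['r0', 'r1', 'r2']
```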

Partially addresses #298. Deleting datasets can still be costly in terms of memory because the operation requires:

  1. Reconstructing the raw dataset, which entails keeping track of the raw chunks that need to be kept
  2. Reconstructing the hashtable, which requires loading the inverse of the old hashtable as a new one gets built
  3. Recreating the virtual datasets for the versions of the dataset we are keeping
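Step 1 above, tracking which raw chunks must survive, can be sketched as follows. This is a rough illustration under assumed data shapes, not the PR's implementation: the function name `chunks_to_keep` and the slice-tuple representation of chunks are hypothetical.

```python
def chunks_to_keep(virtual_sources_by_version, versions_to_keep):
    """Collect the raw-dataset chunk slices still referenced by surviving versions.

    Using a set of slice tuples keeps memory proportional to the number of
    distinct chunks referenced, rather than allocating per-version arrays.
    """
    keep = set()
    for version, sources in virtual_sources_by_version.items():
        if version not in versions_to_keep:
            continue
        for raw_slice in sources:
            # raw_slice identifies one chunk in the raw dataset,
            # here modeled as a (start, stop) tuple.
            keep.add(raw_slice)
    return keep


# Two versions share the chunk (10, 20); deleting "r0" keeps only
# the chunks that "r1" still references.
sources = {
    "r0": [(0, 10), (10, 20)],
    "r1": [(10, 20), (20, 30)],
}
print(sorted(chunks_to_keep(sources, {"r1"})))  # [(10, 20), (20, 30)]
```

Chunks referenced by no surviving version can then be dropped when the raw dataset is rewritten.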

@peytondmurray peytondmurray requested a review from ArvidJB January 8, 2024 21:14
@peytondmurray (Collaborator, Author)

This change is low risk; merging now.

@peytondmurray peytondmurray merged commit 3832979 into deshaw:master Jan 29, 2024
7 checks passed
@peytondmurray peytondmurray deleted the optimize-recreate-raw-dataset branch January 29, 2024 01:56