
Optimize _recreate_raw_dataset to use less memory; add versions method #300

Merged

Conversation

peytondmurray (Collaborator)

This PR optimizes a part of `_recreate_raw_dataset` that can currently allocate significant memory for datasets with large numbers of versions. I also saw a large speedup in this section of the code post-refactor.

A `VersionedHDF5File.versions` method was added (a `VersionedHDF5File` should know what versions it has). It returns the same thing as `versioned_hdf5.versions.all_versions` but is much less verbose and more convenient.
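The convenience accessor can be sketched roughly as below. This is an illustrative stand-in, not the PR's actual code: the toy `VersionedFile` class and its internal `_versions` attribute are hypothetical, standing in for `VersionedHDF5File`, whose version names live in the backing HDF5 file.

```python
class VersionedFile:
    """Toy stand-in for VersionedHDF5File (hypothetical sketch).

    In the real library the version names come from the versioned
    HDF5 file itself; here they are just passed in for illustration.
    """

    def __init__(self, version_names):
        self._versions = list(version_names)

    @property
    def versions(self):
        # Same information as iterating versioned_hdf5.versions.all_versions(f),
        # but available directly on the file object, one attribute away.
        return list(self._versions)


vf = VersionedFile(["r0", "r1", "r2"])
print(vf.versions)  # ['r0', 'r1', 'r2']
```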

Partially addresses #298. Deleting datasets can still be costly in terms of memory because the operation requires:

  1. Reconstructing the raw dataset, which entails keeping track of the raw chunks that need to be kept
  2. Reconstructing the hashtable, which requires loading the inverse of the old hashtable as a new one gets built
  3. Recreating the virtual datasets for the versions of the dataset we are keeping
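Step 1 above, tracking which raw chunks must survive, can be sketched as follows. This is a rough illustration under assumed data shapes, not the PR's implementation: the function name `chunks_to_keep` and the slice-tuple representation of chunks are hypothetical.

```python
def chunks_to_keep(virtual_sources_by_version, versions_to_keep):
    """Collect the raw-dataset chunk slices still referenced by surviving versions.

    Using a set of slice tuples keeps memory proportional to the number of
    distinct chunks referenced, rather than allocating per-version arrays.
    """
    keep = set()
    for version, sources in virtual_sources_by_version.items():
        if version not in versions_to_keep:
            continue
        for raw_slice in sources:
            # raw_slice identifies one chunk in the raw dataset,
            # here modeled as a (start, stop) tuple.
            keep.add(raw_slice)
    return keep


# Two versions share the chunk (10, 20); deleting "r0" keeps only
# the chunks that "r1" still references.
sources = {
    "r0": [(0, 10), (10, 20)],
    "r1": [(10, 20), (20, 30)],
}
print(sorted(chunks_to_keep(sources, {"r1"})))  # [(10, 20), (20, 30)]
```

Chunks referenced by no surviving version can then be dropped when the raw dataset is rewritten.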

@peytondmurray peytondmurray requested a review from ArvidJB January 8, 2024 21:14
@peytondmurray (Collaborator, Author)

This change is low risk; merging now.

@peytondmurray peytondmurray merged commit 3832979 into deshaw:master Jan 29, 2024
7 checks passed
@peytondmurray peytondmurray deleted the optimize-recreate-raw-dataset branch January 29, 2024 01:56