
fix: Correct tensor size calculation in TensorSerializer #65

Merged 1 commit on Nov 15, 2023

Conversation

harubaru
Contributor

This PR replaces the original tensor size calculation for bulk writing, which summed each tensor's UntypedStorage size, with an approach that calculates the size of each tensor from its number of elements multiplied by the size of each element.

The issue with the original tensor size calculation was that it did not account for views or shared storage. For example, Megatron attaches a main_grad tensor to each parameter; each one is a view into a larger buffer, so all of the main_grad objects share the same storage object. With the previous method on a 6B model, the size calculation returned ~6GiB for each main_grad tensor, despite the actual tensor being a fraction of that size. In total, it attempted to write 1.3TiB of data for the main_grad tensors alone.

The new method calculates the tensor size from the actual number of elements (tensor.nelement()) multiplied by the size of each element (tensor.element_size()), which avoids the over-allocation issue by never deriving the size from a storage that may be shared with other tensors.
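A minimal sketch of the discrepancy the PR describes, assuming the PyTorch ≥2.0 `untyped_storage()` API; the buffer and view sizes here are arbitrary illustrative values, not taken from the PR:

```python
import torch

# A large float32 buffer whose storage is shared by a small view,
# mimicking the Megatron main_grad layout described above.
buffer = torch.zeros(1024)   # 1024 elements * 4 bytes = 4096 bytes of storage
view = buffer[:16]           # a 16-element view into that shared storage

# Old approach: size of the backing storage, which counts the whole buffer.
storage_size = view.untyped_storage().nbytes()

# New approach: only the view's own elements.
element_size = view.nelement() * view.element_size()

print(storage_size)  # 4096 -- over-counts by including the shared buffer
print(element_size)  # 64   -- the view's actual payload
```

Summing `storage_size` over many views of the same buffer multiplies the over-count, which is how a 6B model's main_grad tensors ballooned to 1.3TiB; summing `element_size` does not.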

@harubaru harubaru requested review from wbrown and Eta0 November 15, 2023 01:17
@wbrown wbrown merged commit 94db1fd into main Nov 15, 2023
4 checks passed