fix: Correct tensor size calculation in TensorSerializer #65
This PR replaces the original tensor size calculation used for bulk writing, which summed the size of each tensor's `UntypedStorage`, with an approach that derives the size from the number of elements and the size of each element.

The issue with the original calculation was that it did not account for views or shared storages. For example, Megatron gives each parameter a `main_grad` tensor that is a view into a larger buffer, so all of the `main_grad` objects share the same storage object. On a 6B model, the previous method returned ~6GiB for each `main_grad` tensor even though the actual tensor is only a fraction of that size; in total it attempted to write 1.3TiB of data for all of the `main_grad` tensors.

The new method computes the tensor size as the actual number of elements (`tensor.nelement()`) multiplied by the size of each element (`tensor.element_size()`), which avoids the overallocation caused by deriving the size from a storage that may be shared.
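For illustration, here is a minimal sketch (not taken from the TensorSerializer codebase; the tensor names are hypothetical) of why a storage-based size overreports for views, and how the element-based calculation avoids it:

```python
import torch

# A small view into a large shared buffer, similar to Megatron's main_grad views.
buffer = torch.zeros(1_000_000)   # large flat buffer (~4 MB of float32)
main_grad = buffer[:1024]         # view that shares the buffer's storage

# Storage-based size: reports the bytes of the *entire* shared storage.
storage_size = main_grad.untyped_storage().nbytes()            # ~4,000,000 bytes

# Element-based size: reports only the bytes the view actually covers.
actual_size = main_grad.nelement() * main_grad.element_size()  # 4,096 bytes

print(storage_size, actual_size)
```

Summing the element-based size over many such views scales with the data actually written, whereas summing the storage size counts the shared buffer once per view, which is what produced the 1.3TiB estimate described above.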