Read speeds decrease 2x when reading with fewer processes #195
Comments
I don't have access to a cluster; is there a local method to run this? I'm trying something like:
Edit: So I hacked your file and replaced some large values to run on my machine with num processes = 1
The tensorstore spec looks something like:
Hey Laramie - thanks for taking a look! Unfortunately, I haven't managed to create a smaller repro yet. I'll run with

More generally, do you know of any settings I might need to change to increase the per-process throughput? Or failing that, is there a (possibly hacky) way to have separate independent TensorStore clients within a single process? I suspect there's some kind of per-process limit (threadpool, TCP/IP connections, etc.) that we hit here.
At the tensorstore layer this is using an ocdbt kvstore on top of a file kvstore. Try setting "file_io_concurrency", which defaults to max(4, hardware_concurrency). https://en.cppreference.com/w/cpp/thread/thread/hardware_concurrency

You could also add detailed logging to the file operations via

How many hosts are in your hostfile? And what is the underlying filesystem?
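For illustration, here is a minimal sketch of where a "file_io_concurrency" limit would go in a TensorStore JSON spec with an ocdbt kvstore layered on a file kvstore. The path and the limit of 128 are hypothetical values, the zarr3 driver is assumed from the use_zarr3=True setting used elsewhere in this thread, and the helper name is made up. Orbax builds its own specs internally, so this only shows the spec shape, not how to pass it through Orbax.

```python
import json
import os


def ocdbt_file_spec(path, file_io_limit=None):
    """Build a TensorStore JSON spec dict: zarr3 array, OCDBT kvstore over files.

    `path` and `file_io_limit` are illustrative parameters, not values from
    the issue.
    """
    if file_io_limit is None:
        # Mirror the default described above: max(4, hardware concurrency).
        file_io_limit = max(4, os.cpu_count() or 1)
    return {
        "driver": "zarr3",
        "kvstore": {
            "driver": "ocdbt",
            "base": {"driver": "file", "path": path},
        },
        # Context resources are attached to the spec under "context".
        "context": {
            "file_io_concurrency": {"limit": file_io_limit},
        },
    }


# Hypothetical checkpoint directory and limit, for illustration only.
spec = ocdbt_file_spec("/tmp/checkpoint/", file_io_limit=128)
print(json.dumps(spec, indent=2))
```

Whether a higher limit helps depends on the filesystem; on a distributed filesystem the bottleneck may lie elsewhere.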
There are 64 nodes (it says so in the issue description above). The file system is a distributed file system a la Lustre or VAST. I already tried setting
I don't work on tensorstore directly, but one setting I found helps with loading performance sometimes is the OCDBT target data file size (ocdbt_target_data_file_size in the code below):

    import time

    import orbax.checkpoint as ocp

    # log() is assumed to be a print-like logging helper defined elsewhere.

    def save(state, path, ocdbt_target_file_size: int = 2 * 1024 ** 3):
        start = time.time()
        ocp.PyTreeCheckpointer(use_ocdbt=True, use_zarr3=True).save(
            path, ocp.args.PyTreeSave(
                item=state, ocdbt_target_data_file_size=ocdbt_target_file_size))
        log(f"Saved checkpoint to {path} in {time.time() - start:.2f} sec")

    def load(path, shape_dtype):
        start = time.time()
        state = ocp.PyTreeCheckpointer(use_ocdbt=True, use_zarr3=True).restore(
            path, ocp.args.PyTreeRestore(
                shape_dtype,
                restore_args=ocp.checkpoint_utils.construct_restore_args(shape_dtype),
            ))
        end = time.time()
        log(f"Loaded checkpoint from {path} in {end - start:.2f} sec")
        return state

2 GB is the default, but going smaller might help.
I imagine that a lot of the performance will have to do with specific details about how the filesystem interaction happens. If it's related to

I would be interested to see the output of the tensorstore counters on for the various configs.

Edit: Looking at orbax it appears that

It would be nice to get a pprof of these; is that possible?
Ok, I figured out an inconsistency with our internal build which makes logging hard to use in python. Once I get it added then it will be easier to debug what's going on.
You should now be able to set this environment variable and look at the io timing across runs:
I just submitted a
I have been running a variant of this with my updated multi_read_benchmark. We found some internal tensorstore chunk cache contention, and fixing it may help here. It was alleviated in 5927385.
The issue
Given a specific checkpoint, load it in two different settings:
What I observe:
The checkpoint in question is also written with 512 processes (see below for repro). Except for the number of processes, nothing else changes (sharding etc. stays the same).
To reproduce.
Download this file and run it in a context with 64 nodes, 8 GPUs each. Make sure hostfile has the hostnames of the 64 nodes. (mpirun isn't essential here; it's just a way to spawn these processes.)
To create the checkpoint:
To load the checkpoint with 512 processes:
This takes ~20 sec for me.
To load the checkpoint with 64 processes:
This takes ~40 sec for me.
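To put the two timings in perspective, here is a small back-of-the-envelope sketch. It assumes the whole checkpoint is read exactly once in total and that each process reads an equal share; the checkpoint size is normalized to 1.0 because the actual size is not stated here, and the function name is illustrative.

```python
def relative_throughput(num_procs, seconds, checkpoint_size=1.0):
    """Return (aggregate, per-process) throughput in checkpoint units per second.

    Assumes the checkpoint is read once in total, split evenly across processes.
    """
    aggregate = checkpoint_size / seconds
    per_process = aggregate / num_procs
    return aggregate, per_process


agg_512, per_512 = relative_throughput(512, 20.0)  # ~20 sec with 512 processes
agg_64, per_64 = relative_throughput(64, 40.0)     # ~40 sec with 64 processes

# Aggregate throughput with 512 processes relative to 64 processes.
print(agg_512 / agg_64)
# Per-process throughput with 64 processes relative to 512 processes.
print(per_64 / per_512)
```

Under these assumptions, aggregate throughput halves when going from 512 to 64 processes, while each of the 64 processes still reads about 4x faster than each of the 512 did. That would be consistent with some per-process limit being hit, rather than with a fixed per-process rate.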
The issue doesn't seem to be in Orbax because the same happens with a plain jax.experimental.serialization.async_deserialize.