Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multi-Node-Multi-GPU Tutorial #8071

Merged
merged 44 commits into from
Oct 24, 2023
Merged

Multi-Node-Multi-GPU Tutorial #8071

merged 44 commits into from
Oct 24, 2023

Conversation

puririshi98
Copy link
Contributor

@puririshi98 puririshi98 commented Sep 22, 2023

ready for review

@puririshi98 puririshi98 changed the title Draft: Multi-Node-Multi-GPU Tutorial Multi-Node-Multi-GPU Tutorial Sep 25, 2023
@puririshi98
Copy link
Contributor Author

@akihironitta @rusty1s lmk if anything else needed to merge

CHANGELOG.md Outdated Show resolved Hide resolved
docs/source/tutorial/multi_gpu_vanilla.rst Outdated Show resolved Hide resolved
docs/source/tutorial/multi_node_multi_gpu_tutorial.rst Outdated Show resolved Hide resolved
docs/source/tutorial/multi_node_multi_gpu_tutorial.rst Outdated Show resolved Hide resolved
@rusty1s rusty1s enabled auto-merge (squash) October 24, 2023 05:22
@rusty1s rusty1s merged commit 3cf9d77 into master Oct 24, 2023
12 checks passed
@rusty1s rusty1s deleted the multinode_multigpu_tutorial branch October 24, 2023 05:22
@flxmr
Copy link
Contributor

flxmr commented Oct 25, 2023

Is it only me or is this a very weird style to use Slurm? I got a student being very excited about this "official" tutorial for pyg and it confused him (and me to a lesser degree) a lot.

Specifically:

  1. your SLURM-cluster seems to not have their GPUs managed by GRES? That's a highly unusual setup for anything a random person might find in their slurm setup?
  2. you need environment variables set (e.g. RANK). Probably your container does this?
  3. why would you tie the programming example to the very weird way of using Slurm (an example should use sbatch I'd say)?
  4. the get_num_workers function is also very weird...
    num_workers = len(os.sched_getaffinity(0)) // (2 * world_size)
    You get the number of CPUs in your slurm-task and then divide that by 2x the world_size (aka the total number of tasks you have on multiple (or one, refreshing on the process groups now) nodes). This means that without any slurm settings of your installation this is 0 (--cpus-per-task < 2*world_size)? What's the point, especially for an example?
  5. also: why would you use multiple process groups (and very convoluted code) to just get the local rank... (which SLURM_LOCALID should hold just fine on most systems?)

Maybe @puririshi98 can help me out on this points and I can let my student prep a more generic and slurmy version.

@rusty1s
Copy link
Member

rusty1s commented Oct 26, 2023

Thanks for the feedback. I agree that the tutorial is missing out on some information in this aspect. It would be great to get your student and @puririshi98 to update the confusing parts.

@puririshi98
Copy link
Contributor Author

puririshi98 commented Oct 26, 2023

@flxmr thank you for your concerns. I will reply back soon w/ more details after discussing internally at NVIDIA. In general this was based on what works on NVIDIA clusters w/ our NVIDIA container. After I follow up w/ detailed answers to each of your points, I would be happy to discuss ways forward.

@flxmr
Copy link
Contributor

flxmr commented Oct 27, 2023

Yes, we will prepare something the upcoming week. Our own cluster doesn't have any container-integration, but I found an university-available array of DGX-systems which have, so we can also integrate the container instructions.

@puririshi98
Copy link
Contributor Author

puririshi98 commented Oct 30, 2023

I will make a quick PR for some cleanups:

  1. I will remove get_local process group and create_local_process_gropu functions for simplicity and use int(os.environ['LOCAL_RANK'])) for simplicity. These functions were being used in an example for more complicated multi-node setups.
  2. I will remove the line that sets env variable NVSHMEM_SYMMETRIC_SIZE, again this was for different usage. this is for using nvshmem for feature store
  3. update the suggested slurm command to:
    srun -l -N<num_nodes> --ntasks-per-node=<ngpu_per_node>
    --container-name=cont --container-image=<image_url> --container-mounts=/ogb-papers100m/:/workspace/dataset
    python3 path_to_script.py
  4. I'll remove --ngpu-per-node as well since I will access env by LOCAL_RANK.

(note we dont use GRES at nvidia but if you do, you'd need to add --gpus-per-node to your launch script

also for get_num_workers function please suggest a better alternative hueristic, I am just using a hueristic that has worked for me on other examples and has given me good perf but im very open to suggestions

@puririshi98
Copy link
Contributor Author

#8292
here it is

@puririshi98
Copy link
Contributor Author

@flxmr, curious what additional changes would you like to see after my PR gets merged?

@flxmr
Copy link
Contributor

flxmr commented Nov 9, 2023

So was in vacation. The get_num_workers-heuristic was fine I'd argue, but adding the world_size to the equation (you didn't do that - it was the renaming action from one to 2 examples) isn't (for SLURM and multinode usage).
os.sched_getaffinity returns the set of cpus in the PID's cpuset (at least it was called like that in cgroups v1 I think). For a normal process outside SLURM this is the number of cpus in the system and you want to divide this by the numbers of training processes (which is the world-size) to get available cpus per process (and then divide by 2 for the dataloader internal heuristic...). Within SLURM this is --cpus-per-task and thus you should't include world-size in the division.

I also wonder, on which platform with pyg and CUDA os.sched_getaffinity is not available and then you have to fallback to os.cpu_count...

rusty1s pushed a commit that referenced this pull request Nov 14, 2023
address comments from discussion at end of
#8071
puririshi98 added a commit that referenced this pull request Nov 16, 2023
So, in response to my remarks in #8071 I now prepared this PR for
updating the multi-node documentation (sadly no student contrib, they
can prep multi-gpu metrics).

Reasoning:
- if pyg has a tutorial on DDP it's for people who essentially grow
within pyg to this usecase. This should be reflected in telling them
more about this and be clear on what is done (like
`torch.multiprocessing` injecting the rank. it seems very random
otherwise).
- in the same vein, the multi-node tutorial should evolve from the
single-node one imho.
- embedding the pyxis-container as the only way to do things is good for
Nvidia, but bad for people who arrive on a system without that.

Things I skipped:
- setting up the worker-count. I would say people should just figure
this out on their own. the `os.sched_getaffinity` is cool, but this is a
very general problem too and I essentially do it manually now.
- anything GRES-related. I doubt that anyone growing into this and
installing slurm on their computers will not grok this. Every
multi-user, public research system has GRES I'd say (and everyone has
their internal docs).

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <[email protected]>
Co-authored-by: Rishi Puri <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants