Multi-Node-Multi-GPU Tutorial #8071
Conversation
@akihironitta @rusty1s lmk if anything else is needed to merge.
Is it only me, or is this a very odd way of using Slurm? I have a student who was very excited about this "official" tutorial for PyG, and it confused him (and me, to a lesser degree) a lot. Specifically:
Maybe @puririshi98 can help me out on these points, and I can let my student prepare a more generic and Slurm-idiomatic version.
Thanks for the feedback. I agree that the tutorial is missing some information in this respect. It would be great to get your student and @puririshi98 to update the confusing parts.
@flxmr thank you for your concerns. I will reply soon with more details after discussing internally at NVIDIA. In general, this tutorial was based on what works on NVIDIA clusters with our NVIDIA container. After I follow up with detailed answers to each of your points, I would be happy to discuss ways forward.
Yes, we will prepare something in the coming week. Our own cluster doesn't have any container integration, but I found a university-run array of DGX systems that does, so we can also integrate the container instructions.
I will make a quick PR for some cleanups:
(Note: we don't use GRES at NVIDIA, but if you do, you'd need to add `--gpus-per-node` to your launch script also for …
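For context, a minimal sketch of what such a launch script might look like on a GRES-managed cluster. The node/GPU counts and the training-script name are placeholders, not from the tutorial, and whether you need `--gres`, `--gpus-per-node`, or both depends on how your cluster's Slurm is configured:

```bash
#!/bin/bash
#SBATCH --nodes=2                # two nodes (placeholder)
#SBATCH --ntasks-per-node=1      # one launcher task per node
#SBATCH --gres=gpu:4             # GRES request: 4 GPUs on each node
#SBATCH --gpus-per-node=4        # the extra flag mentioned above

# srun starts one task per node; each task then spawns one process
# per GPU itself (hypothetical script name).
srun python multinode_training.py
```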
#8292
@flxmr, curious what additional changes you would like to see after my PR gets merged?
So, I was on vacation. I also wonder on which platform with PyG and CUDA …
address comments from discussion at end of #8071
So, in response to my remarks in #8071, I have now prepared this PR to update the multi-node documentation (sadly no student contribution; they can prepare multi-GPU metrics instead).

Reasoning:
- If PyG has a tutorial on DDP, it is for people who essentially grow within PyG into this use case. It should reflect that by telling them more about the background and being explicit about what is done (like `torch.multiprocessing` injecting the rank; it seems very random otherwise, see the sketch after this description).
- In the same vein, the multi-node tutorial should evolve from the single-node one, IMHO.
- Embedding the pyxis container as the only way to do things is good for NVIDIA, but bad for people who arrive on a system without it.

Things I skipped:
- Setting up the worker count. I would say people should just figure this out on their own; `os.sched_getaffinity` is cool, but this is a very general problem, and I essentially do it manually now.
- Anything GRES-related. I doubt that anyone growing into this and installing Slurm on their own machines will fail to grok it. Every multi-user, public research system has GRES, I'd say (and everyone has their internal docs).

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <[email protected]>
Co-authored-by: Rishi Puri <[email protected]>
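To illustrate the rank-injection point, a minimal sketch of how `torch.multiprocessing` hands each worker its rank (function and variable names are illustrative, not from the tutorial):

```python
import torch
import torch.multiprocessing as mp


def run(rank: int, world_size: int):
    # mp.spawn passes the process index as the first positional
    # argument; this is the "injected" rank that looks arbitrary
    # if the tutorial never spells it out.
    print(f"running rank {rank} of {world_size}")


if __name__ == "__main__":
    # One worker per local GPU; falling back to 1 on CPU-only
    # machines. Note that os.sched_getaffinity(0) would give the
    # CPU set instead, which is the more general worker-count
    # problem mentioned above.
    world_size = torch.cuda.device_count() or 1
    mp.spawn(run, args=(world_size,), nprocs=world_size)
```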
ready for review