Multi-Node-Multi-GPU Tutorial #8071
Conversation
@akihironitta @rusty1s lmk if anything else is needed to merge.
Is it only me, or is this a very odd way of using Slurm? I have a student who was very excited about this "official" tutorial for PyG, and it confused him (and me, to a lesser degree) a lot. Specifically:
Maybe @puririshi98 can help me out on these points, and I can let my student prepare a more generic and Slurm-idiomatic version.
Thanks for the feedback. I agree that the tutorial is missing some information in this respect. It would be great to get your student and @puririshi98 to update the confusing parts.
@flxmr thank you for your concerns. I will reply soon with more details after discussing internally at NVIDIA. In general, this tutorial was based on what works on NVIDIA clusters with our NVIDIA container. After I follow up with detailed answers to each of your points, I would be happy to discuss ways forward.
Yes, we will prepare something in the coming week. Our own cluster doesn't have any container integration, but I found a university-run array of DGX systems that does, so we can also integrate the container instructions.
I will make a quick PR for some cleanups:
(Note: we don't use GRES at NVIDIA, but if you do, you'd need to add `--gpus-per-node` to your launch script also for …
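For context, a minimal sketch of what such a launch script might look like on a GRES-managed cluster. The node/GPU counts and the training-script name are placeholders, not from the tutorial, and whether you need `--gres`, `--gpus-per-node`, or both depends on how your cluster's Slurm is configured:

```bash
#!/bin/bash
#SBATCH --nodes=2                # two nodes (placeholder)
#SBATCH --ntasks-per-node=1      # one launcher task per node
#SBATCH --gres=gpu:4             # GRES request: 4 GPUs on each node
#SBATCH --gpus-per-node=4        # the extra flag mentioned above

# srun starts one task per node; each task then spawns one process
# per GPU itself (hypothetical script name).
srun python multinode_training.py
```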
#8292
@flxmr, curious what additional changes you would like to see after my PR gets merged?
So, I was on vacation. I also wonder on which platform with PyG and CUDA …
address comments from discussion at end of #8071
So, in response to my remarks in #8071, I have now prepared this PR to update the multi-node documentation (sadly no student contribution; they can prepare multi-GPU metrics instead).

Reasoning:
- If PyG has a tutorial on DDP, it is for people who essentially grow within PyG into this use case. It should reflect that by telling them more about the background and being explicit about what is done (like `torch.multiprocessing` injecting the rank; it seems very random otherwise, see the sketch after this description).
- In the same vein, the multi-node tutorial should evolve from the single-node one, IMHO.
- Embedding the pyxis container as the only way to do things is good for NVIDIA, but bad for people who arrive on a system without it.

Things I skipped:
- Setting up the worker count. I would say people should just figure this out on their own; `os.sched_getaffinity` is cool, but this is a very general problem, and I essentially do it manually now.
- Anything GRES-related. I doubt that anyone growing into this and installing Slurm on their own machines will fail to grok it. Every multi-user, public research system has GRES, I'd say (and everyone has their internal docs).

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Akihiro Nitta <[email protected]>
Co-authored-by: Rishi Puri <[email protected]>
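To illustrate the rank-injection point, a minimal sketch of how `torch.multiprocessing` hands each worker its rank (function and variable names are illustrative, not from the tutorial):

```python
import torch
import torch.multiprocessing as mp


def run(rank: int, world_size: int):
    # mp.spawn passes the process index as the first positional
    # argument; this is the "injected" rank that looks arbitrary
    # if the tutorial never spells it out.
    print(f"running rank {rank} of {world_size}")


if __name__ == "__main__":
    # One worker per local GPU; falling back to 1 on CPU-only
    # machines. Note that os.sched_getaffinity(0) would give the
    # CPU set instead, which is the more general worker-count
    # problem mentioned above.
    world_size = torch.cuda.device_count() or 1
    mp.spawn(run, args=(world_size,), nprocs=world_size)
```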
ready for review