Improve multi-node multi-gpu tutorial #8353
Conversation
- now describes usage with an sbatch file
- alternatively, still describes using the pyxis container
- builds upon the single-node multi-GPU example
for more information, see https://pre-commit.ci
I'm not a Slurm expert, but LGTM. I just pushed a few commits; feel free to revert them if you think otherwise.
So, I now tried the pyxis example on a cluster I got access to (also to check whether this is worth implementing for our own cluster)... and it doesn't work (I built my own container, but I don't know where the master address would come from, even in the early-access NGC one...). I will wrap this into an sbatch file too and then do a final version. Maybe @puririshi98 can tell me how this is supposed to work (I installed PyG into this).
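For reference, one common way to get a master address under Slurm is to take the first host of the job's node list. A minimal sketch, not the code from this PR; it assumes `scontrol` is available inside the job, and the port is an arbitrary choice:

```python
import os
import subprocess

# Sketch: pick the first node of the Slurm allocation as the rendezvous host.
# Assumes sbatch/srun set SLURM_JOB_NODELIST and that `scontrol` is on PATH.
nodelist = os.environ["SLURM_JOB_NODELIST"]
hostnames = subprocess.run(
    ["scontrol", "show", "hostnames", nodelist],
    capture_output=True, text=True, check=True,
).stdout.split()

os.environ.setdefault("MASTER_ADDR", hostnames[0])
os.environ.setdefault("MASTER_PORT", "29500")  # arbitrary free port, adjust as needed
```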
I didn't find how to make the […] work. In addition, I noticed while trying this that having multiple processes download the data and then try to unzip it does not work well (it worked previously because the data had already been downloaded). I fixed that too.
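A rough illustration of the single-process-download fix mentioned here, assuming the process group is already initialized and using an arbitrary PyG dataset as a stand-in for whatever the tutorial actually loads:

```python
import torch.distributed as dist
from torch_geometric.datasets import Reddit  # dataset choice is illustrative only

def load_dataset(root: str = "data/Reddit") -> Reddit:
    # Only rank 0 downloads and extracts the raw files; the other ranks wait at the
    # barrier and then read the already-processed data from the shared filesystem.
    if dist.get_rank() == 0:
        Reddit(root)
    dist.barrier()  # with NCCL this assumes each rank has already set its CUDA device
    return Reddit(root)
```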
LGTM, thanks for the improvements :)
Although it looks good to me, we should definitely make sure the containers work, as my original PR was tested and working on a pyxis-enabled NVIDIA cluster using our NVIDIA early-access PyG container.
So, I checked again and it seems to be quite a mess (user→slurm→pyxis→enroot): NVIDIA/pyxis#46 (comment). Maybe this is site-configuration specific: they have it, our HPC center's DGX doesn't. It would be nice if you could check this; then you can revert the doc rewrite (it still needs a single process doing the downloading though!)
@flxmr I will investigate and get back to you as soon as I can
@flxmr I asked around internally with our enroot team.
Does this help? I can follow up again if not.
So, I hope this works now for everyone. I added a link to this issue because I suppose that if our HPC didn't like the hook, others might not like it either.
LGTM now, thanks @flxmr for this great PR
So, in response to my remarks in #8071, I have now prepared this PR to update the multi-node documentation (sadly not a student contribution; they can prepare multi-GPU metrics instead).
Reasoning:

- `torch.multiprocessing` […] injecting the rank (it seems very random otherwise); a sketch of this follows below.

Things I skipped:

- `os.sched_getaffinity` is cool, but this is a very general problem too, and I essentially do it manually now.
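To make the two points above concrete, here is a rough sketch of reading the `srun`-injected rank variables instead of spawning processes via `torch.multiprocessing`, plus a CPU-affinity-based worker count. This is not necessarily the exact code added in the tutorial; the environment variable names are standard Slurm ones, and `MASTER_ADDR`/`MASTER_PORT` are assumed to be set elsewhere (for instance as in the earlier sketch):

```python
import os
import torch
import torch.distributed as dist

# srun starts one task per GPU and injects these variables into each task's environment.
rank = int(os.environ["SLURM_PROCID"])
world_size = int(os.environ["SLURM_NTASKS"])
local_rank = int(os.environ["SLURM_LOCALID"])

torch.cuda.set_device(local_rank)
# MASTER_ADDR/MASTER_PORT must already be set (e.g. derived from the node list).
dist.init_process_group("nccl", rank=rank, world_size=world_size)

# What using os.sched_getaffinity could look like: size DataLoader workers by the
# CPU cores actually assigned to this task (the PR handles this manually instead).
num_workers = len(os.sched_getaffinity(0))
```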