Hello,

I am trying to run a multinode interactive job with:

In consultation with the local HPC support, I am using:

The modules I am loading are:

When I run the case as:

I get the following initial output:

A lot of setup happens, and then the code crashes at:
Again, I am not sure whether I am missing some environment variables. The same setup works on 1 node (1 GPU) with the same command. I would really appreciate any insight or help. Knowing more about the nature of this error, and whether it is a setup problem, would also help me communicate with the local HPC support. Thank you very much for your help. With sincere regards,
-
Based on the location, it seems to fail during gs-setup, where it tests GPU-aware MPI.
You might need to ask the support team which MPI implementations on the cluster support GPU-aware MPI and how to enable it.
For example, MPICH might need:
MPICH_GPU_SUPPORT_ENABLED=1
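MPICH is only one possibility. If the cluster's MPI is Open MPI instead (an assumption; the thread does not say which MPI is loaded), whether it was built with CUDA-aware support can be queried with `ompi_info`. The sketch below wraps that query in a hypothetical helper, `check_cuda_aware_ompi`, which prints `true`, `false`, or `unknown`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether Open MPI was built with CUDA-aware support.
# Prints "true", "false", or "unknown" (ompi_info not found, or field missing).
check_cuda_aware_ompi() {
  if ! command -v ompi_info >/dev/null 2>&1; then
    echo "unknown"
    return 0
  fi
  # The parsable output contains a line such as:
  #   mca:mpi:base:param:mpi_built_with_cuda_support:value:true
  local line
  line=$(ompi_info --parsable --all 2>/dev/null \
    | grep 'mpi_built_with_cuda_support:value' | head -n 1)
  if [ -z "$line" ]; then
    echo "unknown"
  else
    # Keep only the text after the last colon (the value).
    echo "${line##*:}"
  fi
}

check_cuda_aware_ompi
```

A `true` here only says the library was built with CUDA support; whether it works across the cluster's interconnect is still a question for the local support team.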
Given that you only have a single GPU per node, the NIC may be attached to the CPU, in which case GPU-aware MPI is not needed.
You can turn it off in NekRS with:
export NEKRS_GPU_MPI=0
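The two options above can be combined in a job-script fragment. This is a minimal sketch: the `configure_gpu_mpi` function is a hypothetical name, `MPICH_GPU_SUPPORT_ENABLED` only matters for MPICH builds that honour it, and leaving `NEKRS_GPU_MPI` unset is assumed to keep NekRS at its default (GPU-aware) behaviour, as the gs-setup test suggests. The actual launch command is left to the site's documentation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick between GPU-aware MPI and disabling it in NekRS.
# Usage: configure_gpu_mpi 1   -> try GPU-aware MPI
#        configure_gpu_mpi 0   -> turn GPU-aware MPI off in NekRS
configure_gpu_mpi() {
  local use_gpu_mpi="${1:-0}"
  if [ "$use_gpu_mpi" = "1" ]; then
    # Ask MPICH for GPU support (only for MPICH builds that read this variable).
    export MPICH_GPU_SUPPORT_ENABLED=1
    # Assumption: with the variable unset, NekRS uses its default (GPU-aware) path.
    unset NEKRS_GPU_MPI
  else
    # Turn off GPU-aware MPI in NekRS, as suggested above.
    export NEKRS_GPU_MPI=0
  fi
  echo "NEKRS_GPU_MPI=${NEKRS_GPU_MPI:-unset}"
}
```

For example, calling `configure_gpu_mpi 0` just before the launch line is equivalent to the `export NEKRS_GPU_MPI=0` workaround; with one GPU per node, that is the simpler option to try first.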