Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net_ib: return ncclSuccess if read roceTypePath failed and errno is E… #1427

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

limu713
Copy link

@limu713 limu713 commented Sep 5, 2024

…INVAL. #890

Please refer to the problem details:#890

@sjeaugey
Copy link
Member

sjeaugey commented Sep 5, 2024

I don't see how this is related to issue 890. Did you reference the wrong bug ID?

@limu713
Copy link
Author

limu713 commented Sep 9, 2024

I don't see how this is related to issue 890. Did you reference the wrong bug ID?

actually, it is the same question.
"we are using macvlan to virtualize the rdma network card in the container scenario. When a container is inserted into multiple macvlans, we encounter the problem of inconsistent nccl IB GID, which limits the way our rdma network card is used."
One rdma network card has multipe sub interfaces and each sub interface has different gid index.
for example,
container A has the first sub interface with gid index 7, and container B has the second sub interface with gid index 11,
then you can not run miprun with specified parameter of NCCL_IB_GID_INDEX.
issue 890 wants NCCL_IB_GID_INDEX to support array, but with this patch, you can ignore NCCL_IB_GID_INDEX.
It will automatically select the correct gid index.

@sjeaugey
Copy link
Member

sjeaugey commented Sep 9, 2024

NCCL already has the needed code to automatically select the correct GID index.

Your patch is not adding that functionality, it's somehow changing the behavior in case of an error, but it's not clear why that error case would happen and why you'd want to do that. So, at best it's a workaround for a problem; and we'd want to justify why we need that workaround rather than fixing the error itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants