Skip to content
This repository has been archived by the owner on Jul 1, 2023. It is now read-only.

Select ibv device who has active port_state. #456

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

SolenoidWGT
Copy link

If the deviceList contains multiple ibv devices, we want to select the device of the port whose port_state is active, instead of just selecting the first device in the deviceList by default. This is very useful. If we choose the first device without checking, it is likely that the IB runtime can be initialized successfully, but some weird errors will be reported in the ibv_post_send stage. At this time, it is difficult to determine the reason for the error is that we chose a wrong ibv device.

This PR is to fix #455.

If the deviceList contains multiple ibv devices, we want to select the
device of the port whose port_state is active, instead of just selecting
the first device in the deviceList by default. This is very useful. If we
choose the first device without checking, it is likely that the IB runtime
can be initialized successfully, but some weird errors will be reported
in the ibv_post_send stage. At this time, it is difficult to determine the
reason for the error is that we chose a wrong ibv device.
@@ -41,7 +41,8 @@ namespace tensorpipe {
_(query_gid, int, (IbvLib::context*, uint8_t, int, IbvLib::gid*)) \
_(query_port, int, (IbvLib::context*, uint8_t, IbvLib::port_attr*)) \
_(reg_mr, IbvLib::mr*, (IbvLib::pd*, void*, size_t, int)) \
_(wc_status_str, const char*, (IbvLib::wc_status))
_(wc_status_str, const char*, (IbvLib::wc_status)) \
_(port_state_str, const char*, (IbvLib::port_state))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: could you keep this list sorted alphabetically?

// device of the port whose port_state is active, instead of just selecting
// the first device in the deviceList by default.
for (int i = 0; i < deviceList.size(); i++) {
IbvContext tp_ctx_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our naming convention is camelCase. Also, a trailing underscore means that a name is a private class member, whereas this is just a local variable. Could you just name this ctx?

}
}

TP_THROW_ASSERT_IF(found == false) << "Unable to find available ibv device";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can't find any usable devices we shouldn't consider it an error (and crash the program), instead we should just disable the ibv transport. The logic to do so happens here:

if (deviceList.size() == 0) {
TP_VLOG(7) << "IBV transport is not viable because it couldn't find any "
<< "InfiniBand NICs";
return nullptr;
}

Could you move your code to that file? You will probably need to change the constructor of the Reactor class so that it takes a IbvContext object, instead of an IbvDeviceList.

std::memset(&portAttr, 0, sizeof(portAttr));
tp_ctx_ = createIbvContext(getIbvLib(), deviceList[i]);
TP_CHECK_IBV_INT(ibvLib.query_port(tp_ctx_.get(), kPortNum, &portAttr));
if (portAttr.state == IbvLib::port_state::PORT_ACTIVE) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

port_state is just an enum, not an enum class, hence its values should be accessed just as IbvLib::PORT_ACTIVE.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error: "transport retry counter exceeded" when torch.distributed.rpc.init_rpc between different pods in k8s
3 participants