-
Notifications
You must be signed in to change notification settings - Fork 76
Select ibv device who has active port_state. #456
base: main
Are you sure you want to change the base?
Conversation
If the deviceList contains multiple ibv devices, we want to select the device of the port whose port_state is active, instead of just selecting the first device in the deviceList by default. This is very useful. If we choose the first device without checking, it is likely that the IB runtime can be initialized successfully, but some weird errors will be reported in the ibv_post_send stage. At this time, it is difficult to determine the reason for the error is that we chose a wrong ibv device.
@@ -41,7 +41,8 @@ namespace tensorpipe { | |||
_(query_gid, int, (IbvLib::context*, uint8_t, int, IbvLib::gid*)) \ | |||
_(query_port, int, (IbvLib::context*, uint8_t, IbvLib::port_attr*)) \ | |||
_(reg_mr, IbvLib::mr*, (IbvLib::pd*, void*, size_t, int)) \ | |||
_(wc_status_str, const char*, (IbvLib::wc_status)) | |||
_(wc_status_str, const char*, (IbvLib::wc_status)) \ | |||
_(port_state_str, const char*, (IbvLib::port_state)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: could you keep this list sorted alphabetically?
// device of the port whose port_state is active, instead of just selecting | ||
// the first device in the deviceList by default. | ||
for (int i = 0; i < deviceList.size(); i++) { | ||
IbvContext tp_ctx_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our naming convention is camelCase
. Also, a trailing underscore means that a name is a private class member, whereas this is just a local variable. Could you just name this ctx
?
} | ||
} | ||
|
||
TP_THROW_ASSERT_IF(found == false) << "Unable to find available ibv device"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we can't find any usable devices we shouldn't consider it an error (and crash the program), instead we should just disable the ibv transport. The logic to do so happens here:
tensorpipe/tensorpipe/transport/ibv/context_impl.cc
Lines 58 to 62 in bb1473a
if (deviceList.size() == 0) { | |
TP_VLOG(7) << "IBV transport is not viable because it couldn't find any " | |
<< "InfiniBand NICs"; | |
return nullptr; | |
} |
Could you move your code to that file? You will probably need to change the constructor of the Reactor class so that it takes a IbvContext object, instead of an IbvDeviceList.
std::memset(&portAttr, 0, sizeof(portAttr)); | ||
tp_ctx_ = createIbvContext(getIbvLib(), deviceList[i]); | ||
TP_CHECK_IBV_INT(ibvLib.query_port(tp_ctx_.get(), kPortNum, &portAttr)); | ||
if (portAttr.state == IbvLib::port_state::PORT_ACTIVE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
port_state
is just an enum, not an enum class, hence its values should be accessed just as IbvLib::PORT_ACTIVE
.
If the deviceList contains multiple ibv devices, we want to select the device of the port whose port_state is active, instead of just selecting the first device in the deviceList by default. This is very useful. If we choose the first device without checking, it is likely that the IB runtime can be initialized successfully, but some weird errors will be reported in the ibv_post_send stage. At this time, it is difficult to determine the reason for the error is that we chose a wrong ibv device.
This PR is to fix #455.