-
-
Notifications
You must be signed in to change notification settings - Fork 385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] dTLS handshake race condition #628
Comments
Nice analysis. I took a look at the code from I'm convinced that it's better to use a rendezvous channel for Could you help to test the PR branch with your test environment? |
I will test it once I find time and I will get back to you |
Tested your fix on both master and as a cherry-picked commit on top of v0.11.0. It works and there is no race-condition anymore. Thank you |
TL;DR
There is a race condition in the handling of dTLS handshake at least in the latest version that is v0.11.0.
In particular in the
tokio::select!
atread_and_buffer()@dtls/src/conn/mod.rs
line 832.Read below for further details.
What I did
I have created a simple program that is connecting two peers over WebRTC protocol and sends messages via data channel.
The most basic WebRTC chat app.
I have implemented it 3 times: in JS for web browsers, in JS for Node, and in Rust using this WebRTC crate.
You can find the code on my github: https://github.com/szatanjl/rtc-share-poc.
browser/node <-> browser/node works without any issues in any combination.
Trying to connect web browser with the rust does not work however.
It works maybe 1 out of 10 times. The WebRTC connection always works, and the connectionState is "connected", but the data channel rarely opens up and the ondatachannel is rarely called.
I have debugged the issue and came to the conclusion that there is a race condition in the code of this crate.
I created a patch with additional
println!()
statements to showcase what is wrong (patch 0001-prints.patch).Good
When everything works and data channel opens up then I can see in the logs something like:
As you can see from the logs handshake is done, it succedes and we open our data channel ("chat").
We "luckily" choose the
_ = ctx.handshake_done_rx.recv() => {
part of thetokio::select!
in theread_and_buffer()
.Bad
When it doesn't work out though I can see something like this:
As you can see from the logs handshake is done, it has succeded, but we do not open the data channel.
We get "unlucky" and we choose the
_ = ctx.handshake_tx.send(done_tx) => {
in thetokio::select!
inread_and_buffer()
.And because of this we land in the infinite while loop because it never gets the
done_rx.recv()
message.Cause
I am pretty sure that this has nothing to do with the network.
I have investigated this using wireshark and all the traffic is correct.
Web browsers (firefox or chromium) work correctly and send all the necessary packets and nothing is lost.
What happens is that there is a race condition in that select and we should never land in the while loop since the handshake is already done.
Workarounds
I can see two workaround for this issue.
Workaround 1
First one is to "repeat" the
_ = ctx.handshake_done_rx.recv() => {
inside of thetokio::select!
inside of the while loop and terminate that loop if the handshake is done.This way the loop is no longer infinite and even if we get "unlucky" in the first select it will still work out.
See 0002-workaround-1.patch.
Workaround 2
Second option is to use biased select and switch order of the futures.
This way we should never get "unlucky" and we should always prioritize the
handshake_done_rx.recv()
overhandshake_tx.send()
.I am less certain of the correctness of this one but it has worked so far in my tests and never produced a race condition.
See 0002-workaround-2.patch.
Proper solution
I believe the proper solution is to never put mpsc::Sender::send() in the tokio::select! in the first place.
And so the proper solution would be to get rid of the select completely.
Unfortunately I don't know exactly how this would have to look like then in the code.
I find it hard to completely grasp what is happening in the code, it is a bit complex for me.
And so I wouldn't want to propose what a proper solution should look like.
Hopefully you can figure that out having better knowledge of the codebase.
I think that maybe the author incorrectly thought that send() will block until other end of channel receives but that is not the case.
Citation from tokio docs on mpsc::Sender::send(): "Sends a value, waiting until there is capacity."
send() doesn't wait for the other end to receive, it only waits until there is a capacity in the channel which in this case means it doesn't wait at all.
And all this in turn means that we just roll dice.
tokio::select!
can always choose thehandshake_tx.send()
option even though we don't want it to cause the handshake is already done.Final words
I am attaching the three mentioned patches as well.
In case of any questions I'll try to help.
0001-prints.patch.txt
0002-workaround-1.patch.txt
0002-workaround-2.patch.txt
The text was updated successfully, but these errors were encountered: