
[BUG] dTLS handshake race condition #628

Closed

szatanjl opened this issue Nov 8, 2024 · 3 comments

szatanjl commented Nov 8, 2024

TL;DR

There is a race condition in the handling of the DTLS handshake, at least in the latest version, v0.11.0.
In particular, it is in the tokio::select! in read_and_buffer() at dtls/src/conn/mod.rs line 832.

Read below for further details.

What I did

I created a simple program that connects two peers over WebRTC and sends messages via a data channel.
It is the most basic WebRTC chat app.
I implemented it three times: in JS for web browsers, in JS for Node, and in Rust using this WebRTC crate.

You can find the code on my github: https://github.com/szatanjl/rtc-share-poc.

browser/node <-> browser/node works without any issues in any combination.

Trying to connect a web browser with the Rust implementation, however, does not work.
It works maybe 1 out of 10 times. The WebRTC connection itself always succeeds and connectionState becomes "connected", but the data channel rarely opens and ondatachannel is rarely called.

I debugged the issue and came to the conclusion that there is a race condition in this crate's code.

I created a patch with additional println!() statements to showcase what is wrong (patch 0001-prints.patch).

Good

When everything works and the data channel opens up, I can see something like this in the logs:

TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake done TX
-- RTC state: peer: connected, ice: connected, gathering: complete, signal: stable
TEST handshake done RX
-- Chat open

As you can see from the logs, the handshake completes and succeeds, and we open our data channel ("chat").
We "luckily" take the _ = ctx.handshake_done_rx.recv() => { arm of the tokio::select! in read_and_buffer().

Bad

When it doesn't work, though, I can see something like this:

(...)
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake done TX
-- RTC state: peer: connected, ice: connected, gathering: complete, signal: stable
TEST handshake TX

As you can see from the logs, the handshake is done and has succeeded, but we do not open the data channel.
We get "unlucky" and take the _ = ctx.handshake_tx.send(done_tx) => { arm of the tokio::select! in read_and_buffer().
Because of this we land in the infinite while loop, since it never receives the done_rx.recv() message.

Cause

I am pretty sure this has nothing to do with the network.
I investigated it with Wireshark and all the traffic is correct.
Web browsers (Firefox and Chromium) behave correctly, send all the necessary packets, and nothing is lost.

What happens is that there is a race condition in that select: we should never land in the while loop, since the handshake is already done.
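
To make the race easier to see outside the crate, here is a small standalone program (my own illustration, not code from this crate; the channel sizes and names are only stand-ins for the demo). Both arms of an unbiased tokio::select! are ready at the same time, so either one can win:

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    for _ in 0..10 {
        // Stand-ins for handshake_tx and handshake_done_rx.
        let (handshake_tx, _keep_rx_alive) = mpsc::channel::<()>(1);
        let (done_tx, mut done_rx) = mpsc::channel::<()>(1);

        // Pretend the handshake already finished before we reach the select.
        done_tx.send(()).await.unwrap();

        tokio::select! {
            // Ready immediately: the channel has free capacity, so send()
            // completes without anyone receiving on the other end.
            _ = handshake_tx.send(()) => println!("unlucky: took the send arm"),
            // Also ready immediately: the "handshake done" message is waiting.
            _ = done_rx.recv() => println!("lucky: took the done arm"),
        }
    }
}
```

Running this prints a mix of "lucky" and "unlucky" lines, which is exactly the dice roll I describe below.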

Workarounds

I can see two workarounds for this issue.

Workaround 1

The first is to "repeat" the _ = ctx.handshake_done_rx.recv() => { arm inside the tokio::select! that sits inside the while loop, and to terminate that loop when the handshake is done.

This way the loop is no longer infinite, and even if we get "unlucky" in the first select it still works out.

See 0002-workaround-1.patch.
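
Roughly the shape of the change (a sketch of the idea only, with the surrounding code elided; the attached patch is the authoritative version):

```rust
let mut waiting = true;
while waiting {
    tokio::select! {
        _ = done_rx.recv() => waiting = false,
        // Added arm: also stop waiting if the handshake is already done.
        _ = ctx.handshake_done_rx.recv() => waiting = false,
        // ... the existing arms of the inner select stay as they are ...
    }
}
```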

Workaround 2

The second option is to use a biased select and switch the order of the futures.
This way we should never get "unlucky": handshake_done_rx.recv() is always prioritized over handshake_tx.send().

I am less certain of the correctness of this one, but it has worked so far in my tests and has never produced the race.

See 0002-workaround-2.patch.
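
Again only a sketch of the idea (the attached patch is authoritative): biased makes tokio::select! poll the arms in the order they are written, so listing handshake_done_rx first means the send arm can only be taken while the handshake is genuinely not done yet.

```rust
tokio::select! {
    biased;
    _ = ctx.handshake_done_rx.recv() => {
        // Handshake already finished; skip the whole send/wait path.
    }
    _ = ctx.handshake_tx.send(done_tx) => {
        // ... existing logic that waits on done_rx ...
    }
}
```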

Proper solution

I believe the proper solution is to never put mpsc::Sender::send() in the tokio::select! in the first place, and therefore to get rid of the select completely.
Unfortunately I don't know exactly how that would have to look in the code.
I find it hard to fully grasp what is happening there; it is a bit complex for me, so I don't want to propose what the proper solution should look like.
Hopefully you can figure that out with better knowledge of the codebase.

I think the author may have incorrectly assumed that send() blocks until the other end of the channel receives, but that is not the case.

Quoting the tokio docs on mpsc::Sender::send(): "Sends a value, waiting until there is capacity."
send() does not wait for the other end to receive; it only waits until there is capacity in the channel, which in this case means it doesn't wait at all.

All of this means that we are just rolling dice:
tokio::select! can choose the handshake_tx.send() arm even though we don't want it to, because the handshake is already done.
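
A quick standalone check of that send() semantics (my own snippet, not crate code):

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<u32>(1);

    // Completes immediately: there is capacity, no receiver has to be ready.
    tx.send(1).await.unwrap();
    println!("send() returned before any recv() happened");

    // Only a second send would have to wait, because the buffer is now full.
    assert_eq!(rx.recv().await, Some(1));
}
```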

Final words

I am attaching the three patches mentioned above.
In case of any questions, I'll try to help.

0001-prints.patch.txt
0002-workaround-1.patch.txt
0002-workaround-2.patch.txt

szatanjl changed the title from "dTLS handshake race condition" to "[BUG] dTLS handshake race condition" on Nov 8, 2024
mutexd (Contributor) commented Nov 17, 2024

Nice analysis.

I took a look at the code in Pion::dtls, and this part indeed copies the logic from there. The catch is that a tokio::mpsc::channel with size 1 is used here, and it behaves differently from the Go channel. The Go channel in the original is a rendezvous channel (a size-0 channel): the sender is blocked until the receiver receives, and only then can both proceed. A tokio::mpsc::Sender, however, can proceed as long as there is capacity in the channel. A channel with size 1 is not a rendezvous channel, and tokio::mpsc::channel doesn't support size 0.

I'm convinced it is better to use a rendezvous channel for handshake_tx than to overhaul the logic copied from Pion::dtls. However, I haven't been able to find a proper replacement, so I would suggest adding some extra logic around handshake_tx to circumvent the issue. Please have a look at #631, where we replace the handshake_tx part with an async block and a oneshot channel. I've tested it with your PoC and it works fine.
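
The idea, in a minimal sketch (illustrative names only, not the actual code in #631), is to pair the message with a oneshot sender and await the acknowledgement, so the send only counts as finished once the receiver has actually picked it up:

```rust
use tokio::sync::{mpsc, oneshot};

// Send a message and wait for the receiver to acknowledge it, which gives
// the sender rendezvous-like behaviour on top of a buffered mpsc channel.
async fn send_and_wait_ack(tx: &mpsc::Sender<oneshot::Sender<()>>) {
    let (ack_tx, ack_rx) = oneshot::channel();
    if tx.send(ack_tx).await.is_ok() {
        // Resolves once the receiver consumes ack_tx and signals back
        // (or drops it), i.e. after the message has really been taken.
        let _ = ack_rx.await;
    }
}
```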

Could you help to test the PR branch with your test environment?

szatanjl (Author)

> Could you help to test the PR branch with your test environment?

I will test it once I find time, and I will get back to you.

szatanjl (Author)

Tested your fix both on master and as a cherry-picked commit on top of v0.11.0. It works and there is no race condition anymore. Thank you.
