
[BUG] dTLS handshake race condition #628

Closed

szatanjl opened this issue Nov 8, 2024 · 3 comments

szatanjl commented Nov 8, 2024

TL;DR

There is a race condition in the handling of the DTLS handshake, at least in the latest version, v0.11.0.
In particular, it is in the tokio::select! in read_and_buffer() at dtls/src/conn/mod.rs line 832.

Read below for further details.

What I did

I created a simple program that connects two peers over WebRTC and sends messages via a data channel.
It is the most basic WebRTC chat app.
I implemented it three times: in JS for web browsers, in JS for Node, and in Rust using this WebRTC crate.

You can find the code on my github: https://github.com/szatanjl/rtc-share-poc.

browser/node <-> browser/node works without any issues in any combination.

Trying to connect a web browser with the Rust implementation, however, does not work.
It works maybe 1 out of 10 times. The WebRTC connection itself always succeeds and connectionState becomes "connected", but the data channel rarely opens and ondatachannel is rarely called.

I debugged the issue and came to the conclusion that there is a race condition in this crate's code.

I created a patch with additional println!() statements to showcase what is wrong (patch 0001-prints.patch).

Good

When everything works and the data channel opens up, I can see something like this in the logs:

TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake done TX
-- RTC state: peer: connected, ice: connected, gathering: complete, signal: stable
TEST handshake done RX
-- Chat open

As you can see from the logs, the handshake completes and succeeds, and we open our data channel ("chat").
We "luckily" take the _ = ctx.handshake_done_rx.recv() => { arm of the tokio::select! in read_and_buffer().

Bad

When it doesn't work, though, I can see something like this:

(...)
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake done TX
-- RTC state: peer: connected, ice: connected, gathering: complete, signal: stable
TEST handshake TX

As you can see from the logs, the handshake is done and has succeeded, but we do not open the data channel.
We get "unlucky" and take the _ = ctx.handshake_tx.send(done_tx) => { arm of the tokio::select! in read_and_buffer().
Because of this we land in the infinite while loop, since it never receives the done_rx.recv() message.

Cause

I am pretty sure this has nothing to do with the network.
I investigated it with Wireshark and all the traffic is correct.
Web browsers (Firefox and Chromium) behave correctly, send all the necessary packets, and nothing is lost.

What happens is that there is a race condition in that select: we should never land in the while loop, since the handshake is already done.
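
To make the race easier to see outside the crate, here is a small standalone program (my own illustration, not code from this crate; the channel sizes and names are only stand-ins for the demo). Both arms of an unbiased tokio::select! are ready at the same time, so either one can win:

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    for _ in 0..10 {
        // Stand-ins for handshake_tx and handshake_done_rx.
        let (handshake_tx, _keep_rx_alive) = mpsc::channel::<()>(1);
        let (done_tx, mut done_rx) = mpsc::channel::<()>(1);

        // Pretend the handshake already finished before we reach the select.
        done_tx.send(()).await.unwrap();

        tokio::select! {
            // Ready immediately: the channel has free capacity, so send()
            // completes without anyone receiving on the other end.
            _ = handshake_tx.send(()) => println!("unlucky: took the send arm"),
            // Also ready immediately: the "handshake done" message is waiting.
            _ = done_rx.recv() => println!("lucky: took the done arm"),
        }
    }
}
```

Running this prints a mix of "lucky" and "unlucky" lines, which is exactly the dice roll I describe below.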

Workarounds

I can see two workarounds for this issue.

Workaround 1

The first is to "repeat" the _ = ctx.handshake_done_rx.recv() => { arm inside the tokio::select! that sits inside the while loop, and to terminate that loop when the handshake is done.

This way the loop is no longer infinite, and even if we get "unlucky" in the first select it still works out.

See 0002-workaround-1.patch.
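
Roughly the shape of the change (a sketch of the idea only, with the surrounding code elided; the attached patch is the authoritative version):

```rust
let mut waiting = true;
while waiting {
    tokio::select! {
        _ = done_rx.recv() => waiting = false,
        // Added arm: also stop waiting if the handshake is already done.
        _ = ctx.handshake_done_rx.recv() => waiting = false,
        // ... the existing arms of the inner select stay as they are ...
    }
}
```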

Workaround 2

The second option is to use a biased select and switch the order of the futures.
This way we should never get "unlucky": handshake_done_rx.recv() is always prioritized over handshake_tx.send().

I am less certain of the correctness of this one, but it has worked so far in my tests and has never produced the race.

See 0002-workaround-2.patch.
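
Again only a sketch of the idea (the attached patch is authoritative): biased makes tokio::select! poll the arms in the order they are written, so listing handshake_done_rx first means the send arm can only be taken while the handshake is genuinely not done yet.

```rust
tokio::select! {
    biased;
    _ = ctx.handshake_done_rx.recv() => {
        // Handshake already finished; skip the whole send/wait path.
    }
    _ = ctx.handshake_tx.send(done_tx) => {
        // ... existing logic that waits on done_rx ...
    }
}
```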

Proper solution

I believe the proper solution is to never put mpsc::Sender::send() in the tokio::select! in the first place, and therefore to get rid of the select completely.
Unfortunately I don't know exactly how that would have to look in the code.
I find it hard to fully grasp what is happening there; it is a bit complex for me, so I don't want to propose what the proper solution should look like.
Hopefully you can figure that out with better knowledge of the codebase.

I think the author may have incorrectly assumed that send() blocks until the other end of the channel receives, but that is not the case.

Quoting the tokio docs on mpsc::Sender::send(): "Sends a value, waiting until there is capacity."
send() does not wait for the other end to receive; it only waits until there is capacity in the channel, which in this case means it doesn't wait at all.

All of this means that we are just rolling dice:
tokio::select! can choose the handshake_tx.send() arm even though we don't want it to, because the handshake is already done.
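
A quick standalone check of that send() semantics (my own snippet, not crate code):

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<u32>(1);

    // Completes immediately: there is capacity, no receiver has to be ready.
    tx.send(1).await.unwrap();
    println!("send() returned before any recv() happened");

    // Only a second send would have to wait, because the buffer is now full.
    assert_eq!(rx.recv().await, Some(1));
}
```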

Final words

I am attaching the three patches mentioned above.
In case of any questions, I'll try to help.

0001-prints.patch.txt
0002-workaround-1.patch.txt
0002-workaround-2.patch.txt

szatanjl changed the title from "dTLS handshake race condition" to "[BUG] dTLS handshake race condition" on Nov 8, 2024
mutexd (Contributor) commented Nov 17, 2024

Nice analysis.

I took a look at the code in Pion::dtls, and this part indeed copies the logic from there. The catch is that a tokio::mpsc::channel with size 1 is used here, and it behaves differently from the Go channel. The Go channel in the original is a rendezvous channel (a size-0 channel): the sender is blocked until the receiver receives, and only then can both proceed. A tokio::mpsc::Sender, however, can proceed as long as there is capacity in the channel. A channel with size 1 is not a rendezvous channel, and tokio::mpsc::channel doesn't support size 0.

I'm convinced it is better to use a rendezvous channel for handshake_tx than to overhaul the logic copied from Pion::dtls. However, I haven't been able to find a proper replacement, so I would suggest adding some extra logic around handshake_tx to circumvent the issue. Please have a look at #631, where we replace the handshake_tx part with an async block and a oneshot channel. I've tested it with your PoC and it works fine.
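
The idea, in a minimal sketch (illustrative names only, not the actual code in #631), is to pair the message with a oneshot sender and await the acknowledgement, so the send only counts as finished once the receiver has actually picked it up:

```rust
use tokio::sync::{mpsc, oneshot};

// Send a message and wait for the receiver to acknowledge it, which gives
// the sender rendezvous-like behaviour on top of a buffered mpsc channel.
async fn send_and_wait_ack(tx: &mpsc::Sender<oneshot::Sender<()>>) {
    let (ack_tx, ack_rx) = oneshot::channel();
    if tx.send(ack_tx).await.is_ok() {
        // Resolves once the receiver consumes ack_tx and signals back
        // (or drops it), i.e. after the message has really been taken.
        let _ = ack_rx.await;
    }
}
```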

Could you help to test the PR branch with your test environment?

szatanjl (Author)

> Could you help to test the PR branch with your test environment?

I will test it once I find time, and I will get back to you.

szatanjl (Author)

Tested your fix both on master and as a cherry-picked commit on top of v0.11.0. It works and there is no race condition anymore. Thank you.
