
websocket: fix another sync loop #206

Merged · 1 commit merged into main on Oct 18, 2023
Conversation

alexjg (Contributor) commented Oct 17, 2023

Problem: A sync loop can be observed in the following situation:

  1. A document exists on both the server and the client with divergent heads (both sides have changes the other does not have)
  2. The client sends a sync message to the server
  3. The server responds, but for some reason the server response is dropped
  4. The client reconnects due to not receiving a response or a ping
  5. The peers exchange sync messages, but the server thinks it has already sent its changes to the client, so it doesn't send them.
  6. The client notices that it doesn't have the server's changes, so it asks for them
  7. The server responds with an empty sync message because it thinks it has already sent the changes

Steps 6 and 7 continue in an infinite loop. The root cause is the server's failure to clear the sync state associated with the given peer when it receives a new connection from the same peer ID.

Solution: when receiving a connection on a websocket server, if we already have a socket for the given peer ID, close the existing socket and emit a "peer-disconnected" event before emitting "peer-connected".
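The solution above can be sketched roughly as follows. This is an illustrative model, not the actual automerge-repo adapter code: the class name, `Socket` interface, and event payloads are assumptions; only the connection-replacement logic mirrors the description.

```typescript
import { EventEmitter } from "events";

// Illustrative socket shape; the real adapter wraps a ws WebSocket.
interface Socket {
  close(): void;
}

class WebSocketServerAdapter extends EventEmitter {
  // One live socket per peer ID.
  private sockets = new Map<string, Socket>();

  onConnection(peerId: string, socket: Socket): void {
    const existing = this.sockets.get(peerId);
    if (existing) {
      // The peer reconnected: tear down the stale socket and announce the
      // disconnect first, so listeners can clear the per-peer sync state.
      existing.close();
      this.emit("peer-disconnected", { peerId });
    }
    this.sockets.set(peerId, socket);
    this.emit("peer-connected", { peerId });
  }
}
```

A listener for "peer-disconnected" can then drop the sync state for that peer, so the next "peer-connected" starts from a fresh state.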

ept (Member) commented Oct 18, 2023

The intention with the sync protocol was that when peers disconnect and reconnect, they should both serialize and unserialize their sync states. That should cause them to forget which unacknowledged changes they sent before the disconnect, allowing those changes to be resent, but remember which changes have already been acknowledged by the other side. Do the "peer-disconnected"/"peer-connected" events have that effect?
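The serialize/unserialize round-trip ept describes can be modelled like this. This is a toy sketch of the semantics, not Automerge's real `encodeSyncState`/`decodeSyncState` implementation: the field names and JSON encoding are illustrative. The key property is that serialization preserves what the peer has acknowledged but forgets what was sent without acknowledgement, so those changes get resent after a reconnect.

```typescript
interface SyncState {
  sharedHeads: string[]; // heads both sides are known to have (acknowledged)
  sentHashes: string[];  // changes we sent but never saw acknowledged
}

function encodeSyncState(s: SyncState): string {
  // Only the durable, acknowledged part survives serialization.
  return JSON.stringify({ sharedHeads: s.sharedHeads });
}

function decodeSyncState(encoded: string): SyncState {
  const { sharedHeads } = JSON.parse(encoded);
  // Sent-but-unacknowledged bookkeeping is deliberately forgotten,
  // so those changes are eligible to be sent again.
  return { sharedHeads, sentHashes: [] };
}
```

On reconnect, replacing the state with `decodeSyncState(encodeSyncState(state))` therefore resets exactly the part of the state that caused the loop.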

alexjg (Contributor, author) commented Oct 18, 2023

Yep, the loop is happening because the sync states are not being reset; the peer-disconnected event causes the sync state to be reset when the peer next connects.
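The loop in steps 6-7, and why clearing the state breaks it, can be simulated in a few lines. This is not automerge-repo code; it models only the one behaviour at issue: the server skips changes it believes it has already sent.

```typescript
// Per-peer server bookkeeping: which change hashes it believes it has sent.
type ServerState = { sent: Set<string> };

// The server replies only with changes it has not already recorded as sent.
function serverReply(state: ServerState, requested: string[]): string[] {
  const toSend = requested.filter((h) => !state.sent.has(h));
  toSend.forEach((h) => state.sent.add(h));
  return toSend;
}
```

If the first reply is dropped in transit, every retry returns an empty reply forever, because `sent` still records the change as delivered. Clearing the state on reconnect (what the "peer-disconnected" event accomplishes in the fix) lets the change be resent.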

alexjg merged commit a4f1973 into main on Oct 18, 2023 (1 check passed)
pvh (Member) commented Oct 19, 2023

> The intention with the sync protocol was that when peers disconnect and reconnect, they should both serialize and unserialize their sync states.

I think this is a little bit of a red herring -- it certainly helps if you can do this, but what we have now is a protocol where a lost message can lead to infinite loops. This can certainly happen as part of the boundary conditions that occur during a reconnection, but there are doubtless other ways. With this feature the protocol performs better over a lossless connection, but without it the protocol does not fail.

Having advocated for this design I now find myself having lost more than a week of my life to debugging the cause of these cycles on two occasions. I think we'd better find a more resilient approach.

ept (Member) commented Oct 19, 2023

I think it's important to distinguish message loss within a single connection (that does not time out), and message loss that occurs when a connection is closed or otherwise times out. The latter is already handled by the protocol, as long as the sync state is reset on reconnection. We could also make the protocol tolerate the former type of message loss, but that would require a lot more protocol changes than just removing the check to avoid sending duplicate changes. For example, we would also have to detect loss of the initial message containing the Bloom filter (presumably we'd need to introduce a timeout to detect this message loss) and retransmit it.

In my opinion, it would not make much sense to have an application-level timeout and retransmission of messages, because the TCP-level timeout already does exactly what we need in terms of detecting dropped packets, and closing the connection if the packet loss persists. The connection close/timeout is then Automerge's way of finding out that messages have probably been lost. And resetting the sync state is our way of telling Automerge that message loss has occurred. Maybe that's a bad API and we should have an explicit "reconnect" function or something like that, but I think conceptually the model is sound.

I understand your frustration, but I'd also like to make sure we are being consistent. It would be weird to assume that some messages can be lost without triggering a connection timeout, and others cannot. If we really want to tolerate message loss in the case where no connection timeout occurred, then we should do it across the board.
