websocket: fix another sync loop #206
Conversation
Problem: A sync loop can be observed in the following situation:

1. A document exists on both the server and the client with divergent heads (both sides have changes the other does not have)
2. The client sends a sync message to the server
3. The server responds, but for some reason the server's response is dropped
4. The client reconnects due to not receiving a response or a ping
5. The peers exchange sync messages, but the server thinks it has already sent its changes to the client, so it doesn't send them
6. The client notices that it doesn't have the server's changes, so it asks for them
7. The server responds with an empty sync message because it thinks it has already sent the changes

Steps 6 and 7 continue in an infinite loop. The root cause is the server's failure to clear the sync state associated with the given peer when it receives a new connection from the same peer ID.

Solution: when receiving a connection on a websocket server, if we already have a socket for the given peer ID, close the existing socket and emit a "peer-disconnected" event before emitting "peer-connected".
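A minimal sketch of how such a guard might look on the server side (the `SyncServer` class, `sockets` map, and event names here are illustrative, not the actual automerge-repo implementation):

```ts
import { WebSocket } from "ws"
import { EventEmitter } from "events"

type PeerId = string

// Hypothetical server state: at most one live socket per peer ID.
class SyncServer extends EventEmitter {
  private sockets = new Map<PeerId, WebSocket>()

  handleConnection(peerId: PeerId, socket: WebSocket) {
    const existing = this.sockets.get(peerId)
    if (existing) {
      // The peer has reconnected. Drop the stale socket and tell the
      // repo the peer went away so it discards the old sync state.
      existing.close()
      this.emit("peer-disconnected", { peerId })
    }
    this.sockets.set(peerId, socket)

    // A fresh "peer-connected" starts a new sync state for this peer,
    // so changes the server thought it had already sent are offered again.
    this.emit("peer-connected", { peerId, socket })

    socket.on("close", () => {
      // Only clean up if this socket is still the current one for the peer.
      if (this.sockets.get(peerId) === socket) {
        this.sockets.delete(peerId)
        this.emit("peer-disconnected", { peerId })
      }
    })
  }
}
```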
The intention with the sync protocol was that when peers disconnect and reconnect, they should both serialize and deserialize their sync states. That should cause them to forget which unacknowledged changes they sent before the disconnect, allowing those changes to be resent, but remember which changes have already been acknowledged by the other side. Do the "peer-disconnected"/"peer-connected" events have that effect?
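For reference, the round trip described here looks roughly like this with the Automerge JS API (variable names are illustrative):

```ts
import * as Automerge from "@automerge/automerge"

// While the connection is up, sync messages are exchanged via
// generateSyncMessage / receiveSyncMessage, which update syncState.
let syncState = Automerge.initSyncState()

// On disconnect: serialize the sync state. encodeSyncState keeps the
// heads the peer has acknowledged but drops the record of changes we
// sent that were never acknowledged.
const saved: Uint8Array = Automerge.encodeSyncState(syncState)

// On reconnect: deserialize. Because the in-flight (unacknowledged)
// changes were forgotten, generateSyncMessage will offer them again
// rather than assuming the other side already has them.
syncState = Automerge.decodeSyncState(saved)
```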
Yep, the loop is happening because the sync states are not being reset and the …
I think this is a little bit of a red herring -- it certainly helps if you can do this, but what we have now is a protocol where a lost message can lead to infinite loops. This can certainly happen as part of the boundary conditions that occur during a reconnection, but there are doubtless other ways. With this feature the protocol performs better over a lossless connection, but without it the protocol does not fail. Having advocated for this design I now find myself having lost more than a week of my life to debugging the cause of these cycles on two occasions. I think we'd better find a more resilient approach.
I think it's important to distinguish message loss within a single connection (that does not time out), and message loss that occurs when a connection is closed or otherwise times out. The latter is already handled by the protocol, as long as the sync state is reset on reconnection.

We could also make the protocol tolerate the former type of message loss, but that would require a lot more protocol changes than just removing the check to avoid sending duplicate changes. For example, we would also have to detect loss of the initial message containing the Bloom filter (presumably we'd need to introduce a timeout to detect this message loss) and retransmit it.

In my opinion, it would not make much sense to have an application-level timeout and retransmission of messages, because the TCP-level timeout already does exactly what we need in terms of detecting dropped packets, and closing the connection if the packet loss persists. The connection close/timeout is then Automerge's way of finding out that messages have probably been lost. And resetting the sync state is our way of telling Automerge that message loss has occurred. Maybe that's a bad API and we should have an explicit "reconnect" function or something like that, but I think conceptually the model is sound.

I understand your frustration, but I'd also like to make sure we are being consistent. It would be weird to assume that some messages can be lost without triggering a connection timeout, and others cannot. If we really want to tolerate message loss in the case where no connection timeout occurred, then we should do it across the board.
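The reset-on-reconnection behaviour described above could be wired to the disconnect event roughly as follows (a sketch only; the `syncStates` map and `onSocketClosed` handler are hypothetical, and the real automerge-repo network adapters differ):

```ts
import * as Automerge from "@automerge/automerge"

// Hypothetical client-side handling: treat a closed socket as the signal
// that messages may have been lost, and reset the per-peer sync state
// before the next connection attempt.
function onSocketClosed(
  syncStates: Map<string, Automerge.SyncState>,
  peerId: string
) {
  const state = syncStates.get(peerId)
  if (state === undefined) return

  // The encode/decode round trip drops the "already sent" bookkeeping
  // while keeping what the peer has acknowledged, so the next connection
  // re-offers any changes whose delivery was never confirmed.
  const reset = Automerge.decodeSyncState(Automerge.encodeSyncState(state))
  syncStates.set(peerId, reset)
}
```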