
websocket: fix another sync loop #206

Merged · 1 commit merged into main on Oct 18, 2023
Conversation

alexjg (Contributor) commented Oct 17, 2023

Problem: A sync loop can be observed in the following situation:

  1. A document exists on both the server and the client with divergent heads (both sides have changes the other does not have)
  2. The client sends a sync message to the server
  3. The server responds, but for some reason the server response is dropped
  4. The client reconnects due to not receiving a response or a ping
  5. The peers exchange sync messages, but the server thinks it has already sent its changes to the client, so it doesn't send them.
  6. The client notices that it doesn't have the server's changes, so it asks for them
  7. The server responds with an empty sync message because it thinks it has already sent the changes

Steps 6 and 7 continue in an infinite loop. The root cause is the server's failure to clear the sync state associated with the given peer when it receives a new connection from the same peer ID.

Solution: when receiving a connection on a websocket server, if we already have a socket for the given peer ID, close the existing socket and emit a "peer-disconnected" event before emitting "peer-connected".
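The solution above can be sketched roughly as follows. This is an illustrative model, not the actual automerge-repo adapter code: the class name, `Socket` interface, and event payloads are assumptions; only the connection-replacement logic mirrors the description.

```typescript
import { EventEmitter } from "events";

// Illustrative socket shape; the real adapter wraps a ws WebSocket.
interface Socket {
  close(): void;
}

class WebSocketServerAdapter extends EventEmitter {
  // One live socket per peer ID.
  private sockets = new Map<string, Socket>();

  onConnection(peerId: string, socket: Socket): void {
    const existing = this.sockets.get(peerId);
    if (existing) {
      // The peer reconnected: tear down the stale socket and announce the
      // disconnect first, so listeners can clear the per-peer sync state.
      existing.close();
      this.emit("peer-disconnected", { peerId });
    }
    this.sockets.set(peerId, socket);
    this.emit("peer-connected", { peerId });
  }
}
```

A listener for "peer-disconnected" can then drop the sync state for that peer, so the next "peer-connected" starts from a fresh state.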

ept (Member) commented Oct 18, 2023

The intention with the sync protocol was that when peers disconnect and reconnect, they should both serialize and unserialize their sync states. That should cause them to forget which unacknowledged changes they sent before the disconnect, allowing those changes to be resent, but remember which changes have already been acknowledged by the other side. Do the "peer-disconnected"/"peer-connected" events have that effect?
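The serialize/unserialize round-trip ept describes can be modelled like this. This is a toy sketch of the semantics, not Automerge's real `encodeSyncState`/`decodeSyncState` implementation: the field names and JSON encoding are illustrative. The key property is that serialization preserves what the peer has acknowledged but forgets what was sent without acknowledgement, so those changes get resent after a reconnect.

```typescript
interface SyncState {
  sharedHeads: string[]; // heads both sides are known to have (acknowledged)
  sentHashes: string[];  // changes we sent but never saw acknowledged
}

function encodeSyncState(s: SyncState): string {
  // Only the durable, acknowledged part survives serialization.
  return JSON.stringify({ sharedHeads: s.sharedHeads });
}

function decodeSyncState(encoded: string): SyncState {
  const { sharedHeads } = JSON.parse(encoded);
  // Sent-but-unacknowledged bookkeeping is deliberately forgotten,
  // so those changes are eligible to be sent again.
  return { sharedHeads, sentHashes: [] };
}
```

On reconnect, replacing the state with `decodeSyncState(encodeSyncState(state))` therefore resets exactly the part of the state that caused the loop.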

alexjg (Contributor, author) commented Oct 18, 2023

Yep, the loop is happening because the sync states are not being reset; the peer-disconnected event causes the sync state to be reset when the peer next connects.
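The loop in steps 6-7, and why clearing the state breaks it, can be simulated in a few lines. This is not automerge-repo code; it models only the one behaviour at issue: the server skips changes it believes it has already sent.

```typescript
// Per-peer server bookkeeping: which change hashes it believes it has sent.
type ServerState = { sent: Set<string> };

// The server replies only with changes it has not already recorded as sent.
function serverReply(state: ServerState, requested: string[]): string[] {
  const toSend = requested.filter((h) => !state.sent.has(h));
  toSend.forEach((h) => state.sent.add(h));
  return toSend;
}
```

If the first reply is dropped in transit, every retry returns an empty reply forever, because `sent` still records the change as delivered. Clearing the state on reconnect (what the "peer-disconnected" event accomplishes in the fix) lets the change be resent.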

alexjg merged commit a4f1973 into main on Oct 18, 2023 (1 check passed)
pvh (Member) commented Oct 19, 2023

> The intention with the sync protocol was that when peers disconnect and reconnect, they should both serialize and unserialize their sync states.

I think this is a little bit of a red herring -- it certainly helps if you can do this, but what we have now is a protocol where a lost message can lead to infinite loops. This can certainly happen as part of the boundary conditions that occur during a reconnection, but there are doubtless other ways. With this feature the protocol performs better over a lossless connection, but without it the protocol does not fail.

Having advocated for this design I now find myself having lost more than a week of my life to debugging the cause of these cycles on two occasions. I think we'd better find a more resilient approach.

ept (Member) commented Oct 19, 2023

I think it's important to distinguish message loss within a single connection (that does not time out), and message loss that occurs when a connection is closed or otherwise times out. The latter is already handled by the protocol, as long as the sync state is reset on reconnection. We could also make the protocol tolerate the former type of message loss, but that would require a lot more protocol changes than just removing the check to avoid sending duplicate changes. For example, we would also have to detect loss of the initial message containing the Bloom filter (presumably we'd need to introduce a timeout to detect this message loss) and retransmit it.

In my opinion, it would not make much sense to have an application-level timeout and retransmission of messages, because the TCP-level timeout already does exactly what we need in terms of detecting dropped packets, and closing the connection if the packet loss persists. The connection close/timeout is then Automerge's way of finding out that messages have probably been lost. And resetting the sync state is our way of telling Automerge that message loss has occurred. Maybe that's a bad API and we should have an explicit "reconnect" function or something like that, but I think conceptually the model is sound.

I understand your frustration, but I'd also like to make sure we are being consistent. It would be weird to assume that some messages can be lost without triggering a connection timeout, and others cannot. If we really want to tolerate message loss in the case where no connection timeout occurred, then we should do it across the board.
