-
Notifications
You must be signed in to change notification settings - Fork 228
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Synapse main worker locks up in sync_partial_state_room while persisting outliers #16895
Comments
I investigated the issue a bit more. All of these rooms have in common that the partial room sync already receives 100k+ state events. Then almost all of these events are classified as outliers. Outliers are described here synapse/docs/development/room-dag-concepts.md Lines 33 to 71 in 7c805f0
The relevant lines of code are here: synapse/synapse/handlers/federation_event.py Lines 1642 to 1770 in 7c805f0
I also found this very recent PRs: matrix-org/synapse#16649 #16788 I'm asking myself:
|
I just downgraded to version 1.99.0 and this fixed the issue. For testing, I joined the room #fdroid:f-droid.org. Joining still takes long but the main worker doesn't lock up when persisting outliers. The server is still responsive and other messages are still logged. I can send and receive messages on my client. The worst thing I monitored were response times of 10s on The log (filtered for sync_partial_state_room) looks like this now. Persisting is done in batches :
It appears that the changes in #7 released in version 1.100.0 are responsible for this issue. Persisting seems to be a blocking operation and therefore batch-processing was probably the better solution. @erikjohnston I'm pinging you because these PRs came from you. Can you have a look at this, please? |
Oh, hmm. Thanks. My guess is that it has maxed out CPU and so is blocking the reactor from progressing. The solution is to probably periodically yield while handling all the events. #16903 is an attempt to do just that, would be interested if you can test it to see if it helps. |
Thanks for the reply. CPU wasn't a bottleneck. I only monitored CPU usage of 1 CPU from the main worker. The node was at about 1.6 out of 8 CPUs. If anything I/O was the bottleneck. I'll try to test #16903 later. Is there a container image that I can pull or do I need to build it myself? I'm also wondering why the old solution with the batches was discarded. I tested several rooms and didn't have any issues and it was quite fast. |
Oh, interesting. Usually when Synapse "deadlocks" it means that the main thread has become blocked spinning lots of CPU. If you can reproduce I'd be interested in the output of running https://github.com/benfred/py-spy (if possible) against it. That will tell us what it's doing. Alternatively (or as well) temporarily setting the log level to
Thanks. I'm afraid there aren't any prebuilt containers for it.
The problem was that in the previous version the "batching" aspect wasn't intentional, and so it sometime got into a perverse situation where it was doing one event at a time (which was then very slow). |
Which command of py spy would help to debug it? |
We use: |
I've tested it. First of all I wasn't able to run py-spy in a Kubernetes Pod. The OS didn't seem happy about that:
First I joined the room #fdroid:f-droid.org again and there were no problems. Although the amount of synced events was much less even though the room copy seemed to be deleted from the server. Logs filtered for "partial":
Then I tried to enter the room #ecips:matrix.org and the main worker locked up again. It was then killed by the liveness probe after two minutes.
While persisting the endpoint
I would guess that the yield comes too late when everything is already timed out. |
Same for me. Locks completely not even writing to logs and 100% cpu. Both 1.100.0 and 1.101.0. Rollbacked to 1.99.0 and it's fine. Also tested 1.101.0 with #16903 and it seems to fix the problem. |
Interesting. I think that the behavior of synapse/synapse/handlers/federation_event.py Line 1767 in 24ac3bb
is very dependent on the host OS. In my case, it only yielded after over a minute. Whereas, with the batch-processing, the response time is about 10s in the worst case. |
@ggogel that shouldn't be OS dependent as its a Twisted sleep function, rather than an OS level sleep. From the posted logs:
There are a few pauses there. When did it get killed? One thing that is interesting is that it looks like we pause for almost two minutes processing the received events. Are there log lines between them? I've also updated #16903 to also batch up the persistence, in case that was also causing problems. |
I filtered this part of the log for the string "partial". In the last log, you can see that response times go over one minute and all logs are within the same second. This is after "persisting x remaining outliers" is logged. So I think it only yielded after about one minute for a fraction of a second. I've configured the pod to get killed after 3 timeouts of 30s seconds in a 30s interval. So basically if it's unresponsive for two minutes. I'll see if I can test the new changes. |
I built a new image based on 6fa13b4 and the behavior remains unchanged. I'm attaching the full log this time (25k+ lines). After |
Just to double check I made another build based on master (equals tag 1.111.0) with the changes from #16930 cherry-picked and the behavior is still the same. 6fa13b4f927c10b5f4e9495be746ec28849f5cb6-master_2024-02-15_00-45-45.log |
@erikjohnston
I used the build based on 1.111.0 with the changes from #16903. I pulled three flame graphs right after Looks like it's already stalling here: synapse/synapse/handlers/federation_event.py Lines 1741 to 1743 in 6fa13b4
|
We had an EMS host with this issue and applying #16903 on top of v1.101.0 did help: Before the patch the synapse main process was being constantly killed due to liveness probes failing as reported here. |
It probably depends a lot on the room. Which room was it and how many outlier objects did it have? |
@anoadragon453, could you please provide specifics regarding the tests you conducted? It would be helpful to know which rooms were tested and the number of outlier events identified in each. In the screenshot of @jaywink it seems like the CPU was a bottleneck. In my case, there is no CPU bottleneck. |
@anoadragon453 So, was there any internal discussion before deciding on this fix, or was it accepted based on the provided information? I'm asking because if this solution only works for rooms with around 10k outliers but still causes issues in rooms with over 100k outliers, then it might not really solve the problem. @shyrwall also posted that the patch fixes the issue on their side. Then the issue came back again. Because it all depends on the room. If we're not testing the patch on the same rooms then any discussion about the issue being fixed is pointless because the results are not comparable. |
@ggogel The latter. I'm afraid I don't have much time to devote to this issue at the moment, but hope to do so in the next couple of weeks. |
@anoadragon453 we hit this with
Is there an easy way to figure out the room? |
You can filter the logs of the main worker for the keyword "partial". This gives you all the log entries regarding partial join, which should include the room ID. |
|
Most likely. After the partial join is completed, these log messages stop and the temporary partial room copy is turned into a "full" room copy. At least from my understanding. I cannot currently test any official matrix.org rooms, like |
I think we're hitting two different problems after the change.
I created a PR #16968, with additional information on why I believe this is a deadlock. |
@ggogel I think
|
Thanks for the feedback. I also hit this error at some point. I'm working on a different implementation now. |
@jaywink The problem seems to be fixed on the latest version of my PR. I was able to join the rooms #element-web:matrix.org, #synapse:matrix.org and #element-android:matrix.org. I'm going to test more rooms soon. |
This PR aims to fix #16895, caused by a regression in #7 and not fixed by #16903. The PR #16903 only fixes a starvation issue, where the CPU isn't released. There is a second issue, where the execution is blocked. This theory is supported by the flame graphs provided in #16895 and the fact that I see the CPU usage reducing and far below the limit. Since the changes in #7, the method `check_state_independent_auth_rules` is called with the additional parameter `batched_auth_events`: https://github.com/element-hq/synapse/blob/6fa13b4f927c10b5f4e9495be746ec28849f5cb6/synapse/handlers/federation_event.py#L1741-L1743 It makes the execution enter this if clause, introduced with #15195 https://github.com/element-hq/synapse/blob/6fa13b4f927c10b5f4e9495be746ec28849f5cb6/synapse/event_auth.py#L178-L189 There are two issues in the above code snippet. First, there is the blocking issue. I'm not entirely sure if this is a deadlock, starvation, or something different. In the beginning, I thought the copy operation was responsible. It wasn't. Then I investigated the nested `store.get_events` inside the function `update`. This was also not causing the blocking issue. Only when I replaced the set difference operation (`-` ) with a list comprehension, the blocking was resolved. Creating and comparing sets with a very large amount of events seems to be problematic. This is how the flamegraph looks now while persisting outliers. As you can see, the execution no longer locks up in the above function. ![output_2024-02-28_13-59-40](https://github.com/element-hq/synapse/assets/13143850/6db9c9ac-484f-47d0-bdde-70abfbd773ec) Second, the copying here doesn't serve any purpose, because only a shallow copy is created. This means the same objects from the original dict are referenced. This fails the intention of protecting these objects from mutation. The review of the original PR matrix-org/synapse#15195 had an extensive discussion about this matter. Various approaches to copying the auth_events were attempted: 1) Implementing a deepcopy caused issues due to builtins.EventInternalMetadata not being pickleable. 2) Creating a dict with new objects akin to a deepcopy. 3) Creating a dict with new objects containing only necessary attributes. Concluding, there is no easy way to create an actual copy of the objects. Opting for a deepcopy can significantly strain memory and CPU resources, making it an inefficient choice. I don't see why the copy is necessary in the first place. Therefore I'm proposing to remove it altogether. After these changes, I was able to successfully join these rooms, without the main worker locking up: - #synapse:matrix.org - #element-android:matrix.org - #element-web:matrix.org - #ecips:matrix.org - #ipfs-chatter:ipfs.io - #python:matrix.org - #matrix:matrix.org
Description
I've noticed an issue on my homeserver when joining some large public rooms. I tested several rooms.
These rooms are causing the issue:
Joining these similar large rooms works fine:
First, the homeserver starts the partial sync and gets the state events from another homeserver. This works fine. Then it starts persisting these events. Shortly after, the main worker locks up, blocking any incoming connection.
Persisting x remaining outliers
is logged, then the log stops.I first noticed this because the main worker was constantly killed by Kubernetes due to the liveness probe set on
/health
. I tried to remove the liveness probe but even after 30 minutes, it didn't finish persisting the events.All clients lose connection to the server.
Steps to reproduce
Homeserver
gogel.me
Synapse Version
1.100.0
Installation Method
Docker (matrixdotorg/synapse)
Database
PostgreSQL (single server on K8s)
I migrated from SQLite years ago
Workers
Multiple workers
Platform
Kubernetes
Helm chart: https://artifacthub.io/packages/helm/ananace-charts/matrix-synapse
values.yaml
Configuration
No unusual or experimental config options. No synapse modules. See the provided values.yaml.
Relevant log output
Anything else that would be useful to know?
No response
The text was updated successfully, but these errors were encountered: