Overhaul event publication lifecycle #796
Comments
Does not being able to tell that an event is being processed also mean that multi-instance apps are currently not an option? I'm not a database expert, but I believe at least PostgreSQL supports row-level locking, which would allow concurrent processing of events by multiple instances, as opposed to some leader-election scheme.
@breun The whole processing (send event -> handler -> mark as finished) is not done by "submitting" the event to the table and having a "worker" pick it up. Processing always happens on the instance the event was sent from in the first place. The publication log is just there to keep track of which events have been processed, and the only information you currently have is whether there is a completion date on an event and when it was published.

We've built our own retry mechanism around the log, in which we retry events that are at least n minutes old. But because it is a publication log, we have the issue that events sometimes get processed multiple times when an event takes a long time to be processed (either because one of the steps takes a long time, or because a lot of events were sent and cannot be processed since all threads in the thread pool are already busy). And that's where this misconception comes into play: if you overlook the fact that the table is not used for processing at all, you may run into these issues when you build your own retry mechanism.

My thoughts around this topic

We would really wish for a way to distinguish between events that are currently being processed and events that have failed, but every implementation has edge cases that you may or may not want to support in spring-modulith. With a dedicated status field (e.g. SUBMITTED, PROCESSING, FINISHED, FAILED) you can easily find out which events to retry based on the FAILED status and skip the PROCESSING ones, unless you have events that are stuck because the instance went down while processing them. To identify those in a multi-instance setup, you would have to keep track of which instances are currently active.

If you handle it via a failedDate column, you have to identify the currently-being-processed ones via an offset (as described in the issue description). But here you have to be careful with longer-running tasks, because, as I mentioned, it can take a few minutes until an event is picked up (when all threads are utilized). In that case it could make sense to also have a column recording when the event got picked up and the handler was triggered.

Conclusion

Thinking more about it, a big problem with the event_publication table is the misconception I mentioned. For example, I expected that event_publication could be seen as a "light" version of event externalization, but it definitely does not work that way (and probably shouldn't be used that way). From my gut feeling (and from talking with colleagues about it), I am not the only one who stepped into that trap. Maybe I am drifting to a slightly different topic than this issue is about, but I think the docs should make clearer that the current event_publication mechanism should not be seen as an externalized event processing mechanism, and should spell out its limitations, also with regard to what users expect (as @breun mentioned).

Edit: I just took a look into the docs, and event externalization means that you publish events to other systems so that other applications can consume them, not that you consume events from those systems.
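The "retry events that are at least n minutes old" heuristic described above, and its double-processing flaw, can be sketched independently of any framework. All names here are illustrative and do not reflect Spring Modulith's actual schema or types:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Optional;

// Illustrative, framework-free model of one row in a publication log.
record PublicationLogEntry(String eventType, Instant publicationDate, Optional<Instant> completionDate) {

    boolean isIncomplete() {
        return completionDate.isEmpty();
    }
}

class FuzzyRetrySelector {

    // Selects entries that have no completion date and were published more than
    // `threshold` ago. Note the flaw discussed above: an entry that is still
    // being processed (slow handler, saturated thread pool) is indistinguishable
    // from one that failed, so it may be resubmitted and processed twice.
    static List<PublicationLogEntry> selectForRetry(List<PublicationLogEntry> log, Duration threshold, Instant now) {
        return log.stream()
                .filter(PublicationLogEntry::isIncomplete)
                .filter(e -> e.publicationDate().isBefore(now.minus(threshold)))
                .toList();
    }
}
```

Recent Spring Modulith versions ship a similar age-based cutoff via IncompleteEventPublications.resubmitIncompletePublicationsOlderThan(Duration), which shares the same caveat: "incomplete and old" does not necessarily mean "failed".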
It is unclear to me what Modulith's responsibility should be: only ensuring the delivery of events, or also dealing with problems.

To ensure the delivery of events,

For handling problems,

Therefore, I think Modulith should focus on the delivery side of things, for example by better tracking the status (event queued, event processing, …) and by providing better docs, and perhaps some callbacks, on how to deal with event delivery problems, like event classes that have been removed or event listeners that no longer exist.
Event publication in Modulith is simple, and I think (my opinion) that expanding it will lead you into scope creep where you will not know when to stop. Thinking about the problem differently: instead of an event handler that is recorded (and committed) as part of a transaction, what about scheduling a job? Then you are solving a different problem, running background jobs, for which we have many existing solutions, and they typically solve:
I'm experimenting with db-scheduler and it does exactly that.
Does this mean you assume that only one instance in a multi-instance deployment is actively doing work? I'm still trying to wrap my head around this concept and its implications.

I have the feeling I don't grasp it completely yet, but my main impression now is that it feels 'dangerous' to use this for high-traffic multi-instance services, which is a typical use case for me. I've seen most Spring Modulith talks promote using events instead of direct method calls across module boundaries, and I see the benefits of decoupling via events. But I feel that this event publication mechanism could become a performance/availability risk once it is a dependency of every call to a high-traffic module method that publishes an event for other modules to listen to. In that sense a direct method call feels 'safer' than using events: no database table that could fill up or otherwise impact performance or availability.

Is this an irrational fear I have of the Spring Modulith event publication mechanism? Should I just configure Modulith to auto-remove completed events in a high-traffic scenario, to keep the event log from growing indefinitely, and otherwise not worry about it too much?
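On the auto-removal question: Spring Modulith exposes a CompletedEventPublications interface with a deletePublicationsOlderThan(Duration) method that can be invoked from a scheduled job, and later releases added completion modes that avoid keeping completed rows at all. A hedged application.properties sketch; the exact property names and their availability depend on your Spring Modulith version:

```properties
# Delete completed publications instead of keeping them with a completion date
# (completion modes were introduced in later Spring Modulith releases).
spring.modulith.events.completion-mode=delete

# Resubmit incomplete publications when the application restarts.
spring.modulith.republish-outstanding-events-on-restart=true
```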
The persistent structure of an event publication currently effectively represents two states. Its default state captures the fact that a transactional event listener will have to be invoked eventually. The publication also stays in that state while the listener processes the event. Once the listener succeeds, the event publication is marked as completed. If the listener fails, the event publication stays in its original state.
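The two effective states described above can be modeled, framework-free, roughly as follows. This is a minimal sketch for illustration, not Spring Modulith's actual EventPublication type:

```java
import java.time.Instant;
import java.util.Optional;

// Minimal model of the current lifecycle: a publication is created incomplete
// and only gains a completion date once its listener succeeds. A failed
// listener leaves the record untouched, which is exactly why "failed" cannot
// be told apart from "not processed yet".
class EventPublication {

    private final Object event;
    private final Instant publicationDate;
    private Instant completionDate; // null while incomplete

    EventPublication(Object event, Instant publicationDate) {
        this.event = event;
        this.publicationDate = publicationDate;
    }

    void markCompleted(Instant when) {
        this.completionDate = when;
    }

    boolean isCompleted() {
        return completionDate != null;
    }

    Optional<Instant> getCompletionDate() {
        return Optional.ofNullable(completionDate);
    }
}
```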
This basic lifecycle is easy to work with but has a couple of downsides. First and foremost, we cannot differentiate between publications that are about to be processed, ones that are currently being processed, and ones that have failed. Especially the latter is problematic, as a primary use case supported by the registry is the ability to recover from erroneous situations by resubmitting failed event publications. Developers usually resort to rather fuzzy approaches, like considering events that have not been completed within a given time frame to have failed.
To improve on this, we'd like to move to a more sophisticated event publication lifecycle that makes it easier to detect failed publications. One possible way to achieve this would be to introduce a dedicated status field, or, consistent with the current approach of setting a completion date, a failed date field that would be set in case an event listener fails. That step, however, might fail as well, affected by the same erroneous situation that made the event listener fail in the first place. That's why it might make sense to introduce a duration configuration property after which incomplete event publications are considered failed as well.
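The proposal above, an explicit failed date plus a fallback duration for the case where even the marking-as-failed write could not happen, could look roughly like this. A framework-free illustration, not actual Spring Modulith code:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical publication state combining an explicit failedDate with a
// configurable duration-based fallback.
class PublicationState {

    Instant publicationDate;
    Instant completionDate; // set on listener success
    Instant failedDate;     // set when the listener throws

    PublicationState(Instant publicationDate) {
        this.publicationDate = publicationDate;
    }

    boolean isConsideredFailed(Duration maxProcessingTime, Instant now) {
        if (failedDate != null) {
            return true; // explicitly marked as failed
        }
        if (completionDate != null) {
            return false; // completed successfully
        }
        // Fallback: incomplete for longer than the configured duration, which
        // also covers the case where writing failedDate itself failed.
        return publicationDate.isBefore(now.minus(maxProcessingTime));
    }
}
```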
The feature bears a bit of risk, as we will have to think about the upgrade process for existing Spring Modulith applications, whose databases might still contain entries for incomplete event publications.
Ideas / Action Items

- Introduce a failedDate column.
- Consider incomplete publications whose publishedDate lies before now() minus the configured duration to have failed.
- CompletionRegisteringMethodInterceptor would need to issue the marking as failed on exception.
- IncompleteEventPublications would have to get a getFailedPublications() method.

Related tickets