-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Redesign metrics system with thread-safety in mind #2899
Comments
Related to #1759. In order to remove the |
+1 on each instance of each component to have its own metrics (so one could see how many packets a particular net device 2 has sent, for example). Unfortunately, on the signal handler topic, this is not a fix, and will actually make things even worse (more constrained). Currently, there is a issue of potential deadlock on a global metrics mutex, that can be retried in a signal handler even if it is already taken by same thread in normal (non-signal handler) context. E.g.: doing something like Safest thing to do is just not flush/print metrics in signal handlers 😛 |
Metrics themselves don't need locks since they're atomics, so a solution would also be to decouple the writer from the metric implementation. Then you can use different writers for different contexts with no locking required between them. Or use a single global writer wrapped in a reentrant mutex. Or some other custom approach, the point being to outsource the challenge outside of the metrics implementation. |
You definitely need to flush metrics, or at least one metric in a signal handler. Is important to not lose critical events as seccomp failures. Also I am a bit worried on the performance impact of atomic metrics. As discussed on the issue mentioned above, atomics for metrics on the hotpaths might add a degradation. Depending on architecture specifics, there we need to think alternatives like per thread metrics. |
This is similar to what I was proposing here (actually an idea from @alsrdn):
The metrics lock is only for writing to the file. If we have another file that is used from the signal handler, then this problem is solved. It is then a challenge to make the metrics system available to the signal handler without using a global variable (like with lazy_static). If we're keen on doing any metrics flushing from the signal handler I think we're stuck to using some globally accessible object, which may be fine if we solve the thread-safety issues. In a nutshell, there are two big problems to be solved here:
And indeed the per-component metrics add trouble, unless we have a sophisticated design where each component propagates the metric update to its parent whenever it becomes available (instead triggering it on a specific flush event). But I don't really see how we can use per-component metrics in the signal handler anyway. |
This is indeed a nice approach 👍🏻 |
We removed the task for the roadmap because we are currently not planning to work on it due to higher priority tasks. |
The metrics system is not fully thread-safe at the moment, due to some issues:
IncMetrics
inner state is mutated on serialisation. This causes race conditions when thewrite()
function is called from multiple threads. See: Fix signal handler race condition on metrics.write() #2893SharedIncMetrics
use atomics, they always useRelaxed
ordering. While on x86 memory access has Acquire-Release semantics, on Arm this is not the case. Hence, the process of writing metrics to file may use outdated values.Problems 1 and 3 can be fully solved by removing metrics usage from the signal handler. One option here is to have a special file used for logging the exit reason and the latest metrics values, similar to a coredump. We should also enforce that
METRICS.write()
is called from a single thread (and therefore removing thelazy_static
declaration).Problem 2 could be solved in two ways: by using tighter ordering constraints (need some further dive deep and may incur some overhead due to CPU reordering constraints and prevention of certain compiler optimisations) or by redesigning the metrics system to use per-thread values (this would also solve problem 1).
Another thing to keep in mind is potential need for having device-specific instances of a metric. For example,
METRICS.net.tx_bytes_count
may have sense to be reported per-device instance instead of being aggregated.The text was updated successfully, but these errors were encountered: