Avoid agent hanging in case of issues with /proc/PID/cmdline #597
Because:
- The agent would hang in case there were issues with accessing the cmdline of a running process
- The agent hanging causes loads of hanging checkmk processes
This commit:
- introduces waitmax to every process related command in the agent that requires access to the command line information
- ensures only the directly affected sections will break and not others by adding an extra echo on waitmax related kills
6fc2bae to c59837c
Hi Thierry, Thank you for this exceptionally well written PR. It was an interesting read :-) That is quite some effort to put into handling a system state that is bad to begin with :-/. But maybe we can just keep it simple. Another problem I have is that I am not sure whether those commands may take longer for legitimate reasons.
Hi Moritz,
I am absolutely on your side with this. In principle, it should never come to this situation in the first place (and I have only ever seen it on this occasion). As the agent is run at intervals and spawns new processes on every run, I do think this should be considered, though. Otherwise the agent will exacerbate the underlying issue and slowly "kill" the host. This would direct (part of) the blame towards Checkmk, which is not helpful ;)
I have considered another option as well: instead of adding waitmax to "every" command, one could ensure that only one instance of the agent ever runs at a time. This would have other side effects as well, though (like not being able to manually test the agent if it is already running, issues with monitoring from multiple sites etc.). Further, this would basically either block the second call to the agent until the first has finished (which would not really solve anything, I guess), or simply drop the connection, resulting in a CRIT on the site.
Just an idea, and it would be a major change, I guess: maybe one could implement logic that tracks whether sections finished in time, which could then be shown in the "Check_MK Agent" check?
Regarding your question: With the proposed fixes the agent never hung again in this scenario. So the call for
I've witnessed similar situations on our Linux systems where the agent would hang because of this, eventually leading to a load so high that the system would just crash. The issue I have with Not something very easy to do, and it's really a corner case. Baking in additional code in the agent for this might make it break on some systems. Maybe have a separate agent with a separate
Good point. Disabling the offending section and replacing it by a custom plugin is at least a solution that is scalable (and could be shared on the exchange ;-) ) I'll ping PM about this, it seems a significant amount of effort for an edge case.
Hi Thierry, Many thanks from my side too for your excellently written pull request. Such comprehensive descriptions help us assess the contributions quickly. Unfortunately however, we have to decline your PR. While it certainly is a valid workaround for this specific misbehavior of the underlying OS, it comes at the cost of introducing the risk of potential side effects for other users, e.g., if the affected commands take longer for legitimate reasons. However, we will definitely keep your use case in mind so that we can take it into account if we do a more comprehensive overhaul of the agent in the future. Until then, I would recommend disabling the section and replacing it with a custom plugin as suggested by Moritz. Thank you very much for your understanding. Best regards, Niklas (product manager)
We have witnessed an issue where a container in a Kubernetes environment was not cleaned up properly after being in the CrashLoopBackOff state. In this case, ps will simply hang forever when it tries to access /proc/PID/cmdline.
Because ps-related commands are run as-is by the Checkmk agent, the check_mk_agent process itself will hang as well. This then leads to:
All the affected systems are running Red Hat Enterprise Linux 8.8.
Checkmk version is 2.1.0p16. Since this really is a misbehaviour of the underlying OS, other versions will be affected as well.
External resource describing this issue (and possible reasons): https://rachelbythebay.com/w/2014/10/27/ps/
Proposed changes
The proposed fix simply adds a waitmax -s 9 5 (and an extra echo for section_ps) to the two commands in the agent that rely on the command output of processes. This command is already present for other sections like nfs mounts, tcp information, and such. Affected are:
[status] CURRENT_SHELL
Note: The 5-second grace period is an arbitrary value that seemed reasonable to us; it was not derived from a detailed analysis.
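The wrapping pattern described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: it uses coreutils timeout as a stand-in for the agent's waitmax (called here with the same "-s 9" plus seconds arguments), and limited_section is a hypothetical helper name. Note that the fallback echo in this sketch fires on any non-zero exit status, not only on kills.

```shell
#!/bin/sh
# Hypothetical helper: print a section header, run a command under a hard
# time limit, and emit an extra newline if the command fails or is killed.
# "timeout" (coreutils) is an assumption standing in for "waitmax".
limited_section() {
    name="$1"
    limit="$2"
    shift 2
    echo "<<<${name}>>>"
    # If the command is killed in the middle of a line, the extra echo
    # terminates that partial line, so the next section header starts on
    # a fresh line and only this section is affected.
    timeout -s 9 "$limit" "$@" || echo
}
```

A call such as limited_section ps 5 ps ax would then give the command 5 seconds before it is killed with SIGKILL, after which the script simply moves on to the next section.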
Result
If a host is unable to access the cmdline of a process, waitmax will kill the ps or pgrep process after 5 seconds. The agent output will thus be delayed a little, but the agent will continue with the other sections.
As a (potentially hidden) side effect, process monitoring will break for any monitored process that would be listed after the one whose cmdline cannot be read, because the listing is cut off at that point.
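The side effect can be reproduced with a small simulation. Everything in it is hypothetical: fake_ps is a stand-in script that stalls before its third entry, the process names are made up, and coreutils timeout again takes the place of waitmax.

```shell
#!/bin/sh
# Hypothetical stand-in for ps: prints two entries, then stalls before the
# third, mimicking a hang while reading one process's /proc/PID/cmdline.
fake_ps='echo "root 1 /sbin/init"
echo "root 200 /usr/sbin/sshd"
sleep 10
echo "www 300 /usr/sbin/nginx"'

# Kill the listing after 1 second. "|| true" only keeps the non-zero exit
# status of the killed command from aborting a script run with "set -e".
listing="$(timeout -s 9 1 sh -c "$fake_ps" || true)"

# Succeeds if the given name appears in the captured (truncated) listing.
is_listed() {
    printf '%s\n' "$listing" | grep -q "$1"
}
```

In this sketch, is_listed sshd succeeds while is_listed nginx fails, even though the simulated nginx process exists: it would simply have been listed after the entry that caused the stall.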
Reasoning
Even though this issue can be considered bug-like behaviour of the OS itself, we still think this fix should be implemented in the agent. It ensures that the host is still monitored even when there are issues with individual processes. As of now, the Checkmk agent will simply run into timeouts as soon as ps starts hanging, rendering the monitoring of this host unusable.