Avoid agent hanging in case of issues with /proc/PID/cmdline #597

Closed
wants to merge 1 commit into from

Conversation

@ttrafelet ttrafelet commented Jun 16, 2023

We have witnessed an issue where a container in a Kubernetes environment was not cleaned up properly after being in the CrashLoopBackOff state. In this case, ps hangs indefinitely when it tries to read /proc/PID/cmdline of the affected process.

Because ps-related commands are run as-is by the Checkmk agent, the check_mk_agent process itself hangs as well. This then leads to:

  • a timeout on the Checkmk server
  • one check_mk_agent process (and consequently, one ps process) being spawned every check interval
  • finally, very high CPU load on the system caused by IO wait of the hundreds of hanging processes, rendering the system unusable for production workload.

All the affected systems are running Red Hat Enterprise Linux 8.8.
Checkmk version is 2.1.0p16. Since this really is a misbehaviour of the underlying OS, other versions will be affected as well.

External resource describing this issue (and possible reasons): https://rachelbythebay.com/w/2014/10/27/ps/

Proposed changes

The proposed fix simply adds waitmax -s 9 5 (and an extra echo for section_ps) to the commands in the agent that rely on the command line information of processes. The agent already uses waitmax in other sections, such as NFS mounts and TCP information. Affected are:

  • the ps command used in section_ps
  • the pgrep command used in section_heartbeat (required because pgrep relies on the process' cmdline by design)
  • the second systemctl command in section_systemd used for [status]
  • the ps command used to determine CURRENT_SHELL

Note: The 5-second grace period is an arbitrary value that seemed reasonable to us; it was not derived from a detailed analysis.
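
As a rough illustration of the pattern described above (not the actual diff), the sketch below bounds a command's runtime and emits an extra newline if the command had to be killed, so that a truncated last line cannot run into the next section header. It assumes GNU coreutils `timeout` as a stand-in for the agent's `waitmax`; the function name `run_bounded` and the section names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not part of the Checkmk agent.
# coreutils `timeout` is used here as a stand-in for `waitmax`.
run_bounded() {
    # $1: time limit in seconds; remaining arguments: the command to run
    limit="$1"
    shift
    timeout -s KILL "$limit" "$@"
    status=$?
    # GNU timeout documents 124 for a timed-out command and 137 (128+9)
    # when SIGKILL was needed; treat both as "the command was cut off".
    if [ "$status" -eq 124 ] || [ "$status" -eq 137 ]; then
        # Terminate a possibly truncated line so the next section header
        # starts on a line of its own.
        echo
    fi
    return "$status"
}

# Demo with a command that is guaranteed to exceed the limit.
echo '<<<demo_section>>>'
run_bounded 1 sleep 3 || echo 'command was killed after 1 second' >&2
echo '<<<next_section>>>'
```

In the agent, the same wrapper would surround the ps, pgrep and systemctl calls listed above, with 5 as the limit.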

Result

If a host is facing this issue with being unable to access the cmdline of a process, waitmax will kill the ps or pgrep process after 5 seconds. The agent output will thus be delayed a little, yet will continue with other sections.

As a (potentially hidden) side effect, process monitoring will break for any monitored process that would be listed after the one whose cmdline cannot be read.

Reasoning

Even though this issue can be considered a bug-like behaviour of the OS itself, we still think this fix should be implemented into the agent. This ensures the host is still being monitored although there might be issues with the processes. As of now, the Checkmk agent will simply run into timeouts as soon as ps starts hanging, rendering the monitoring of this host unusable.

@github-actions

github-actions bot commented Jun 16, 2023

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅

@ttrafelet
Author

I have read the CLA Document and I hereby sign the CLA or my organization already has a signed CLA.

Because:
- The agent would hang in case there were issues with accessing the
  cmdline of a running process
- The agent hanging causes loads of hanging checkmk processes

This commit:
- introduces waitmax to every process related command in the agent that
  requires access to the command line information
- ensures only the directly affected sections will break and not others
  by adding an extra echo on waitmax related kills
@ttrafelet ttrafelet force-pushed the ps-commandline-hang-fix branch from 6fc2bae to c59837c Compare June 19, 2023 05:48
@si-23 si-23 added bug Something isn't working Component: Checks & Agents labels Sep 27, 2023
@mo-ki
Member

mo-ki commented Oct 20, 2023

Hi Thierry,

Thank you for this exceptionally well written PR. It was an interesting read :-)

That is quite some effort to put into handling a system state that is bad to begin with :-/.
I wonder whether -- if we decide to do this -- we should wrap the affected calls in waitmax in general -- to avoid usage without it then.

But maybe we can just keep it simple.
More importantly: We have to address the possibility of an empty CURRENT_SHELL variable. I don't think that will go well if we don't do anything about it. However: have you ever observed that ps call to hang? It is restricted to $$, so it should be alright I guess.

Another problem I have is that I am not sure whether those commands may take longer for legitimate reasons.
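
One way the empty CURRENT_SHELL concern could be handled is a fixed fallback whenever the bounded ps call yields nothing. This is only a sketch: the exact ps invocation, the `detect_shell` name and the `/bin/sh` default are assumptions, and coreutils `timeout` again stands in for `waitmax`.

```shell
#!/bin/sh
# Hypothetical fallback logic, not taken from the agent or from this PR.
detect_shell() {
    # $1: time limit in seconds for the ps call
    # Ask ps for the command line of the current shell; errors are discarded.
    shell="$(timeout -s KILL "$1" ps -o args= -p "$$" 2>/dev/null | cut -d' ' -f1)"
    # If ps was killed or printed nothing, fall back to an assumed default.
    printf '%s\n' "${shell:-/bin/sh}"
}

CURRENT_SHELL="$(detect_shell 5)"
echo "CURRENT_SHELL=${CURRENT_SHELL}"
```

With this shape the variable is never empty, although the fallback value may not match the shell that is actually running.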

@mo-ki mo-ki self-assigned this Oct 20, 2023
@ttrafelet
Author

ttrafelet commented Oct 25, 2023

Hi Moritz

I am absolutely on your side with this. Principally, it should never come to this situation in the first place (and I have only ever seen it on this occasion).

As the agent is run in intervals and spawns new processes every run, I do think this should be considered, though. Otherwise the agent will exacerbate the underlying issue and slowly "kill" the host. This would direct (part of) the blame towards Checkmk, which is not helpful ;)

I have considered another option as well: Instead of adding waitmax to "every" command, one could ensure only one instance of the agent is ever run at the same time. This would have other side-effects as well, though (like not being able to manually test the agent if it is already running, issues with monitoring from multiple sites etc.). Further, this would basically either lock the second call to the agent until the first is finished (which would not really solve anything I guess), or simply drop the connection, resulting in a CRIT on the site.
I personally prefer for the agent to at least return something rather than nothing at all.
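
For reference, the single-instance alternative could look roughly like the sketch below, which uses `flock` from util-linux in non-blocking mode. The lock path and the `run_single_instance` name are hypothetical; a second caller fails immediately, which corresponds to the "drop the connection" variant described above.

```shell
#!/bin/sh
# Hypothetical wrapper, not part of the Checkmk agent.
run_single_instance() {
    # $1: lock file; remaining arguments: the command to run
    lockfile="$1"
    shift
    # -n: give up immediately instead of waiting if the lock is already held
    flock -n "$lockfile" "$@"
}

# Example call with an illustrative lock path.
run_single_instance "${TMPDIR:-/tmp}/agent_sketch.lock" echo 'agent run'
```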

Just an idea - would be a major change I guess: Maybe one could implement logic that tracks whether sections finished in time, that could then be shown in the "Check_MK Agent" check?

Regarding your question: With the proposed fixes the agent never hung again in this scenario. So the call for CURRENT_SHELL should not be affected.

@dnlldl
Contributor

dnlldl commented Mar 23, 2024

I've witnessed similar situations on our Linux systems where the agent would hang because of this, eventually leading to a load so high that the system would just crash.

The issue I have with waitmax is, it's very arbitrary, and on some slower systems this might actually become a new problem. I'd like something more resilient but can't think of much. The default timeout for the agent response from the server side is 60 seconds if I recall correctly. Within the new systemd agent services, maybe there could be a watchdog that would kill the agent processes when it's running more than once for multiple checks and somehow report it to the server (by using the spool or some other mechanic) so that we get alerted that something is wrong with the system.

Not something very easy to do, and it's really a corner case. Baking in additional code in the agent for this might make it break on some systems. Maybe have a separate agent with a separate ps plugin that we can deploy on systems we know can be more prone to do this for whatever reason (such as with Kubernetes or containers in general).

@mo-ki
Member

mo-ki commented Mar 26, 2024

Good point. Disabling the offending section and replacing it by a custom plugin is at least a solution that is scalable (and could be shared on the exchange ;-) )

I'll ping PM about this, it seems a significant amount of effort for an edge case.

@NikCheckmk

Hi Thierry,

Many thanks from my side too for your excellently written pull request. Such comprehensive descriptions help us assess the contributions quickly.

Unfortunately however, we have to decline your PR. While it certainly is a valid workaround for this specific misbehavior of the underlying OS, it comes at the cost of introducing the risk of potential side effects for other users, e.g., if the affected commands take longer for legitimate reasons.

However, we will definitely keep your use case in mind so that we can take it into account if we do a more comprehensive overhaul of the agent in the future.

Until then, I would recommend disabling the section and replacing it with a custom plugin as suggested by Moritz.

Thank you very much for your understanding.

Best regards,

Niklas (product manager)

@mo-ki mo-ki closed this Mar 26, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Mar 26, 2024
Labels
bug Something isn't working Component: Checks & Agents
5 participants