Use dbus to tell when systemd has completed an action for us #3805

clumens · 2025-01-21T14:39:13Z

At the moment, this just handles start and stop actions. I still need to deal with monitors before I can remove a bunch of code and consider the upstream issue closed. But I expect the code to handle monitors to look very similar to what's already here, so people might as well take a look.

Most importantly, replace a tab with spaces so the else blocks line up.

If you scroll up far enough, you'll see that the #endif closes out the same #ifdef that is immediately used again.

First, invert the test to see if this is a systemd resource. If it's not, there's no way goagain was set so we can just bail out to the end of the function. Then having done that, everything that was in the block can be unindented. No other code changes.

If goagain is not set, just bail to the end of the function. Having done that, everything that was in the previous block can be unindented. I think this is a little more legible. No other code changes.

Internally, pacemaker uses a shorter resource-based name when referring to systemd actions. However, systemd itself wants a real unit name so we append ".service" or whatever to the name at exec time. Because we do this at the last minute and don't save it anywhere, there's no way to figure out what action a systemd unit is associated with later. So, add that to the opaque data. This also requires an accessor function since the opaque data is supposed to be private.

At the moment, all this does is log a trace message for all recieved signals. We'll add some filtering on this later.

These are the ones that allow us to know whether a service has finished or not. In the future, we could perhaps open this filter up to other systemd signals if there's a need to do so.

This will get called when we receive a systemd message from dbus and it makes it through the filter in filter_systemd_signals. At the moment, this is only the JobRemoved signal which can be used to determine if a start action we initiated has started. This could be expanded to other signals in the future, if that would be useful.

We'll be handling this through talking to systemd over dbus. systemd has its own concept of timeouts, so we don't need to add an additional timer on top of that.

The action_complete function is called when a systemd action finishes. Unfortunately, StartUnit actions finish and return a value immediately, not when the unit we asked it to start is actually up and running. For this reason, action_complete handles start actions by then scheduling a follow-up monitor action to poll the service and see if it's actually up and running yet. It then takes the appropriate action when this happens. Since we're moving towards waiting for systemd to tells us a service is running, we don't need to do any of this anymore. However, stop actions and monitors still need this code. For the moment, it's easiest to just bail out for start actions.

An inflight operation is one that is currently being executed - either by us (in the case of something that we forked and execed), or systemd (in the case of something we told it to start up). Here, we add a function that just get a systemd inflight op that matches a given unit name.

Instead of being private to libcrmservices, this should now just be an internal-only use function. We'll need to do this to call it from elsewhere.

The executor just has to register a callback that gets called when JobRemoved signals are received from systemd. This callback then takes the data it's given and finds a matching svc_action_t. If there's a match, it can then finalize the action and remove it from the table of actions we're watching for. At the moment this only handles start actions, which is not sufficient to close T25, but it's a good start and should be easy enough to build on to handle stop actions and perhaps also monitors. Related T25

This is just like starting resources, but a lot easier because there's almost no extra code. Related T25

clumens · 2025-01-21T14:40:28Z

daemons/execd/execd_commands.c

+         */
+        pcmk__set_result(&(cmd->result), PCMK_OCF_UNKNOWN_ERROR,
+                         PCMK_EXEC_ERROR, NULL);
+    }


Note the FIXME here. There's several other possibilities systemd could give us - canceled, failed, dependency, and skipped are all documented. These could be broken out into separate errors.

clumens · 2025-01-21T14:41:56Z

lib/services/systemd.c

+    if (!pcmk__str_eq(op->action, PCMK_ACTION_START, pcmk__str_casei)) {
+        services__finalize_async_op(op);
+    }
+


I hate this because it moves a critical part of handling a resource out of the library and into the callers (see handle_systemd_job_complete). But, it's necessary to not call services__finalize_async_op right here because that function will remove the operation from the list of inflight ops and then free it, and it's called before handle_systemd_job_complete so we'll have no op to handle anymore.

clumens · 2025-01-21T15:00:36Z

lib/services/systemd.c

+    crm_trace("Checking DBus message %s.%s() on %s",
+              dbus_message_get_interface(message),
+              dbus_message_get_member(message),
+              dbus_message_get_path(message));


I might remove this message - it's kind of a lot, even at trace level, since it will be triggered for every single thing systemd ever does after pacemaker has started.

clumens · 2025-01-21T16:11:35Z

daemons/execd/execd_commands.c

+        pcmk__set_result(&(cmd->result), PCMK_OCF_UNKNOWN_ERROR,
+                         PCMK_EXEC_TIMEOUT,
+                         "Investigate reason for timeout, and adjust "
+                         "configured operation timeout if necessary");


I'm not convinced this is the correct timeout handling. I haven't specifically tested timeouts yet. Previously, we were adding a timer and if that ever triggered, it would call systemd_timeout_callback. However, systemd does also have its own concept of timeouts. That's what I'm now doing - if systemd detects a timeout, it will send us a JobRemoved() signal that will land here. I'm not doing exactly the same thing as systemd_timeout_callback, however. We have too many similar functions to do this stuff.

I guess the biggest change here is that we no longer respect the timeout attached to the operation, in favor of the systemd timeout. Maybe that's a change I can't get away with making.

Actually, maybe using the systemd timeout is fine - see invoke_unit_by_path. We set up an override file for the unit we're stopping/starting, and give it a timeout value. So, systemd should hit that timeout and let us know. I might still be doing the wrong actions upon timeout, however.

The comment should explain this pretty well. What was happening was that an executor regression test was failing because the operation never gets finalized. Essentially, services__finalize_async_op never gets called for a unit that doesn't exist. We typically want this to happen as part of the systemd callback that's registered in the executor, but this callback will never get called in this case because there's no systemd message that will make it through our filter and trigger the callback. JobRemoved won't get sent because there's no job to remove. It seems like we could look for a failure from LoadUnit for units that don't exist, but pacemaker always writes an override unit file which means the unit will always exist.

clumens added 15 commits January 20, 2025 14:31

Refactor: libcrmservice: systemd_init should return a bool.

c8f208d

Refactor: daemons: Fix whitespace problems in execd_commands.c.

a328491

Most importantly, replace a tab with spaces so the else blocks line up.

Refactor: daemons: Get rid of an unnecessary #endif/#ifdef.

615aab8

If you scroll up far enough, you'll see that the #endif closes out the same #ifdef that is immediately used again.

Refactor: daemons: Unindent the goagain block in action_complete.

c8aabb8

If goagain is not set, just bail to the end of the function. Having done that, everything that was in the previous block can be unindented. I think this is a little more legible. No other code changes.

Feature: libcrmservice: Subscribe to systemd dbus signals.

874fb8b

At the moment, all this does is log a trace message for all recieved signals. We'll add some filtering on this later.

Feature: libcrmservice: Only respond to JobRemoved signals.

ed9fceb

These are the ones that allow us to know whether a service has finished or not. In the future, we could perhaps open this filter up to other systemd signals if there's a need to do so.

Feature: systemd: Don't initiate a timer for start actions.

78bbb71

We'll be handling this through talking to systemd over dbus. systemd has its own concept of timeouts, so we don't need to add an additional timer on top of that.

Refactor: libcrmservices: Expose services__finalize_async_op.

10c0a0d

Instead of being private to libcrmservices, this should now just be an internal-only use function. We'll need to do this to call it from elsewhere.

Feature: daemons: Use dbus to detect systemd resources stopping.

cc89994

This is just like starting resources, but a lot easier because there's almost no extra code. Related T25

clumens requested a review from nrwahl2 January 21, 2025 14:39

clumens commented Jan 21, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use dbus to tell when systemd has completed an action for us #3805

Use dbus to tell when systemd has completed an action for us #3805

clumens commented Jan 21, 2025

clumens Jan 21, 2025

clumens Jan 21, 2025

clumens Jan 21, 2025

clumens Jan 21, 2025 •

edited

Loading

clumens Jan 21, 2025

Use dbus to tell when systemd has completed an action for us #3805

Are you sure you want to change the base?

Use dbus to tell when systemd has completed an action for us #3805

Conversation

clumens commented Jan 21, 2025

clumens Jan 21, 2025

Choose a reason for hiding this comment

clumens Jan 21, 2025

Choose a reason for hiding this comment

clumens Jan 21, 2025

Choose a reason for hiding this comment

clumens Jan 21, 2025 • edited Loading

Choose a reason for hiding this comment

clumens Jan 21, 2025

Choose a reason for hiding this comment

clumens Jan 21, 2025 •

edited

Loading