Enabling the cgroup metrics plugin on systems with cgroupv2 causes jobs to fail #15924

Closed · natefoo opened this issue Apr 11, 2023 · 2 comments

natefoo commented Apr 11, 2023

Describe the bug
Newer Linux distributions (Ubuntu 22.04+, EL9+) now mount cgroupv2 instead of cgroupv1. cgroupv2 has a completely different hierarchy, so the plugin's path assumptions are wrong and its commands fail. Because the plugin's commands are the last statements in the job script and they return non-zero, the entire job fails on DRMs like Slurm that use the job script's exit code to determine the job's exit status.
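
To illustrate, here is a minimal sketch of the kind of command the plugin appends under its cgroupv1 path assumptions (the path shown is the typical Slurm cgroupv1 layout, not the plugin's exact code):

# illustrative path only; the plugin derives the real cgroup path at runtime
cat "/sys/fs/cgroup/memory/slurm/uid_$(id -u)/job_$SLURM_JOB_ID/memory.max_usage_in_bytes"

On a cgroupv2-only host the /sys/fs/cgroup/memory hierarchy does not exist, so the command exits non-zero.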

Galaxy Version and/or server at which you observed the bug
Galaxy Version: all

To Reproduce
Steps to reproduce the behavior:

  1. Install Galaxy on Ubuntu 22.04
  2. Enable the cgroup metrics plugin (see the config sketch after this list)
  3. Configure Galaxy to run jobs on Slurm
  4. Run a job
  5. Error
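
A minimal sketch of step 2, assuming the XML-style metrics config at config/job_metrics_conf.xml (the filename and <cgroup /> element follow the sample config, but check your Galaxy version, which may use a YAML equivalent instead):

cat > config/job_metrics_conf.xml <<'EOF'
<?xml version="1.0"?>
<job_metrics>
    <core />
    <!-- the cgroup plugin is what appends the failing commands to job scripts -->
    <cgroup />
</job_metrics>
EOF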

Expected behavior
Jobs do not fail when the cgroup metrics plugin is enabled and the job runs on a host with cgroupv2 mounted.

Screenshots
N/A

Additional context
cgroupv2 is fairly useless for the one measurement that is of interest to Galaxy admins: it has no equivalent of cgroupv1's memory.max_usage_in_bytes / memory.memsw.max_usage_in_bytes (peak memory usage recorded). Thus it is recommended that cgroupv2 be disabled and cgroupv1 be mounted instead. This can be done by setting the kernel options:

systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller=1

Typically this is achieved by appending the options to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, running update-grub as root, and rebooting. Of course, this won't be an option for people who do not administer their own clusters.
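
For those who can, the GRUB route described above looks roughly like the following (Debian/Ubuntu commands assumed; on EL systems run grub2-mkconfig -o /boot/grub2/grub.cfg instead of update-grub):

# append the cgroupv1 options to the existing GRUB_CMDLINE_LINUX_DEFAULT value
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"/GRUB_CMDLINE_LINUX_DEFAULT="\1 systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller=1"/' /etc/default/grub
sudo update-grub
sudo reboot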


natefoo commented Dec 4, 2023

It would appear that max memory usage was finally added to cgroupv2 at some point, as I discovered while checking why my OpenStack instances had suddenly stopped mounting cgroupv1 (probably when I switched them from Rocky 8 to Rocky 9). The new file is memory.peak:

[root@js2-xl7 task_0]# cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_1123053/step_batch/user/task_0/memory.peak
59930103808

These instances run whatever the latest ELRepo kernel-ml is. At the time of writing:

[root@js2-xl7 task_0]# uname -a
Linux js2-xl7.novalocal 6.1.63-1.el9.elrepo.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Nov 20 11:32:53 EST 2023 x86_64 x86_64 x86_64 GNU/Linux

That said, this bug still applies: all my Slurm job scripts are exiting non-zero and Slurm thinks they have all failed, but since the only nodes I have running cgroupv2 are accessed via Pulsar, and Pulsar doesn't care about the DRM exit state, I didn't even notice until now.
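
For reference, a version-aware read of peak memory usage could look something like the sketch below (illustrative only, not the actual change that eventually fixed this; CGROUP_DIR is a hypothetical variable holding the job's cgroup directory):

# cgroupv2: newer kernels expose memory.peak
if [ -f "$CGROUP_DIR/memory.peak" ]; then
    cat "$CGROUP_DIR/memory.peak"
# cgroupv1: peak usage is memory.max_usage_in_bytes
elif [ -f "$CGROUP_DIR/memory.max_usage_in_bytes" ]; then
    cat "$CGROUP_DIR/memory.max_usage_in_bytes"
fi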


natefoo commented Oct 1, 2024

Fixed in #17169

natefoo closed this as completed Oct 1, 2024