Describe the bug
Newer Linux distributions (Ubuntu 22.04+, EL9+) now mount cgroupv2 instead of cgroupv1. cgroupv2 has a completely different hierarchy, so the plugin's path assumptions no longer hold. Because the last statement in the job script comes from the plugin and returns non-zero, it fails the entire job on DRMs like Slurm that use the exit code of the job script to determine the job exit status.
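A minimal sketch of the failure mode, using a hypothetical cgroupv1 path and job ID (not the plugin's literal commands):

```sh
#!/bin/sh
# ...the tool command runs and succeeds...

# metrics collection appended at the end of the job script; under cgroupv2
# this v1-only path does not exist, so the read exits non-zero
cat /sys/fs/cgroup/memory/slurm/uid_1000/job_12345/memory.max_usage_in_bytes
# the job script's exit status is now 1, so Slurm records the job as failed
```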
Galaxy Version and/or server at which you observed the bug
Galaxy Version: all
To Reproduce
Steps to reproduce the behavior:
1. Install Galaxy on Ubuntu 22.04
2. Enable the cgroup metrics plugin (see the config sketch after this list)
3. Configure Galaxy to run jobs on Slurm
4. Run a job
5. The job script exits non-zero and Slurm reports the job as failed
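For step 2, a minimal sketch of what enabling the plugin looks like, assuming the XML-style metrics config (job_metrics_conf.xml); element names follow Galaxy's sample config:

```xml
<?xml version="1.0"?>
<job_metrics>
    <core />
    <!-- the cgroup plugin whose shell snippet fails under cgroupv2 -->
    <cgroup />
</job_metrics>
```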
Expected behavior
Jobs do not fail when the cgroup metrics plugin is enabled and the job runs on a host with cgroupv2 mounted.
Screenshots
N/A
Additional context
cgroupv2 is fairly useless for the one thing that is of interest to Galaxy admins: it has no measurement equivalent to cgroupv1's memory.(memsw.)max_usage_in_bytes (max memory usage recorded). Thus it is recommended that cgroupv2 be disabled and cgroupv1 be mounted instead. This can be done by setting the kernel options:
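On systemd-based distributions the option that forces the legacy (v1) hierarchy is typically the following; the exact set of options may vary by distribution and kernel:

```
systemd.unified_cgroup_hierarchy=0
```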
Typically this is achieved by appending the options to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, running update-grub as root, and rebooting. Of course, this won't be an option for people who do not admin their own clusters.
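A sketch of that change, assuming a Debian/Ubuntu-style GRUB setup (on EL systems the regeneration step is grub2-mkconfig instead of update-grub):

```sh
# /etc/default/grub: append the option to the existing value
GRUB_CMDLINE_LINUX_DEFAULT="... systemd.unified_cgroup_hierarchy=0"

# regenerate the GRUB config as root, then reboot
update-grub                                # Debian/Ubuntu
# grub2-mkconfig -o /boot/grub2/grub.cfg   # EL9
reboot
```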
It would appear that a max memory usage measurement was finally added to cgroupsv2 at some point, as I discovered while checking why my OpenStack instances have suddenly stopped mounting cgroupsv1 (probably when I switched them from Rocky 8 to Rocky 9). The new file is memory.peak:
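For illustration, something along these lines reads it for the current process's cgroup (memory.peak needs a reasonably recent kernel, roughly 5.19+; the path handling here is an assumption, not Galaxy code):

```sh
# cgroupv2 is a single unified hierarchy; /proc/self/cgroup has one "0::<path>" line
CGROUP_PATH=$(awk -F: '$1 == "0" {print $3}' /proc/self/cgroup)
# peak memory usage in bytes, the closest v2 counterpart to max_usage_in_bytes
cat "/sys/fs/cgroup${CGROUP_PATH}/memory.peak"
```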
These instances run whatever the latest ELRepo kernel-ml is. As of the time of writing:
[root@js2-xl7 task_0]# uname -a
Linux js2-xl7.novalocal 6.1.63-1.el9.elrepo.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Nov 20 11:32:53 EST 2023 x86_64 x86_64 x86_64 GNU/Linux
That said, this bug still applies - all my Slurm job scripts are exiting non-zero and Slurm thinks they have all failed, but since the only nodes I have running cgroupsv2 are via Pulsar, and Pulsar doesn't care about the DRM exit state, I didn't even notice until now.
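For reference, a quick (non-Galaxy) check of which hierarchy a node has mounted:

```sh
# prints "cgroup2fs" on a cgroupv2 (unified) host, "tmpfs" on a cgroupv1/hybrid host
stat -fc %T /sys/fs/cgroup
```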