Enabling the cgroup metrics plugin on systems with cgroupv2 causes jobs to fail #15924

Closed · natefoo opened this issue Apr 11, 2023 · 2 comments

natefoo commented Apr 11, 2023

Describe the bug
Newer Linux distributions (Ubuntu 22.04+, EL9+) now mount cgroupv2 instead of cgroupv1. cgroupv2 has a completely different hierarchy, so the plugin's path assumptions are wrong and its commands fail. Because the plugin's commands are the last statements in the job script and they return non-zero, the entire job fails on DRMs like Slurm that use the job script's exit code to determine the job's exit status.
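
To illustrate, here is a minimal sketch of the kind of command the plugin appends under its cgroupv1 path assumptions (the path shown is the typical Slurm cgroupv1 layout, not the plugin's exact code):

# illustrative path only; the plugin derives the real cgroup path at runtime
cat "/sys/fs/cgroup/memory/slurm/uid_$(id -u)/job_$SLURM_JOB_ID/memory.max_usage_in_bytes"

On a cgroupv2-only host the /sys/fs/cgroup/memory hierarchy does not exist, so the command exits non-zero.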

Galaxy Version and/or server at which you observed the bug
Galaxy Version: all

To Reproduce
Steps to reproduce the behavior:

  1. Install Galaxy on Ubuntu 22.04
  2. Enable the cgroup metrics plugin (see the config sketch after this list)
  3. Configure Galaxy to run jobs on Slurm
  4. Run a job
  5. Error
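
A minimal sketch of step 2, assuming the XML-style metrics config at config/job_metrics_conf.xml (the filename and <cgroup /> element follow the sample config, but check your Galaxy version, which may use a YAML equivalent instead):

cat > config/job_metrics_conf.xml <<'EOF'
<?xml version="1.0"?>
<job_metrics>
    <core />
    <!-- the cgroup plugin is what appends the failing commands to job scripts -->
    <cgroup />
</job_metrics>
EOF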

Expected behavior
Jobs do not fail when the cgroup metrics plugin is enabled and the job runs on a host with cgroupv2 mounted.

Screenshots
N/A

Additional context
cgroupv2 is fairly useless for the one measurement that is of interest to Galaxy admins: it has no equivalent of cgroupv1's memory.max_usage_in_bytes / memory.memsw.max_usage_in_bytes (peak memory usage recorded). Thus it is recommended that cgroupv2 be disabled and cgroupv1 be mounted instead. This can be done by setting the kernel options:

systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller=1

Typically this is achieved by appending the options to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, running update-grub as root, and rebooting. Of course, this won't be an option for people who do not administer their own clusters.
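
For those who can, the GRUB route described above looks roughly like the following (Debian/Ubuntu commands assumed; on EL systems run grub2-mkconfig -o /boot/grub2/grub.cfg instead of update-grub):

# append the cgroupv1 options to the existing GRUB_CMDLINE_LINUX_DEFAULT value
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"/GRUB_CMDLINE_LINUX_DEFAULT="\1 systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller=1"/' /etc/default/grub
sudo update-grub
sudo reboot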


natefoo commented Dec 4, 2023

It would appear that max memory usage was finally added to cgroupv2 at some point, as I discovered while checking why my OpenStack instances had suddenly stopped mounting cgroupv1 (probably when I switched them from Rocky 8 to Rocky 9). The new file is memory.peak:

[root@js2-xl7 task_0]# cat /sys/fs/cgroup/system.slice/slurmstepd.scope/job_1123053/step_batch/user/task_0/memory.peak
59930103808

These instances run whatever the latest ELRepo kernel-ml is. At the time of writing:

[root@js2-xl7 task_0]# uname -a
Linux js2-xl7.novalocal 6.1.63-1.el9.elrepo.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Nov 20 11:32:53 EST 2023 x86_64 x86_64 x86_64 GNU/Linux

That said, this bug still applies: all my Slurm job scripts are exiting non-zero and Slurm thinks they have all failed, but since the only nodes I have running cgroupv2 are accessed via Pulsar, and Pulsar doesn't care about the DRM exit state, I didn't even notice until now.
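
For reference, a version-aware read of peak memory usage could look something like the sketch below (illustrative only, not the actual change that eventually fixed this; CGROUP_DIR is a hypothetical variable holding the job's cgroup directory):

# cgroupv2: newer kernels expose memory.peak
if [ -f "$CGROUP_DIR/memory.peak" ]; then
    cat "$CGROUP_DIR/memory.peak"
# cgroupv1: peak usage is memory.max_usage_in_bytes
elif [ -f "$CGROUP_DIR/memory.max_usage_in_bytes" ]; then
    cat "$CGROUP_DIR/memory.max_usage_in_bytes"
fi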


natefoo commented Oct 1, 2024

Fixed in #17169

natefoo closed this as completed Oct 1, 2024