docker: Containers losing access to GPUs with error: "Failed to initialize NVML: Unknown Error" #857

Open
henryli001 opened this issue Jan 11, 2025 · 1 comment
@henryli001

A standalone Docker container running on an Azure Linux 2.0 VM with the NVIDIA Container Toolkit installed loses access to the GPU after it has been running for a while, failing with the error: "Failed to initialize NVML: Unknown Error". The symptom is similar to the one described in known issue #48, and it can be reproduced by running systemctl daemon-reload on the host.

The issue does not show up if I explicitly pass --device= for each NVIDIA device node on my system in the docker command. However, this is not a sustainable solution, because the number of NVIDIA device nodes can change with the system configuration. Is there a better way to let the container access all NVIDIA devices automatically, without explicitly setting --device= for each device node?
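For reference, the per-device workaround can be generated instead of hand-written. This is only a rough sketch: the device-node name pattern below is an assumption about typical NVIDIA driver installs, and the helper name nvidia_device_flags is made up for illustration.

```python
import glob
import re
import shlex

# Assumed naming of NVIDIA character devices (nvidia0, nvidia1, ..., nvidiactl,
# nvidia-uvm, nvidia-uvm-tools, nvidia-modeset). Directories such as
# /dev/nvidia-caps are deliberately not matched.
_NVIDIA_NODE = re.compile(r"^/dev/nvidia(\d+|ctl|-uvm|-uvm-tools|-modeset)$")


def nvidia_device_flags(paths):
    """Return one --device=<path> flag per NVIDIA device node in `paths`."""
    nodes = sorted(p for p in paths if _NVIDIA_NODE.match(p))
    return [f"--device={p}" for p in nodes]


if __name__ == "__main__":
    # Discover the nodes present on this host and print the flags so they
    # can be spliced into an existing docker run command line.
    flags = nvidia_device_flags(glob.glob("/dev/nvidia*"))
    print(shlex.join(flags))
```

This only automates the enumeration; the flags still have to be recomputed whenever the set of device nodes changes, so it does not remove the underlying problem.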

@elezar
Member

elezar commented Jan 14, 2025

@henryli001 as called out in #48, one other option is to use cgroupfs as the cgroup driver instead of systemd.
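As a rough illustration of the cgroupfs option: Docker reads the cgroup driver from the exec-opts list in its daemon configuration (commonly /etc/docker/daemon.json, which is an assumption about this setup). The helper below, with the hypothetical name with_cgroupfs_driver, merges that setting into an existing configuration dictionary without touching other keys.

```python
import json

# Docker daemon exec option that selects the cgroupfs cgroup driver.
CGROUPFS_OPT = "native.cgroupdriver=cgroupfs"


def with_cgroupfs_driver(config):
    """Return a copy of a daemon.json dict with the cgroup driver set to cgroupfs."""
    updated = dict(config)
    # Drop any existing cgroup driver choice, keep unrelated exec options.
    opts = [
        opt for opt in updated.get("exec-opts", [])
        if not opt.startswith("native.cgroupdriver=")
    ]
    opts.append(CGROUPFS_OPT)
    updated["exec-opts"] = opts
    return updated


if __name__ == "__main__":
    # Example input; a real daemon.json may contain other keys as well.
    current = {"exec-opts": ["native.cgroupdriver=systemd"]}
    print(json.dumps(with_cgroupfs_driver(current), indent=2))
```

The Docker daemon has to be restarted for a change like this to take effect, and switching cgroup drivers affects every container on the host, so it is worth checking against the rest of the setup first.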

Note that using CDI to request devices should also address this problem, as the cgroups are updated by runc (or another low-level runtime) instead of the nvidia-container-runtime-hook.
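A minimal sketch of what the CDI route could look like, expressed as the command lines involved. The spec path /etc/cdi/nvidia.yaml, the image name, and the helper names are assumptions for illustration; depending on the Docker version, CDI support may also need to be enabled in the daemon configuration first.

```python
import shlex


def cdi_generate_command(spec_path="/etc/cdi/nvidia.yaml"):
    """Build the nvidia-ctk invocation that writes a CDI spec for the host's GPUs."""
    return ["nvidia-ctk", "cdi", "generate", f"--output={spec_path}"]


def cdi_run_command(image, command=(), device="nvidia.com/gpu=all"):
    """Build a docker run invocation that requests GPUs by CDI device name."""
    return ["docker", "run", "--rm", "--device", device, image, *command]


if __name__ == "__main__":
    # "ubuntu" and "nvidia-smi -L" are placeholders for a real image and workload.
    print(shlex.join(cdi_generate_command()))
    print(shlex.join(cdi_run_command("ubuntu", ["nvidia-smi", "-L"])))
```

With this approach the device is requested through a single CDI name rather than one --device= flag per device node, so the command line does not need to change when the number of GPUs does; the CDI spec itself should be regenerated when the hardware or driver changes.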

Which Docker version are you using, and how do you typically launch containers?

@elezar elezar self-assigned this Jan 14, 2025