Bottlerocket 1.25 is preventing nvidia driver to start #4250
Comments
Thank you for the report. I will try your reproduction recipe and let you know what I see. |
My first attempt to reproduce this failed (all instances had a functioning NVIDIA driver), which doesn't say very much if (as seems very likely) this is a race condition between driver load and kubelet start. I will keep looking. One mildly troubling indication is that I do not see any systemd scheduling that ensures that driver load occurs before kubelet start. If you would like to dig deeper yourself, you might look at this on a failing node:
This is the systemd service that loads the NVIDIA driver kmod on a G5 (or similar) instance. If this happens before kubelet start on successful boots and after kubelet start on failed boots, we will have found the race condition you suspected. |
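One way to compare the two start times on a node, as a minimal sketch: `systemctl show` exposes an `ActiveEnterTimestampMonotonic` property (microseconds since boot) for each unit. The function names below are made up for this example, and the driver-loading unit name is deliberately left as an argument because it is not spelled out above.

```shell
#!/usr/bin/env bash
# Sketch: report whether one systemd unit became active before another.
# Function names are hypothetical; unit names are passed in by the caller.

# Print the monotonic activation time of a unit, in microseconds since boot.
# systemd reports 0 for a unit that has never been active.
active_usec() {
  systemctl show -p ActiveEnterTimestampMonotonic --value "$1"
}

# Compare two activation times. Prints "before", "after", or "unknown".
# Equal timestamps are reported as "after".
compare_usec() {
  local first="$1" second="$2"
  if [ -z "$first" ] || [ -z "$second" ] || [ "$first" = "0" ] || [ "$second" = "0" ]; then
    echo "unknown"
  elif [ "$first" -lt "$second" ]; then
    echo "before"
  else
    echo "after"
  fi
}

# Report how the first unit's activation relates to the second's.
unit_order() {
  compare_usec "$(active_usec "$1")" "$(active_usec "$2")"
}
```

On a failing node this could be run as `unit_order <driver-load-unit> kubelet.service`, assuming kubelet runs as `kubelet.service`. A result of `after` on failed boots and `before` on healthy ones would point at the suspected race.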
I used
This should ensure that the driver is present before kubelet starts, so at least for now I would rule out a race between these two systemd services. |
Any configuration details you could share would be helpful in reproducing this problem. It would be interesting to see the device plugin logging, since that's the source of the kubelet error message you provide. This is what I see on my instances:
There is one possibly-relevant change in 1.25.0. We added support for GPU time-slicing (as seen in #4230). Normally I would expect the |
Hello again, and thanks for investigating this!
I'm keeping that node around in case you have other ideas of logs that could help. Thanks again! |
@larvacea, we also added @nikoul, in the failing instance, could you please check the journal for any other errors: journalctl -p4 |
@arnaldo2792 Of course, here is the error log from
|
I thought I had a Karpenter issue following upgrade to v1 but I might be having exactly the same issue: aws/karpenter-provider-aws#7046 (comment) |
Oh, thanks for sharing. Yeah, if you don't have the issue with 1.24, then it's very likely to be that same issue. |
From the logs, this seems like the problematic sequence:
For affected nodes, Specifically, this inotify event firing appears to cause the trouble:
|
Also of interest, kubelet's plugin manager starts after the FS watcher starts:
Since the FS watcher watches the directory, it will see the initial |
Thanks for the notes @bcressey. I'm working on changes to ensure |
I just had the same issue on |
My understanding of the bug is that there's nothing particular about As an update, I'm in the process of testing my changes right now. Hoping to have a PR out with the changes soon. |
Sorry for the delay here, the race condition was so rare in my testing environment that it was challenging to prove that we actually resolved the issue. After more thorough investigation, there are several unfortunate behaviors that contribute to this bug. In this case:
In practice, we can always avoid the bug by refraining from starting The issues caused by the device plugin restarts definitely require further investigation though. |
bottlerocket-os/bottlerocket-core-kit#228 is merged, which should resolve this in an upcoming Bottlerocket release! @bcressey has done some great work looking into why It seems that you can reliably "break" the exposure of GPUs via the device plugin by doing something like this:
for i in {1..100} ; do
  echo $i
  kill -sHUP $(pgrep -f nvidia-device-plugin)
  sleep 0.2
  (journalctl -t kubelet -xe|tail -n1|grep 'client connection') && break
done
This is because For some background, you can find more on how device plugins are designed here. Ok, so how do restarts work?
So what is likely happening here is that kubelet is trying to continue the "old" session after
I'll keep this issue open while:
|
thank you very much for the investigation and the upcoming fix! |
I got this same error when I describe the nodeclaim. I get this every time time-slicing is enabled; the nodeclaim state is
Also, while the node is in the Ready state, a pod that requested "nvidia.com/gpu: 1" remains in the Pending state and is not scheduled. Once I delete the pod and quickly re-apply it, the pod gets scheduled on that node and works, but the nodeclaim is still in
And if I delete the pod and wait for Karpenter to consolidate that node, consolidation does not happen, and I need to manually delete the nodeclaim to clean up that node. If time-slicing is not enabled, I haven't seen this issue.
Bottlerocket version: Bottlerocket OS 1.26.1 (aws-k8s-1.30-nvidia) |
Thanks for the new Bottlerocket version. I tried again with the latest Bottlerocket, Bottlerocket OS 1.26.2 (aws-k8s-1.30-nvidia), and the behavior persists. I am using Karpenter to launch nodes based on the resource requirement below in a pod, which triggers Karpenter to launch a GPU node. The node becomes ready quickly, but the nodeclaim always stays in
Note: We have time-slicing configured.
NodeClaim Error:
Resource:
My assumption is that Karpenter is working as expected and launches the instance for resource
Events:
New pods with the resource requirements below schedule just fine on the same node, since shared capacity is available and advertised.
Resource:
I had to delete the original pod and re-create it with the shared requirement, and it works fine after that. The issue is that the nodeclaim is
EDIT: I was able to fix my issue by not renaming the resource name from |
I got a nodeclaim stuck in Karpenter and event log:
nodeclaim.yaml (right after I requested its deletion)
apiVersion: karpenter.sh/v1
kind: NodeClaim
metadata:
  annotations:
    compatibility.karpenter.k8s.aws/cluster-name-tagged: "true"
    compatibility.karpenter.k8s.aws/kubelet-drift-hash: "15379597991425564585"
    karpenter.k8s.aws/ec2nodeclass-hash: "6440581379273964080"
    karpenter.k8s.aws/ec2nodeclass-hash-version: v3
    karpenter.k8s.aws/tagged: "true"
    karpenter.sh/nodepool-hash: "13389053327402833262"
    karpenter.sh/nodepool-hash-version: v3
    karpenter.sh/stored-version-migrated: "true"
  creationTimestamp: "2024-11-14T10:02:24Z"
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2024-11-15T09:25:36Z"
  finalizers:
  - karpenter.sh/termination
  generateName: mynodepool-
  generation: 2
  labels:
    karpenter.k8s.aws/instance-category: g
    karpenter.k8s.aws/instance-cpu: "16"
    karpenter.k8s.aws/instance-cpu-manufacturer: amd
    karpenter.k8s.aws/instance-ebs-bandwidth: "4750"
    karpenter.k8s.aws/instance-encryption-in-transit-supported: "true"
    karpenter.k8s.aws/instance-family: g5
    karpenter.k8s.aws/instance-generation: "5"
    karpenter.k8s.aws/instance-gpu-count: "1"
    karpenter.k8s.aws/instance-gpu-manufacturer: nvidia
    karpenter.k8s.aws/instance-gpu-memory: "24576"
    karpenter.k8s.aws/instance-gpu-name: a10g
    karpenter.k8s.aws/instance-hypervisor: nitro
    karpenter.k8s.aws/instance-local-nvme: "600"
    karpenter.k8s.aws/instance-memory: "65536"
    karpenter.k8s.aws/instance-network-bandwidth: "10000"
    karpenter.k8s.aws/instance-size: 4xlarge
    karpenter.sh/capacity-type: on-demand
    karpenter.sh/nodepool: mynodepool
    kubernetes.io/arch: amd64
    kubernetes.io/os: linux
    node-role: training
    node.kubernetes.io/instance-type: g5.4xlarge
    nvidia.com/gpu: A10G
    topology.k8s.aws/zone-id: euw1-az1
    topology.kubernetes.io/region: eu-west-1
    topology.kubernetes.io/zone: eu-west-1a
  name: mynodepool-btjcg
  ownerReferences:
  - apiVersion: karpenter.sh/v1
    blockOwnerDeletion: true
    kind: NodePool
    name: mynodepool
    uid: 6bd5acad-c9b4-4350-a369-b151eb571089
  resourceVersion: "460997815"
  uid: 8a2001a9-af0d-4006-8741-1d515f993d70
spec:
  expireAfter: Never
  nodeClassRef:
    group: karpenter.k8s.aws
    kind: EC2NodeClass
    name: large-disk
  requirements:
  - key: karpenter.k8s.aws/instance-memory
    operator: Gt
    values:
    - "60000"
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - g5.4xlarge
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
  - key: karpenter.sh/nodepool
    operator: In
    values:
    - mynodepool
  - key: nvidia.com/gpu
    operator: In
    values:
    - A10G
  - key: karpenter.k8s.aws/instance-family
    operator: In
    values:
    - g5
  resources:
    requests:
      cpu: 210m
      memory: 240Mi
      nvidia.com/gpu: "1"
      pods: "9"
  taints:
  - effect: NoSchedule
    key: nvidia.com/gpu
    value: "true"
status:
  allocatable:
    cpu: 15890m
    ephemeral-storage: "538926258176"
    memory: 57691Mi
    nvidia.com/gpu: "1"
    pods: "234"
    vpc.amazonaws.com/pod-eni: "34"
  capacity:
    cpu: "16"
    ephemeral-storage: 600G
    memory: 60620Mi
    nvidia.com/gpu: "1"
    pods: "234"
    vpc.amazonaws.com/pod-eni: "34"
  conditions:
  - lastTransitionTime: "2024-11-14T10:12:26Z"
    message: ""
    reason: ConsistentStateFound
    status: "True"
    type: ConsistentStateFound
  - lastTransitionTime: "2024-11-14T10:03:07Z"
    message: Resource "nvidia.com/gpu" was requested but not registered
    reason: ResourceNotRegistered
    status: Unknown
    type: Initialized
  - lastTransitionTime: "2024-11-15T09:25:37Z"
    message: ""
    reason: InstanceTerminating
    status: "True"
    type: InstanceTerminating
  - lastTransitionTime: "2024-11-14T10:02:26Z"
    message: ""
    reason: Launched
    status: "True"
    type: Launched
  - lastTransitionTime: "2024-11-14T10:02:49Z"
    message: Initialized=Unknown
    reason: UnhealthyDependents
    status: Unknown
    type: Ready
  - lastTransitionTime: "2024-11-14T10:02:49Z"
    message: ""
    reason: Registered
    status: "True"
    type: Registered
  imageID: ami-06455fd9d0e2a0590
  nodeName: i-0bb15c620a9cf7aba.eu-west-1.compute.internal
  providerID: aws:///eu-west-1a/i-0bb15c620a9cf7aba
|
Thanks @apjneeraj for following up on timeslicing being an issue. For this particular issue, we are trying to track down issues with the GPU nodes not becoming ready when not using timeslicing. If you find more issues related to timeslicing, please cut us a new issue so we can track it separately! |
Thanks @awoimbee, we thought we had fixed this in 1.26.2 but your report helps us dive into what we need to do next. It appears that our original fix did not resolve the root cause. I'm taking a closer look and will report back with what I find. |
We hit the same issue: some of our GPU nodes (g4dn.xlarge) are failing to initialize the NVIDIA driver on Kubernetes 1.29. The affected image is (same as first post) We are using Karpenter to provision our nodes and got the same error.
|
After digging in, I found that I mistakenly failed to tag my fix into the Apologies for the mixup. For anyone still experiencing this issue, there's a good chance that it is resolved by Bottlerocket @cogentist-yann's report using |
I have attempted to replicate this behavior on For what it's worth, this setup does capture the failures on @cogentist-yann I'm wondering if there's some facet that I'm missing in the reproducer. If you happen to hit the issue, do you mind checking the systemd journal for errors that look like the ones @bcressey called out in this comment? Are you enabling any additional Bottlerocket settings or deploying any additional Kubernetes services which are related to GPU configuration and scheduling? |
@cbgbt the AMI seems different, it was
Here is the provisioner used for Karpenter
I will update this message Monday. UPDATE: @cbgbt |
Thanks @cogentist-yann. I'll run my tests with the specific AMI you've mentioned as well. Otherwise, I believe this issue is resolved, but I'll leave it open for a while in case there is new data. The reported failure on |
I've run some tests on my end as well with |
Description
We have been using Bottlerocket 1.25 for the past 3 days. Since the upgrade, some of our GPU nodes (g5.xlarge) are failing to initialize the NVIDIA driver on Kubernetes 1.30. The affected image is bottlerocket-aws-k8s-1.30-nvidia-x86_64-v1.25.0-388e1050.
Issue
When the driver fails to initialize, the node does not advertise its nvidia.com/gpu capacity, causing the corresponding pods to remain in a Pending state indefinitely. This prevents any pods scheduled on the affected node from starting.
The kubelet log on the admin container of an affected node contains the following error:
Additional Information
Unknown state indefinitely and shows the following condition message:
Expected Behavior
The NVIDIA driver should initialize correctly on all GPU nodes, and the node should advertise its GPU capacity to ensure proper scheduling.
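To check a node for this symptom, here is a sketch using `kubectl`'s JSONPath output (the dots in `nvidia.com/gpu` have to be escaped). The function names are hypothetical and only illustrate the check; they are not part of Bottlerocket or Karpenter.

```shell
#!/usr/bin/env bash
# Sketch: check whether a node advertises nvidia.com/gpu as allocatable.
# Function names are hypothetical; requires kubectl access to the cluster.

# Print the node's allocatable nvidia.com/gpu count (empty if not advertised).
allocatable_gpus() {
  kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}

# Classify a count: prints "advertised" or "missing".
gpu_state() {
  case "$1" in
    ""|0) echo "missing" ;;
    *)    echo "advertised" ;;
  esac
}

# Print "<node>: advertised" or "<node>: missing".
check_node() {
  echo "$1: $(gpu_state "$(allocatable_gpus "$1")")"
}
```

Running `check_node <node-name>` against a GPU node that reports `missing` would match the behavior described in this issue.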
Steps to Reproduce
g5.xlarge instances in a Kubernetes 1.30 cluster.