
feat: drain and volume detachment status conditions #1876

Open · jmdeal wants to merge 4 commits into base: main from feat/termination-conditions
Conversation

@jmdeal (Member) commented Dec 11, 2024

Fixes #N/A

Description
Adds status conditions for node drain and volume detachment to improve observability for the individual termination stages. This is a scoped-down version of #1837, which included these changes along with splitting each termination stage into a separate controller. I will continue to work on that refactor, but I'm decoupling these changes so I can focus on higher-priority work.

Status Conditions:

| Condition | Unknown | False | True |
| --- | --- | --- | --- |
| Drained | Karpenter hasn't attempted to drain the node yet. | Karpenter hasn't completed draining the node (it may be blocked by PDBs, do-not-disrupt, etc.). Karpenter will not proceed to instance termination while Drained is in this state. | Karpenter has successfully drained the node and will proceed with the termination flow. |
| VolumesDetached | Karpenter hasn't checked the node for volume attachments yet. This won't transition out of Unknown until Drained transitions to True. | Not all blocking VolumeAttachment objects have been deleted. If a terminationGracePeriod (TGP) is set, Karpenter can still proceed with termination in this state, which is indicated in the reason. | All blocking VolumeAttachment objects have been deleted. |
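As a rough, hypothetical sketch of the semantics in this table (not the PR's code, which uses Karpenter's own status-condition helpers on the NodeClaim), the two condition types could be recorded with the upstream apimachinery helpers like this:

```go
package termination

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Condition types from the table above. The real constants live in Karpenter's v1 API package.
const (
	ConditionTypeDrained         = "Drained"
	ConditionTypeVolumesDetached = "VolumesDetached"
)

// setDrained records the drain stage's progress on a generic condition list.
// Status mapping follows the table: Unknown = drain not yet attempted,
// False = drain started but not finished (possibly blocked), True = drain complete.
func setDrained(conditions *[]metav1.Condition, status metav1.ConditionStatus, reason, message string) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    ConditionTypeDrained,
		Status:  status,
		Reason:  reason,
		Message: message,
	})
}
```

VolumesDetached follows the same shape, with reasons such as AwaitingVolumeDetachment or TerminationGracePeriodElapsed. The PR's actual implementation calls helpers like `nodeClaim.StatusConditions().SetFalse(...)`, as visible in the review threads below.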

How was this change tested?
make presubmit

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 11, 2024
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 11, 2024
@jmdeal force-pushed the feat/termination-conditions branch from 4bb4d97 to fb3ac47 on December 11, 2024 at 20:31
@coveralls commented Dec 11, 2024

Pull Request Test Coverage Report for Build 12982696607

Details

  • 71 of 122 (58.2%) changed or added relevant lines in 4 files are covered.
  • 5 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.1%) to 80.979%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pkg/utils/node/node.go | 0 | 5 | 0.0% |
| pkg/controllers/node/termination/controller.go | 60 | 106 | 56.6% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| pkg/test/expectations/expectations.go | 2 | 94.81% |
| pkg/controllers/node/termination/controller.go | 3 | 63.52% |

| Totals | Coverage Status |
| --- | --- |
| Change from base Build 12954850842 | -0.1% |
| Covered Lines | 9111 |
| Relevant Lines | 11251 |

💛 - Coveralls

@engedaam (Contributor)

/assign @engedaam

5 review threads on pkg/controllers/node/termination/controller.go (outdated, resolved)
github-actions bot commented Jan 2, 2025

This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 2, 2025
@jmdeal (Member Author) commented Jan 11, 2025

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 11, 2025
@jmdeal force-pushed the feat/termination-conditions branch 2 times, most recently from b527992 to 21176e1 on January 15, 2025 at 20:02
@jmdeal (Member Author) commented Jan 15, 2025

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 15, 2025
@engedaam (Contributor)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 15, 2025
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 15, 2025
@engedaam (Contributor) left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 15, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: engedaam, jmdeal
Once this PR has been reviewed and has the lgtm label, please assign jonathan-innis for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 15, 2025
@jmdeal force-pushed the feat/termination-conditions branch from d536a96 to 43949ef on January 16, 2025 at 17:21
@engedaam (Contributor) left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 16, 2025
if err != nil {
return reconcile.Result{}, fmt.Errorf("listing nodeclaims, %w", err)
if nodeutils.IsDuplicateNodeClaimError(err) || nodeutils.IsNodeClaimNotFoundError(err) {
(Member)

nit: Should we throw a comment over this one indicating that we don't expect this case to happen and that, if it does, something has gone wrong and we've broken some tenet of the system?
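A sketch of the kind of comment being requested, layered onto the check from the snippet above (the surrounding control flow is abbreviated and partly guessed from the diff):

```go
if err != nil {
	// We never expect a duplicate or missing NodeClaim for a node that has made it
	// to the termination controller; hitting this branch means an invariant of the
	// system has been broken.
	if nodeutils.IsDuplicateNodeClaimError(err) || nodeutils.IsNodeClaimNotFoundError(err) {
		// handle the unexpected state
	}
	return reconcile.Result{}, fmt.Errorf("listing nodeclaims, %w", err)
}
```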

}

if err = c.deleteAllNodeClaims(ctx, nodeClaims...); err != nil {
return reconcile.Result{}, fmt.Errorf("deleting nodeclaims, %w", err)
// If the underlying NodeClaim no longer exists, we want to delete to avoid trying to gracefully drain nodes that are
(Member)

Remind me again: I recall there was a bug and a reason that we moved this up -- something with us getting stuck on the terminationGracePeriod and continually trying to drain even if the instance was already terminated, right?

(Member Author)

Since we were only checking this in the drain logic, if we drained but were stuck awaiting volume attachments, we never hit this check and could get stuck indefinitely. I don't think there was any interaction with terminationGracePeriod; if anything, it would save users in that case.
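A hypothetical outline of the flow being described, with the instance-existence check hoisted ahead of both the drain and the volume-detachment wait so neither stage can spin forever on an instance that no longer exists (helper and field names here are illustrative, not the PR's exact code; only cloudprovider.IsNodeClaimNotFoundError and removeFinalizer appear in the diff):

```go
func (c *Controller) finalize(ctx context.Context, node *corev1.Node, nodeClaim *v1.NodeClaim) (reconcile.Result, error) {
	// Check the backing instance first. Previously this check only ran inside the
	// drain logic, so a node stuck waiting on VolumeAttachments after a successful
	// drain could wait indefinitely on an instance that was already terminated.
	if _, err := c.cloudProvider.Get(ctx, nodeClaim.Status.ProviderID); err != nil {
		if cloudprovider.IsNodeClaimNotFoundError(err) {
			return reconcile.Result{}, c.removeFinalizer(ctx, node)
		}
		return reconcile.Result{}, fmt.Errorf("getting nodeclaim, %w", err)
	}
	// 1. Drain pods (sets the Drained condition).
	// 2. Wait for blocking VolumeAttachments to be deleted (sets VolumesDetached).
	// 3. Terminate the instance and remove the finalizer.
	return reconcile.Result{}, nil
}
```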

if err = c.terminator.Taint(ctx, node, v1.DisruptedNoScheduleTaint); err != nil {
if errors.IsConflict(err) {
if errors.IsConflict(err) || errors.IsNotFound(err) {
(Member)

If the node is no longer found, why would we choose to requeue in that case?

(Member Author)

I think this came out of a previous review: it deduplicates the Node-not-found logic by relying on the check at the top, at the cost of an extra reconcile.

(Member)

Right -- it just seems odd because it's completely counter to every other controller where we perform this logic. It seems odd not to handle it here, since the error check is the same anyway and the check is free.

if cloudprovider.IsNodeClaimNotFoundError(err) {
return reconcile.Result{}, c.removeFinalizer(ctx, node)
stored := nodeClaim.DeepCopy()
if modified := nodeClaim.StatusConditions().SetFalse(v1.ConditionTypeDrained, "Draining", "Draining"); modified {
(Member)

Should it be Drained (Unknown) here, since we are in the process of draining but haven't completed our drain logic -- at which point we would mark the status as Drained=true?

(Member Author)

My thought is that at this point we know it is not drained, since it's in the process of draining, whereas before we do the check we don't know whether there are any drainable pods on the node, so Drained is Unknown.

(Member)

See #1876 (comment), but I personally disagree with this framing -- I think we should have done InstanceTerminated and gone from Unknown to True/False there as well. Transitioning from True -> False or False -> True for a terminal status condition is generally a little odd, because it suggests that the process has finished when in fact it hasn't.

} else if !c.hasTerminationGracePeriodElapsed(nodeTerminationTime) {
c.recorder.Publish(terminatorevents.NodeAwaitingVolumeDetachmentEvent(node))
stored := nodeClaim.DeepCopy()
if modified := nodeClaim.StatusConditions().SetFalse(v1.ConditionTypeVolumesDetached, "AwaitingVolumeDetachment", "AwaitingVolumeDetachment"); modified {
(Member)

Can this also be set to Unknown, since we are in the process of detaching the volumes and it hasn't hit a terminal state yet?

(Member Author)

Similar to Drained, I think it's clearer to set it to False here since we know the volumes aren't detached. Unknown indicates to me that we don't know one way or the other.

(Member)

I think this is counter to how we have been treating status conditions throughout the project -- we use them to indicate that a process hasn't completed and we don't know whether it's going to succeed or fail. For instance, a Job's Complete condition doesn't go into a False state while the job is running; it stays Unknown because we don't know whether the job will complete, and then transitions to True/False once it enters a terminal state.
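For illustration, the pattern being advocated here (again using the generic apimachinery helpers from the earlier sketch, not Karpenter's own API) keeps the condition Unknown while the work is still in flight and only moves it once the outcome is terminal:

```go
// setTerminalCondition leaves the condition Unknown while the stage is still running,
// and only transitions to True/False once the stage has actually finished.
func setTerminalCondition(conditions *[]metav1.Condition, condType string, finished, succeeded bool, reason, message string) {
	status := metav1.ConditionUnknown // in progress: outcome not yet known
	if finished {
		status = metav1.ConditionFalse // finished and failed (terminal)
		if succeeded {
			status = metav1.ConditionTrue // finished successfully (terminal)
		}
	}
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    condType,
		Status:  status,
		Reason:  reason,
		Message: message,
	})
}
```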

return reconcile.Result{RequeueAfter: 1 * time.Second}, nil
} else {
stored := nodeClaim.DeepCopy()
if modified := nodeClaim.StatusConditions().SetFalse(v1.ConditionTypeVolumesDetached, "TerminationGracePeriodElapsed", "TerminationGracePeriodElapsed"); modified {
(Member)

I could see us setting it to False here, since this indicates that we failed to detach the volumes and had to terminate due to hitting our terminationGracePeriod on the node.

// 404 = the nodeClaim no longer exists
if errors.IsNotFound(err) {
continue
if volumesDetached {
(Member)

Is there a way to clean up some of this logic so that it's not so nested? Comments might also help. The fact that we fall through when volumes are detached and reach the bottom of the function (we don't requeue) is a bit confusing IMO.
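One way to address this, sketched with hypothetical helper names (only hasTerminationGracePeriodElapsed, the recorder call, and the condition reasons come from the diff; volumesDetached and setVolumesDetached are made up for illustration), is to pull the volume-detachment wait into a helper with early returns so the fall-through case is explicit:

```go
// awaitVolumeDetachment reports whether the caller should requeue before moving on
// to instance termination. Falling through (requeue == false) means either all
// blocking VolumeAttachments are gone or the terminationGracePeriod has elapsed.
func (c *Controller) awaitVolumeDetachment(ctx context.Context, node *corev1.Node, nodeClaim *v1.NodeClaim, nodeTerminationTime *time.Time) (requeue bool, err error) {
	volumesDetached, err := c.volumesDetached(ctx, node) // hypothetical: checks for blocking VolumeAttachments
	if err != nil {
		return false, fmt.Errorf("checking volume attachments, %w", err)
	}
	if volumesDetached {
		// Terminal success: all blocking VolumeAttachments have been deleted.
		return false, c.setVolumesDetached(ctx, nodeClaim, metav1.ConditionTrue, "VolumesDetached")
	}
	if c.hasTerminationGracePeriodElapsed(nodeTerminationTime) {
		// Proceed anyway, but record why in the condition reason.
		return false, c.setVolumesDetached(ctx, nodeClaim, metav1.ConditionFalse, "TerminationGracePeriodElapsed")
	}
	// Still waiting: surface an event, update the condition, and requeue.
	c.recorder.Publish(terminatorevents.NodeAwaitingVolumeDetachmentEvent(node))
	return true, c.setVolumesDetached(ctx, nodeClaim, metav1.ConditionFalse, "AwaitingVolumeDetachment")
}
```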

InvolvedObject: node,
Type: corev1.EventTypeNormal,
Reason: "AwaitingVolumeDetachment",
Message: "Awaiting deletion VolumeAttachments bound to node",
(Member)

Is it useful to list out the VolumeAttachments that we are waiting on (or maybe a pretty-printed list of them) in the message here?

@jmdeal (Member Author) commented Jan 24, 2025

I didn't, for the same reason I didn't include the pods on the Drained status condition - we would either be hammering the API server as the list changes, or the information would frequently be out of date. The former is a non-starter IMO, and the latter makes me feel like it isn't worth it. I think we can add a troubleshooting entry in the docs for how to find blocking VolumeAttachments if needed.

(Member)

> hammering the API server as the list changes or the information would be out of date frequently

You can make it so that the events are only fired at a certain frequency and deduped without considering their message -- honestly, I think including the extra info could be helpful, and if there are attachments that are actually stuck, that would be really valuable information for a user to know.
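As a sketch of that suggestion (the Event fields mirror the snippet above; the helper name, package placement, import path, and truncation scheme are illustrative assumptions), the event message could carry a bounded list of the blocking VolumeAttachment names, with deduplication and rate limiting handled by the recorder keyed on the reason rather than the message:

```go
package terminatorevents

import (
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
	storagev1 "k8s.io/api/storage/v1"

	"sigs.k8s.io/karpenter/pkg/events"
)

// NodeAwaitingVolumeDetachmentWithNamesEvent is a hypothetical variant of the existing
// event helper that includes which VolumeAttachments are still blocking termination.
func NodeAwaitingVolumeDetachmentWithNamesEvent(node *corev1.Node, volumeAttachments ...*storagev1.VolumeAttachment) events.Event {
	const maxListed = 5
	names := make([]string, 0, len(volumeAttachments))
	for _, va := range volumeAttachments {
		names = append(names, va.Name)
	}
	// Truncate long lists so the event message stays readable.
	if len(names) > maxListed {
		names = append(names[:maxListed], fmt.Sprintf("and %d more", len(names)-maxListed))
	}
	return events.Event{
		InvolvedObject: node,
		Type:           corev1.EventTypeNormal,
		Reason:         "AwaitingVolumeDetachment",
		Message:        fmt.Sprintf("Awaiting deletion of VolumeAttachments bound to node (%s)", strings.Join(names, ", ")),
	}
}
```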

@jmdeal force-pushed the feat/termination-conditions branch from 43949ef to 4ed938c on January 27, 2025 at 05:33
@k8s-ci-robot (Contributor)

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 27, 2025
Labels
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
do-not-merge/hold - Indicates that a PR should not merge because someone has issued a /hold command.
size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects: None yet
6 participants