Cluster Erroneously Stuck in Failed State #2146
Comments
Ah, subresources... you can work around this with:
The same thing is happening to us; see kubernetes-sigs/cluster-api#10991 (comment). Minor transient problems with the OpenStack API resulting in permanently failed clusters are quite annoying. CAPO shouldn't set these fields if the errors aren't terminal. And, to be honest, what kind of failures are terminal? Maybe "couldn't (re)-allocate specified loadbalancer IP", but I can't think of anything more.
I'm seeing similar problems with
The cluster comes up eventually, so CAPO correctly treats the error as transient, but the cluster stays permanently stuck as broken on the CAPI side.
As we're running our own operator on top of this, we patch it ourselves: if the CAPI cluster has these fields set but the CAPO one doesn't, we remove them from the CAPI status. It would still be great if this were addressed.
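A minimal sketch of that check, for anyone wanting to do the same. It assumes "these fields" are the failure reason and failure message on the status; `ClusterStatus` and `clearStaleFailure` are hypothetical stand-ins written for illustration, not the real CAPI or CAPO types.

```go
package main

// ClusterStatus is a hypothetical, trimmed-down stand-in for the status of
// a CAPI Cluster or a CAPO OpenStackCluster. Only the failure fields are
// modelled (assumption: these are the fields discussed in this thread).
type ClusterStatus struct {
	FailureReason  *string
	FailureMessage *string
}

// hasFailure reports whether either failure field is set.
func hasFailure(s ClusterStatus) bool {
	return s.FailureReason != nil || s.FailureMessage != nil
}

// clearStaleFailure returns the CAPI status that should be written back and
// whether anything changed. The failure is considered stale only when the
// CAPI side reports one and the CAPO side no longer does.
func clearStaleFailure(capi, capo ClusterStatus) (ClusterStatus, bool) {
	if !hasFailure(capi) || hasFailure(capo) {
		return capi, false
	}
	capi.FailureReason = nil
	capi.FailureMessage = nil
	return capi, true
}
```

The returned status still has to be written back through the status subresource, and, as another comment in this thread reports, that may not stick on every CAPI version.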
Similar to OP, various transient network errors result in this state:
With CAPI 1.7.4, patching the subresource doesn't remove the error and the cluster remains in a 'Failed' phase. |
/kind bug
What steps did you take and what happened:
Just checking the state of things in ArgoCD and noted my cluster was in the red. Boo! On further inspection I can see:
but there is no such failure message attached to the OSC resource, so I'm figuring CAPO did sort itself out eventually. I'll just edit the resource, says I, and set the phase (didn't Kubernetes deem such things in the API a total fail?) back to Provisioned, and huzzah. But that didn't work: it magically re-appeared from somewhere. I have no idea how this is even possible, but I digress...
According to kubernetes-sigs/cluster-api#10847, CAPO should only ever set these things if something is terminal, and DNS failure quite frankly isn't, especially if you are a road warrior living Mad Max style like some Antipodean Adonis, where Wifi is always up and down.
What did you expect to happen:
Treat this error as transient.
Anything else you would like to add:
Just reaching out for discussion before I delve into the code; it may already be known about or fixed. As always, you may have opinions on how this could be fixed. Logically:
should be the simple solution, depending on how well errors are propagated from Gophercloud, which is another story entirely.
Environment:
Cluster API Provider OpenStack version (or git rev-parse HEAD if manually built): 0.10.3
Kubernetes version (kubectl version): n/a
OS (/etc/os-release): n/a