Problems arising from deleting and rapidly creating a subscription #3010
Comments
Just to note, with ArgoCD this happens quite often: apparently about once a week I see broken operator subscriptions with this issue, which means all other subscriptions are blocked. My current workaround is to delete all pods and jobs in the olm namespace, then delete the CSV and InstallPlan for a subscription; after this, all subscriptions somehow find their way back to a working state. I saw this with 0.26 and with 0.25 on GKE 1.27.
I'm also running into this issue fairly often with ArgoCD. I turned off autosync/selfheal, but it cropped up again on one of my clusters.
Encountering the same issue; it always breaks automated deployment.
It should be noted that, even in the absence of this bug, deleting and re-creating the Subscription will inevitably end in a Resolution error due to the CSV being present (Subscription deletion does not lead to CSV deletion). See https://olm.operatorframework.io/docs/troubleshooting/subscription/#a-subscription-fails-because-i-deleted-a-similar-subscription-and-left-the-csv-it-installed for more info.
If the client updating the Subscription correctly used a resourceVersion precondition on the Update call, it would detect that the object it's trying to update is not the same one it saw at the outset and bail out. That would help make sure the controller is not updating a new object using a stale understanding of the world (even if no deletion had occurred).
Bug Report
Description
I've noticed that occasionally, after a subscription is updated, then deleted and immediately recreated, the newly created subscription will be updated with the status of the old, deleted subscription. This will halt installation within the namespace, as the status will link the new subscription to the installPlan which was garbage-collected as a result of deleting the original subscription.
Workaround
The issue can be resolved by simply deleting the subscription, then re-creating it after giving the controllers enough time to register the deletion event. Creating it with a different name should also ensure that the issue doesn't happen at all.
Possible Cause
I believe this occurs because items in the cache are keyed by namespace/name, and it may therefore be possible for a controller to update the new subscription with an old status using a stale entry from the cache.
Example
Following is an example of a subscription in this state. Note that no CSVs or InstallPlans were present in the namespace at the time. This was reproduced in an OpenShift 4.13 cluster with a catalog-operator image built from this repo as of commit hash 2be5e58:
Impact
While the impact may be high when this occurs, it should be fairly unlikely to happen, given how quickly the subscription has to be deleted and re-created.
Resolution
In the worst case, this may require re-architecting the internal cache implementation of OLM to make use of UIDs instead of relying on the namespace and name of objects alone. We may also be able to do a UID comparison before doing a status update, but I haven't looked into this very much.
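As a rough illustration of the UID comparison mentioned above, the following sketch keeps a cache keyed by namespace/name and refuses to copy cached status onto a live object whose UID differs. This is not OLM's actual cache or controller code: the `Subscription` struct, `cacheKey`, and `syncStatus` are hypothetical names, and the UIDs and InstallPlan name are invented for the example.

```go
package main

import "fmt"

// Subscription is a hypothetical, trimmed-down stand-in for the real type.
type Subscription struct {
	Namespace      string
	Name           string
	UID            string
	InstallPlanRef string // stands in for the status field that links to an InstallPlan
}

// cacheKey mirrors the namespace/name keying described in the report.
func cacheKey(namespace, name string) string { return namespace + "/" + name }

// syncStatus copies the cached status onto the live object only when the
// cached entry has the same UID. It returns the (possibly updated) object
// and whether the cached status was applied.
func syncStatus(cache map[string]Subscription, live Subscription) (Subscription, bool) {
	cached, ok := cache[cacheKey(live.Namespace, live.Name)]
	if !ok || cached.UID != live.UID {
		// Missing or stale entry: same name, different object. Skip the write.
		return live, false
	}
	live.InstallPlanRef = cached.InstallPlanRef
	return live, true
}

func main() {
	// The cache still holds the deleted subscription and its InstallPlan link.
	old := Subscription{Namespace: "operators", Name: "my-sub", UID: "uid-1", InstallPlanRef: "install-abc"}
	cache := map[string]Subscription{cacheKey(old.Namespace, old.Name): old}

	// A new subscription with the same namespace/name but a new UID.
	recreated := Subscription{Namespace: "operators", Name: "my-sub", UID: "uid-2"}

	updated, applied := syncStatus(cache, recreated)
	fmt.Println("applied:", applied, "installPlanRef empty:", updated.InstallPlanRef == "")
}
```

The check is cheap and local to the write path, which is why it may be a smaller change than re-keying the whole cache by UID; whether it fits OLM's controllers is not something this sketch establishes.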