Problems arising from deleting and rapidly creating a subscription #3010

Open
dtfranz opened this issue Aug 14, 2023 · 5 comments · May be fixed by #3483
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@dtfranz
Contributor

dtfranz commented Aug 14, 2023

Bug Report

Description

I've noticed that occasionally, after a subscription is updated, then deleted and immediately recreated, the newly created subscription will be updated with the status of the old, deleted subscription. This will halt installation within the namespace, as the status will link the new subscription to the installPlan which was garbage-collected as a result of deleting the original subscription.

Workaround

The issue can be resolved by simply deleting the subscription, then re-creating it after giving the controllers enough time to register the deletion event. Creating it with a different name should also ensure that the issue doesn't happen at all.

Possible Cause

I believe this occurs because items in the cache are keyed by namespace/name, and it may therefore be possible for a controller to update the new subscription with an old status using a stale entry from the cache.
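To make the suspected mechanism concrete, here is a minimal sketch of how a cache keyed only by namespace/name could hand an old status to a recreated object. This is not OLM's actual cache code: the `Subscription` struct, `Cache` type, `key`, and `syncStatus` are all hypothetical stand-ins invented for illustration.

```go
package main

import "fmt"

// Subscription is a hypothetical, pared-down stand-in for the real OLM type.
type Subscription struct {
	Namespace      string
	Name           string
	UID            string
	InstallPlanRef string // status field, flattened for brevity
}

// key builds a namespace/name cache key, ignoring the UID.
func key(s Subscription) string {
	return s.Namespace + "/" + s.Name
}

// Cache is a toy object cache keyed by namespace/name only.
type Cache map[string]Subscription

// syncStatus copies the cached status onto obj, the way a controller that
// trusts a namespace/name lookup might. It never looks at the UID.
func syncStatus(c Cache, obj Subscription) Subscription {
	if cached, ok := c[key(obj)]; ok {
		obj.InstallPlanRef = cached.InstallPlanRef
	}
	return obj
}

func main() {
	// The original subscription is cached together with its status.
	old := Subscription{
		Namespace:      "openshift-operators",
		Name:           "project-quay",
		UID:            "uid-old",
		InstallPlanRef: "install-nftdc",
	}
	cache := Cache{key(old): old}

	// It is deleted and recreated before the cache processes the deletion.
	recreated := Subscription{
		Namespace: "openshift-operators",
		Name:      "project-quay",
		UID:       "uid-new",
	}

	// The lookup for the new object hits the old entry, so the new object
	// inherits a reference to an InstallPlan that no longer exists.
	result := syncStatus(cache, recreated)
	fmt.Printf("uid=%s installPlanRef=%s\n", result.UID, result.InstallPlanRef)
}
```

In this toy model the recreated object (`uid-new`) ends up pointing at `install-nftdc`, which matches the symptom described above, though the real code path may differ.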

Example

Following is an example of a subscription in this state. Note that no CSVs or InstallPlans were present in the namespace at the time. This was reproduced in an OpenShift 4.13 cluster with a catalog-operator image built from this repo as of commit hash 2be5e58:

apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  creationTimestamp: "2023-08-10T18:23:19Z"
  generation: 1
  labels:
    operators.coreos.com/project-quay.openshift-operators: ""
  name: project-quay
  namespace: openshift-operators
  resourceVersion: "334099"
  uid: cab0d6e8-a551-4e39-ad2e-2f3c0a2caf27
spec:
  channel: stable-3.6
  installPlanApproval: Automatic
  name: project-quay
  source: community-operators
  sourceNamespace: openshift-marketplace
status:
  catalogHealth:
  - catalogSourceRef:
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: certified-operators
      namespace: openshift-marketplace
      resourceVersion: "307403"
      uid: b1f09bd9-df57-4b1a-8520-a70f2038886b
    healthy: true
    lastUpdated: "2023-08-10T18:22:53Z"
  - catalogSourceRef:
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: community-operators
      namespace: openshift-marketplace
      resourceVersion: "319530"
      uid: b3de47e6-43e6-4436-af68-ebaed5a8a7cd
    healthy: true
    lastUpdated: "2023-08-10T18:22:53Z"
  - catalogSourceRef:
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: redhat-marketplace
      namespace: openshift-marketplace
      resourceVersion: "318894"
      uid: e018a243-535b-4a4e-bec8-d3350a17eded
    healthy: true
    lastUpdated: "2023-08-10T18:22:53Z"
  - catalogSourceRef:
      apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: redhat-operators
      namespace: openshift-marketplace
      resourceVersion: "324137"
      uid: 007734c1-e24e-4b4b-8edd-6e29e84dc5d3
    healthy: true
    lastUpdated: "2023-08-10T18:22:53Z"
  conditions:
  - lastTransitionTime: "2023-08-10T18:22:53Z"
    message: all available catalogsources are healthy
    reason: AllCatalogSourcesHealthy
    status: "False"
    type: CatalogSourcesUnhealthy
  - status: "False"
    type: BundleUnpacking
  - message: 'constraints not satisfiable: subscription project-quay requires @existing/openshift-operators//quay-operator.v3.8.10,
      subscription project-quay exists, clusterserviceversion quay-operator.v3.7.11
      exists and is not referenced by a subscription, @existing/openshift-operators//quay-operator.v3.8.10
      and @existing/openshift-operators//quay-operator.v3.7.11 originate from package
      project-quay'
    reason: ConstraintsNotSatisfiable
    status: "True"
    type: ResolutionFailed
  - lastTransitionTime: "2023-08-10T18:23:20Z"
    reason: ReferencedInstallPlanNotFound
    status: "True"
    type: InstallPlanMissing
  currentCSV: quay-operator.v3.8.10
  installPlanGeneration: 3
  installPlanRef:
    apiVersion: operators.coreos.com/v1alpha1
    kind: InstallPlan
    name: install-nftdc
    namespace: openshift-operators
    resourceVersion: "333761"
    uid: 5d74d8cb-8170-438c-91ed-3d7d3e44f4ed
  installedCSV: quay-operator.v3.8.10
  installplan:
    apiVersion: operators.coreos.com/v1alpha1
    kind: InstallPlan
    name: install-nftdc
    uuid: 5d74d8cb-8170-438c-91ed-3d7d3e44f4ed
  lastUpdated: "2023-08-10T18:23:20Z"
  state: UpgradePending

Impact

While the impact may be high when this occurs, it should be fairly unlikely to happen given the speed that's required when deleting and re-creating the subscription.

Resolution

In the worst case, this may require re-architecting OLM's internal cache implementation to use UIDs instead of relying on the namespace and name of objects alone. We may also be able to do a UID comparison before doing a status update, but I haven't looked into this very much.
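As a rough idea of what the UID comparison could look like, here is a minimal sketch. It assumes the controller holds both the live object and the copy it computed the status from; `updateStatus`, `ErrUIDMismatch`, and the trimmed `Subscription` struct are hypothetical names, not existing OLM code.

```go
package main

import (
	"errors"
	"fmt"
)

// Subscription is a hypothetical, pared-down stand-in for the real OLM type.
type Subscription struct {
	Namespace      string
	Name           string
	UID            string
	InstallPlanRef string // status field, flattened for brevity
}

// ErrUIDMismatch signals that the observed copy belongs to a different
// (probably deleted) object that shared the same namespace/name.
var ErrUIDMismatch = errors.New("observed object UID does not match live object")

// updateStatus applies the status from observed onto live, but only when
// both refer to the same object instance according to their UIDs.
func updateStatus(live *Subscription, observed Subscription) error {
	if live.UID != observed.UID {
		return ErrUIDMismatch
	}
	live.InstallPlanRef = observed.InstallPlanRef
	return nil
}

func main() {
	live := &Subscription{Namespace: "openshift-operators", Name: "project-quay", UID: "uid-new"}
	stale := Subscription{
		Namespace:      "openshift-operators",
		Name:           "project-quay",
		UID:            "uid-old",
		InstallPlanRef: "install-nftdc",
	}

	// The stale copy has the same name but a different UID, so it is rejected.
	if err := updateStatus(live, stale); err != nil {
		fmt.Println("skipped status update:", err)
	}
	fmt.Printf("installPlanRef=%q\n", live.InstallPlanRef)
}
```

A guard like this would not fix the cache itself; it would only stop a stale status from being written to a different object.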

@Elyytscha

Just to note, with Argo CD this happens quite often. Apparently about once a week I see broken operator subscriptions with this issue, which means all other subscriptions are blocked. My current workaround is to delete all pods and jobs in the olm namespace, then delete the CSV and InstallPlan for a subscription; after this, all subscriptions somehow find their way back to a working state. I saw this with 0.26 and with 0.25 on GKE 1.27.

@aceat64

aceat64 commented Jan 9, 2024

I'm also running into this issue fairly often with ArgoCD. I turned off autosync/selfheal, but it cropped up again on one of my clusters.

@ciiiii

ciiiii commented Mar 12, 2024

Encountering the same issue; it always breaks automated deployment.

@perdasilva
Collaborator

It should be noted that, even in the absence of this bug, deleting and re-creating the Subscription will inevitably end in a Resolution error due to the CSV being present (Subscription deletion does not lead to CSV deletion). See https://olm.operatorframework.io/docs/troubleshooting/subscription/#a-subscription-fails-because-i-deleted-a-similar-subscription-and-left-the-csv-it-installed for more info.

@perdasilva perdasilva mentioned this issue Dec 15, 2024
@stevekuznetsov
Member

If the client updating the Subscription correctly used a resourceVersion precondition on the Update call, it would detect that the object it's trying to update is not the same one it saw at the outset and bail out. That would help make sure the controller is not updating a new object using a stale understanding of the world (even if no deletion had occurred).
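To illustrate the idea, here is a self-contained toy model of a resourceVersion precondition. It does not use the real Kubernetes client: `Store`, `Object`, and `ErrConflict` are hypothetical, and resourceVersion is modeled as an integer counter purely for simplicity (in Kubernetes it is an opaque string).

```go
package main

import (
	"errors"
	"fmt"
)

// Object is a hypothetical minimal API object.
type Object struct {
	Name            string
	ResourceVersion int // simplification: the real field is an opaque string
	Status          string
}

var (
	ErrConflict = errors.New("conflict: resourceVersion does not match")
	ErrNotFound = errors.New("not found")
)

// Store is a toy in-memory API server that bumps a global version on writes.
type Store struct {
	objects map[string]Object
	lastRV  int
}

func NewStore() *Store { return &Store{objects: map[string]Object{}} }

// Create stores a new object under name with a fresh resourceVersion.
func (s *Store) Create(name string) Object {
	s.lastRV++
	obj := Object{Name: name, ResourceVersion: s.lastRV}
	s.objects[name] = obj
	return obj
}

// Delete removes the object stored under name.
func (s *Store) Delete(name string) {
	s.lastRV++
	delete(s.objects, name)
}

// Update succeeds only if the caller's copy carries the stored resourceVersion.
func (s *Store) Update(obj Object) (Object, error) {
	current, ok := s.objects[obj.Name]
	if !ok {
		return Object{}, ErrNotFound
	}
	if current.ResourceVersion != obj.ResourceVersion {
		return Object{}, ErrConflict
	}
	s.lastRV++
	obj.ResourceVersion = s.lastRV
	s.objects[obj.Name] = obj
	return obj, nil
}

func main() {
	store := NewStore()
	seen := store.Create("project-quay") // the controller reads this copy
	store.Delete("project-quay")
	store.Create("project-quay") // recreated under the same name

	// The controller writes back status computed from its stale copy.
	seen.Status = "UpgradePending"
	_, err := store.Update(seen)
	fmt.Println("update with stale copy:", err)
}
```

Because the recreated object carries a different resourceVersion than the copy the controller read, the stale write is rejected and the controller would have to re-read before retrying.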
