Alert groups reappear with a new ID after a while #4998

Open
senpro-ingwersenk opened this issue Sep 9, 2024 · 0 comments
Comments

@senpro-ingwersenk

What went wrong?

What happened:
We use Grafana's provisioning to provision alerts from a private Git repository. Recently, since around 1.8.x, my colleagues have reported from time to time that alerts just completely reappear and are now duplicates. Looking into the database itself, it seems that the alert group is completely re-created but matches the previous alerts.

Side-note: I "inherited" this setup from a former colleague who had less than five days to teach me how to manage and maintain it, and aside from me, nobody here really knows how to work with this kind of software (including MySQL and such), especially since it is all deployed in a Kubernetes (k3s) cluster. So, tl;dr: I have to administer all of this on my own, with no prior experience.

This is a screenshot of what we see:
[Screenshot showing the duplicated alert groups]

And this is what I see in the database:
mysql-output.txt
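
For reference, the output above was gathered roughly like this (a minimal sketch only: the mysql container name, the database name and the alerts_alertgroup table/column names are assumptions based on OnCall's Django schema and may differ in other setups):

# Illustrative only - container, database and table names are assumptions.
kubectl exec -it -n oncall deploy/oncall -c mysql -- \
  mysql -u root -p oncall \
  -e "SELECT id, public_primary_key, started_at, resolved, acknowledged
      FROM alerts_alertgroup
      ORDER BY started_at DESC
      LIMIT 20;"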

Here is the related provisioning snippet:

# alerts/group_msWinSrvDiskWarn.yaml
apiVersion: 1
groups:
    # (...)
    - orgId: 20
      name: Windows (Server)
      folder: Microsoft Corporation
      interval: 1m
      rules:
        - uid: wlzli-msWinSrvDiskWarn
          title: Disk Usage (Server) (Warning)
          condition: condition
          data:
            - refId: main
              relativeTimeRange:
                from: 21600
                to: 0
              datasourceUid: wlzli-microsoft
              model:
                datasource:
                    type: influxdb
                    uid: wlzli-microsoft
                intervalMs: 1000
                maxDataPoints: 43200
                query: "from(bucket: \"microsoft\")\r\n  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)\r\n  |> filter(fn: (r) => r[\"_measurement\"] == \"wmi.Win32_Volume\")\r\n  |> filter(fn: (r) => r[\"_field\"] == \"FreeSpace\" or r[\"_field\"] == \"Capacity\")\r\n  |> filter(fn: (r) => r[\"type\"] == \"server\")\r\n  |> pivot(rowKey:[\"_time\"], columnKey: [\"_field\"], valueColumn: \"_value\")\r\n  |> map(fn: (r) => ({ r with _value: (float(v: r.Capacity) - float(v: r.FreeSpace)) / float(v: r.Capacity) * 100.0}))\r\n  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)\r\n  |> yield(name: \"mean\")"
                refId: main
            - refId: alert
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params: []
                        type: gt
                      operator:
                        type: and
                      query:
                        params:
                            - A
                      reducer:
                        params: []
                        type: last
                      type: query
                datasource:
                    type: __expr__
                    uid: __expr__
                expression: main
                intervalMs: 1000
                maxDataPoints: 43200
                reducer: last
                refId: alert
                type: reduce
            - refId: condition
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params:
                            - 85
                        type: gt
                      operator:
                        type: and
                      query:
                        params:
                            - B
                      reducer:
                        params: []
                        type: last
                      type: query
                datasource:
                    type: __expr__
                    uid: __expr__
                expression: alert
                intervalMs: 1000
                maxDataPoints: 43200
                refId: condition
                type: threshold
          dashboardUid: ab310a98-5baf-4cda-bf4f-920bd19fbdc8
          panelId: 80
          noDataState: NoData
          execErrState: Error
          for: 5m
          annotations:
            __dashboardUid__: ab310a98-5baf-4cda-bf4f-920bd19fbdc8
            __panelId__: "80"
          labels:
            customer: wlzli
            severity: WARN
          isPaused: false

As you can see in the output, the same alert group gets recreated several times.

What did you expect to happen:
We expected that, after adding a note to the alert group and choosing either Acknowledge or Resolve, the group would keep this state until a new alert matching the criteria arrives.

How do we reproduce it?

I am unfortunately not aware of how to reproduce this - all I know is that the issue started happening after the first time I upgraded OnCall to 1.8.x. Before each upgrade I read all the changelogs, but found nothing - neither here nor in Grafana's - that would indicate something that had to be changed. Apologies for that. Though, due to my visual impairment, I would not be surprised if I overlooked something - it happens sometimes...

Grafana OnCall Version

1.9.20

Product Area

Alert Flow & Configuration

Grafana OnCall Platform?

Kubernetes

User's Browser?

Happens in Chrome, Firefox, Edge and Brave.

Anything else to add?

We deploy Grafana and OnCall in separate deployments:

root@senst-sv-k3s01 ~# kubectl get -n grafana all
NAME                           READY   STATUS    RESTARTS      AGE
pod/grafana-5689bb9b5c-ssmc6   4/4     Running   1 (91m ago)   91m

NAME              TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/grafana   ClusterIP   10.43.143.40   <none>        3000/TCP   418d

NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/grafana   1/1     1            1           418d

NAME                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/grafana-5689bb9b5c   1         1         1       91m
replicaset.apps/grafana-59578dfdbd   0         0         0       2d19h
replicaset.apps/grafana-5b7cdd696d   0         0         0       2d21h
replicaset.apps/grafana-5cf74df48b   0         0         0       2d19h
replicaset.apps/grafana-5dbb58c767   0         0         0       2d19h
replicaset.apps/grafana-6546ffbb6b   0         0         0       2d21h
replicaset.apps/grafana-67977b5965   0         0         0       4h9m
replicaset.apps/grafana-7944d879c9   0         0         0       2d19h
replicaset.apps/grafana-7d68c77b4c   0         0         0       2d19h
replicaset.apps/grafana-95bcf4444    0         0         0       2d19h
replicaset.apps/grafana-99c7957b8    0         0         0       2d19h

root@senst-sv-k3s01 ~# kubectl get -n oncall all
NAME                          READY   STATUS    RESTARTS        AGE
pod/oncall-6c84c58bc4-8bszx   5/5     Running   155 (46m ago)   6d18h

NAME             TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE
service/oncall   ClusterIP   10.43.130.73   <none>        3306/TCP,5672/TCP,8080/TCP   378d

NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/oncall   1/1     1            1           378d

NAME                                DESIRED   CURRENT   READY   AGE
replicaset.apps/oncall-54cd95b4dc   0         0         0       6d20h
replicaset.apps/oncall-587969d66f   0         0         0       9d
replicaset.apps/oncall-59dd5884df   0         0         0       75d
replicaset.apps/oncall-6476568b55   0         0         0       35d
replicaset.apps/oncall-6b5c4c87fb   0         0         0       6d19h
replicaset.apps/oncall-6c84c58bc4   1         1         1       6d18h
replicaset.apps/oncall-6d4ff5946c   0         0         0       34d
replicaset.apps/oncall-796fccc755   0         0         0       38d
replicaset.apps/oncall-79f6d85476   0         0         0       75d
replicaset.apps/oncall-7c486b8d59   0         0         0       75d
replicaset.apps/oncall-844fc69fc7   0         0         0       75d

The OnCall deployment bundles RabbitMQ, MySQL and Redis, while the Grafana deployment only bundles Postgres.

A Helm chart is not used - my former colleague wrote the manifests by hand (and it shows...), so we update versions manually by changing the version tag on the images, roughly as sketched below. The cluster is built on three nodes.
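
For illustration, a version bump looks roughly like this (a hypothetical sketch: the deployment/container names and the exact tag are placeholders, not our actual manifests):

# Hypothetical example - names and tag are placeholders.
kubectl set image -n oncall deployment/oncall oncall=grafana/oncall:v1.9.20
kubectl rollout status -n oncall deployment/oncall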

I hope I have provided all the information there is - I looked around some more but couldn't find anything else. Sorry if I overlooked something or forgot to add it - I'm trying my best to work with the situation I am in. :)
