Alert groups reappear with a new ID after a while #4998

Open
senpro-ingwersenk opened this issue Sep 9, 2024 · 0 comments
Comments

@senpro-ingwersenk

What went wrong?

What happened:
We use Grafana's provisioning to provision alerts from a private Git repository. Recently, since around 1.8.x, my colleagues have reported from time to time that alerts just completely reappear and are now duplicates. Looking into the database itself, it seems that the alert group is completely re-created but matches the previous alerts.

Side-note: I "inherited" this setup from a former colleague who had less than five days to teach me how to manage and maintain it, and aside from me, nobody here really knows how to work with this kind of software (including MySQL and such), especially since it is all deployed in a Kubernetes (k3s) cluster. So, tl;dr: I have to administer all of this on my own, with no prior experience.

This is a screenshot of what we see:
[Screenshot showing the duplicated alert groups]

And this is what I see in the database:
mysql-output.txt
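
For reference, the output above was gathered roughly like this (a minimal sketch only: the mysql container name, the database name and the alerts_alertgroup table/column names are assumptions based on OnCall's Django schema and may differ in other setups):

# Illustrative only - container, database and table names are assumptions.
kubectl exec -it -n oncall deploy/oncall -c mysql -- \
  mysql -u root -p oncall \
  -e "SELECT id, public_primary_key, started_at, resolved, acknowledged
      FROM alerts_alertgroup
      ORDER BY started_at DESC
      LIMIT 20;"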

Here is the related provisioning snippet:

# alerts/group_msWinSrvDiskWarn.yaml
apiVersion: 1
groups:
    # (...)
    - orgId: 20
      name: Windows (Server)
      folder: Microsoft Corporation
      interval: 1m
      rules:
        - uid: wlzli-msWinSrvDiskWarn
          title: Disk Usage (Server) (Warning)
          condition: condition
          data:
            - refId: main
              relativeTimeRange:
                from: 21600
                to: 0
              datasourceUid: wlzli-microsoft
              model:
                datasource:
                    type: influxdb
                    uid: wlzli-microsoft
                intervalMs: 1000
                maxDataPoints: 43200
                query: "from(bucket: \"microsoft\")\r\n  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)\r\n  |> filter(fn: (r) => r[\"_measurement\"] == \"wmi.Win32_Volume\")\r\n  |> filter(fn: (r) => r[\"_field\"] == \"FreeSpace\" or r[\"_field\"] == \"Capacity\")\r\n  |> filter(fn: (r) => r[\"type\"] == \"server\")\r\n  |> pivot(rowKey:[\"_time\"], columnKey: [\"_field\"], valueColumn: \"_value\")\r\n  |> map(fn: (r) => ({ r with _value: (float(v: r.Capacity) - float(v: r.FreeSpace)) / float(v: r.Capacity) * 100.0}))\r\n  |> aggregateWindow(every: v.windowPeriod, fn: mean, createEmpty: false)\r\n  |> yield(name: \"mean\")"
                refId: main
            - refId: alert
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params: []
                        type: gt
                      operator:
                        type: and
                      query:
                        params:
                            - A
                      reducer:
                        params: []
                        type: last
                      type: query
                datasource:
                    type: __expr__
                    uid: __expr__
                expression: main
                intervalMs: 1000
                maxDataPoints: 43200
                reducer: last
                refId: alert
                type: reduce
            - refId: condition
              datasourceUid: __expr__
              model:
                conditions:
                    - evaluator:
                        params:
                            - 85
                        type: gt
                      operator:
                        type: and
                      query:
                        params:
                            - B
                      reducer:
                        params: []
                        type: last
                      type: query
                datasource:
                    type: __expr__
                    uid: __expr__
                expression: alert
                intervalMs: 1000
                maxDataPoints: 43200
                refId: condition
                type: threshold
          dashboardUid: ab310a98-5baf-4cda-bf4f-920bd19fbdc8
          panelId: 80
          noDataState: NoData
          execErrState: Error
          for: 5m
          annotations:
            __dashboardUid__: ab310a98-5baf-4cda-bf4f-920bd19fbdc8
            __panelId__: "80"
          labels:
            customer: wlzli
            severity: WARN
          isPaused: false

As you can see in the output, the same alert group gets recreated several times.

What did you expect to happen:
We expected that, after adding a note to the alert group and choosing either Acknowledge or Resolve, the group would keep this state until a new alert matching the criteria arrives.

How do we reproduce it?

I am unfortunately not aware of how to reproduce this - all I know is that the issue started happening after the first time I upgraded OnCall to 1.8.x. Before each upgrade I read all the changelogs, but found nothing - neither here nor in Grafana's - that would indicate something that had to be changed. Apologies for that. Though, due to my visual impairment, I would not be surprised if I overlooked something - it happens sometimes...

Grafana OnCall Version

1.9.20

Product Area

Alert Flow & Configuration

Grafana OnCall Platform?

Kubernetes

User's Browser?

Happens in Chrome, Firefox, Edge and Brave.

Anything else to add?

We deploy Grafana and OnCall in separate deployments:

root@senst-sv-k3s01 ~# kubectl get -n grafana all
NAME                           READY   STATUS    RESTARTS      AGE
pod/grafana-5689bb9b5c-ssmc6   4/4     Running   1 (91m ago)   91m

NAME              TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
service/grafana   ClusterIP   10.43.143.40   <none>        3000/TCP   418d

NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/grafana   1/1     1            1           418d

NAME                                 DESIRED   CURRENT   READY   AGE
replicaset.apps/grafana-5689bb9b5c   1         1         1       91m
replicaset.apps/grafana-59578dfdbd   0         0         0       2d19h
replicaset.apps/grafana-5b7cdd696d   0         0         0       2d21h
replicaset.apps/grafana-5cf74df48b   0         0         0       2d19h
replicaset.apps/grafana-5dbb58c767   0         0         0       2d19h
replicaset.apps/grafana-6546ffbb6b   0         0         0       2d21h
replicaset.apps/grafana-67977b5965   0         0         0       4h9m
replicaset.apps/grafana-7944d879c9   0         0         0       2d19h
replicaset.apps/grafana-7d68c77b4c   0         0         0       2d19h
replicaset.apps/grafana-95bcf4444    0         0         0       2d19h
replicaset.apps/grafana-99c7957b8    0         0         0       2d19h

root@senst-sv-k3s01 ~# kubectl get -n oncall all
NAME                          READY   STATUS    RESTARTS        AGE
pod/oncall-6c84c58bc4-8bszx   5/5     Running   155 (46m ago)   6d18h

NAME             TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE
service/oncall   ClusterIP   10.43.130.73   <none>        3306/TCP,5672/TCP,8080/TCP   378d

NAME                     READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/oncall   1/1     1            1           378d

NAME                                DESIRED   CURRENT   READY   AGE
replicaset.apps/oncall-54cd95b4dc   0         0         0       6d20h
replicaset.apps/oncall-587969d66f   0         0         0       9d
replicaset.apps/oncall-59dd5884df   0         0         0       75d
replicaset.apps/oncall-6476568b55   0         0         0       35d
replicaset.apps/oncall-6b5c4c87fb   0         0         0       6d19h
replicaset.apps/oncall-6c84c58bc4   1         1         1       6d18h
replicaset.apps/oncall-6d4ff5946c   0         0         0       34d
replicaset.apps/oncall-796fccc755   0         0         0       38d
replicaset.apps/oncall-79f6d85476   0         0         0       75d
replicaset.apps/oncall-7c486b8d59   0         0         0       75d
replicaset.apps/oncall-844fc69fc7   0         0         0       75d

The OnCall deployment bundles RabbitMQ, MySQL and Redis, while the Grafana deployment only bundles Postgres.

A Helm chart is not used - my former colleague wrote the manifests by hand (and it shows...), so we update versions manually by changing the version tag on the images, roughly as sketched below. The cluster is built on three nodes.
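
For illustration, a version bump looks roughly like this (a hypothetical sketch: the deployment/container names and the exact tag are placeholders, not our actual manifests):

# Hypothetical example - names and tag are placeholders.
kubectl set image -n oncall deployment/oncall oncall=grafana/oncall:v1.9.20
kubectl rollout status -n oncall deployment/oncall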

I hope I have provided all the information there is - I looked around some more but couldn't find anything else. Sorry if I overlooked something or forgot to add it - I'm trying my best to work with the situation I am in. :)
