chore(modules): add a new module for golden signal alerts based on newrelic_nrql_alert_condition (#2715)

Co-authored-by: pranav-new-relic <[email protected]>
shashank-reddy-nr and pranav-new-relic authored Jul 22, 2024
1 parent 3560052 commit a9e194c
Showing 9 changed files with 319 additions and 67 deletions.
92 changes: 92 additions & 0 deletions examples/modules/golden-signal-alerts-new/README.md
@@ -0,0 +1,92 @@
# Module: Golden Signal Alerts [New]:
This module encapsulates an alerting strategy based on the [Four Golden Signals](https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/#xref_monitoring_golden-signals) introduced in Google’s widely read book on [Site Reliability Engineering](https://landing.google.com/sre/sre-book/toc/index.html).

The signals chosen for this module are:

* *Latency*: High response time (seconds)
* *Traffic*: Low throughput (requests/minute)
* *Errors*: Error rate (percentage of transactions that result in errors)
* *Saturation*: CPU utilization (percentage utilized)

### Requirements
Applications making use of this module need to be reporting data into both APM and Infrastructure.

### Input variables
The following input variables are accepted by the module:

* `service`: An object describing the service to create alerts for, with the following attributes:
  * `name`: The APM application name as reported to New Relic
  * `threshold_duration`: The duration, in seconds, for which the condition must violate the threshold before a violation is created
  * `cpu_threshold`: The critical threshold of the CPU utilization condition, as a percentage
  * `error_percentage_threshold`: The critical threshold of the error rate condition, as a percentage
  * `response_time_threshold`: The critical threshold of the response time condition, in seconds
  * `throughput_threshold`: The critical threshold of the throughput condition, in requests/minute
* `notification_channel_ids`: The IDs of the notification channels to notify, as a list of strings
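
As a hedged illustration of how the `threshold_duration` input could be guarded, the sketch below adds a `validation` block to the module's `service` variable. The validation is hypothetical and not part of the module as committed. It assumes the conditions keep the provider's default aggregation window of 60 seconds and that `threshold_duration` should be a multiple of that window; check the `newrelic_nrql_alert_condition` documentation before relying on it.

```terraform
variable "service" {
  description = "The service to create alerts for"
  type = object({
    name                       = string
    threshold_duration         = number
    cpu_threshold              = number
    response_time_threshold    = number
    error_percentage_threshold = number
    throughput_threshold       = number
  })

  # Hypothetical check (not in the committed module): reject durations that
  # are shorter than 60 seconds or not a whole number of 60-second windows.
  validation {
    condition     = var.service.threshold_duration >= 60 && var.service.threshold_duration % 60 == 0
    error_message = "The threshold_duration value must be a multiple of 60 seconds and at least 60."
  }
}
```

With this in place, a value such as `420` would pass, while `90` would be rejected at plan time instead of by the New Relic API.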

### Outputs
The following output values are provided by the module:

* `policy_id`: The ID of the created alert policy
* `cpu_condition_id`: The ID of the created high CPU alert condition
* `error_percentage_condition_id`: The ID of the created error percentage alert condition
* `response_time_condition_id`: The ID of the created response time alert condition
* `throughput_condition_id`: The ID of the created throughput alert condition
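
A minimal sketch of how these outputs might be consumed from a root configuration. The module label `webportal_alerts` matches the usage example in this README; the output names `golden_signal_policy_id` and `golden_signal_condition_ids` are hypothetical.

```terraform
# Re-export the ID of the alert policy created by the module.
output "golden_signal_policy_id" {
  value = module.webportal_alerts.policy_id
}

# Collect the four condition IDs into a single map for convenience.
output "golden_signal_condition_ids" {
  value = {
    cpu              = module.webportal_alerts.cpu_condition_id
    error_percentage = module.webportal_alerts.error_percentage_condition_id
    response_time    = module.webportal_alerts.response_time_condition_id
    throughput       = module.webportal_alerts.throughput_condition_id
  }
}
```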


### Example usage
```terraform
data "newrelic_notification_destination" "webhook_destination" {
  name = "Golden Signal Webhook Testing"
}

# Resource
resource "newrelic_notification_channel" "webhook_notification_channel" {
  name           = "webhook-example"
  type           = "WEBHOOK"
  destination_id = data.newrelic_notification_destination.webhook_destination.id
  product        = "IINT"

  property {
    key   = "payload"
    value = "{\n\t\"name\": \"foo\"\n}"
    label = "Payload Template"
  }
}

data "newrelic_notification_destination" "email_destination" {
  name = "golden signals testing mail"
}

resource "newrelic_notification_channel" "email_notification_channel" {
  name           = "email-example"
  type           = "EMAIL"
  destination_id = data.newrelic_notification_destination.email_destination.id
  product        = "IINT"

  property {
    key   = "subject"
    value = "New Subject Title"
  }

  property {
    key   = "customDetailsEmail"
    value = "issue id - {{issueId}}"
  }
}

module "webportal_alerts" {
  // Please specify the path of the source of this module according to the location you've placed the module in.
  // The path specified below assumes you're using this module from a clone of this repo, in the `newrelic.tf` file in the `testing` folder.
  // However, if you'd like to use a remote version of this module (without a cloned version of this), the right value of the argument source would be "github.com/newrelic/terraform-provider-newrelic//examples/modules/golden-signal-alerts-new".
  source = "../examples/modules/golden-signal-alerts-new"

  notification_channel_ids = [
    newrelic_notification_channel.webhook_notification_channel.id,
    newrelic_notification_channel.email_notification_channel.id,
  ]

  service = {
    name                       = "Dummy App Pro Max"
    threshold_duration         = 420
    cpu_threshold              = 90
    response_time_threshold    = 5
    error_percentage_threshold = 10
    throughput_threshold       = 300
  }
}
```
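
To cover several applications, the same module can be instantiated once per service. The following is a sketch only, assuming Terraform 0.13 or later (which supports `for_each` on modules). The application names, the thresholds, and the `services` local are hypothetical placeholders; the notification channel reference reuses the email channel from the example above.

```terraform
locals {
  # Hypothetical per-service settings; each name must match an APM
  # application that already reports to New Relic.
  services = {
    web_portal = {
      name                       = "Web Portal"
      threshold_duration         = 300
      cpu_threshold              = 90
      response_time_threshold    = 5
      error_percentage_threshold = 10
      throughput_threshold       = 300
    }
    checkout = {
      name                       = "Checkout"
      threshold_duration         = 300
      cpu_threshold              = 80
      response_time_threshold    = 2
      error_percentage_threshold = 5
      throughput_threshold       = 100
    }
  }
}

module "golden_signal_alerts" {
  source   = "../examples/modules/golden-signal-alerts-new"
  for_each = local.services

  notification_channel_ids = [newrelic_notification_channel.email_notification_channel.id]
  service                  = each.value
}
```

Each instance creates its own policy, conditions, and workflow, and its outputs are addressed as, for example, `module.golden_signal_alerts["checkout"].policy_id`.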
103 changes: 103 additions & 0 deletions examples/modules/golden-signal-alerts-new/main.tf
@@ -0,0 +1,103 @@
data "newrelic_entity" "application" {
  name   = var.service.name
  type   = "APPLICATION"
  domain = "APM"
}

resource "newrelic_alert_policy" "golden_signal_policy" {
  name = "Golden Signals - ${var.service.name}"
}

resource "newrelic_nrql_alert_condition" "response_time_web" {
  policy_id   = newrelic_alert_policy.golden_signal_policy.id
  name        = "High Response Time (web)"
  fill_option = "static"
  fill_value  = 0

  nrql {
    query = "SELECT filter(average(newrelic.timeslice.value), WHERE metricTimesliceName = 'HttpDispatcher') OR 0 FROM Metric WHERE appId IN (${data.newrelic_entity.application.application_id}) AND metricTimesliceName IN ('HttpDispatcher', 'Agent/MetricsReported/count') FACET appId"
  }

  critical {
    operator              = "above"
    threshold             = var.service.response_time_threshold
    threshold_duration    = var.service.threshold_duration
    threshold_occurrences = "all"
  }
}

resource "newrelic_nrql_alert_condition" "throughput_web" {
  policy_id   = newrelic_alert_policy.golden_signal_policy.id
  name        = "Low Throughput (web)"
  fill_option = "static"
  fill_value  = 0

  nrql {
    query = "SELECT filter(count(newrelic.timeslice.value), WHERE metricTimesliceName = 'HttpDispatcher') OR 0 FROM Metric WHERE appId IN (${data.newrelic_entity.application.application_id}) AND metricTimesliceName IN ('HttpDispatcher', 'Agent/MetricsReported/count') FACET appId"
  }

  critical {
    operator              = "below"
    threshold             = var.service.throughput_threshold
    threshold_duration    = var.service.threshold_duration
    threshold_occurrences = "all"
  }
}

resource "newrelic_nrql_alert_condition" "error_percentage" {
  policy_id   = newrelic_alert_policy.golden_signal_policy.id
  name        = "High Error Percentage"
  fill_option = "static"
  fill_value  = 0

  nrql {
    query = "SELECT ((filter(count(newrelic.timeslice.value), WHERE metricTimesliceName = 'Errors/all') / filter(count(newrelic.timeslice.value), WHERE metricTimesliceName IN ('HttpDispatcher', 'OtherTransaction/all'))) OR 0) * 100 FROM Metric WHERE appId IN (${data.newrelic_entity.application.application_id}) AND metricTimesliceName IN ('Errors/all', 'HttpDispatcher', 'OtherTransaction/all', 'Agent/MetricsReported/count') FACET appId"
  }

  critical {
    operator              = "above"
    threshold             = var.service.error_percentage_threshold
    threshold_duration    = var.service.threshold_duration
    threshold_occurrences = "all"
  }
}

resource "newrelic_nrql_alert_condition" "high_cpu" {
  policy_id   = newrelic_alert_policy.golden_signal_policy.id
  name        = "High CPU usage"
  fill_option = "static"
  fill_value  = 0

  nrql {
    query = "SELECT average(cpuPercent) FROM SystemSample WHERE (`applicationId` = '${data.newrelic_entity.application.application_id}') FACET entityId"
  }

  critical {
    operator              = "above"
    threshold             = var.service.cpu_threshold
    threshold_duration    = var.service.threshold_duration
    threshold_occurrences = "all"
  }
}

resource "newrelic_workflow" "golden_signal_workflow" {
  name                  = "Golden Signals Workflow ${var.service.name}"
  muting_rules_handling = "NOTIFY_ALL_ISSUES"

  issues_filter {
    name = "Golden signal policy Ids filter"
    type = "FILTER"

    predicate {
      attribute = "labels.policyIds"
      operator  = "EXACTLY_MATCHES"
      values    = [newrelic_alert_policy.golden_signal_policy.id]
    }
  }

  dynamic "destination" {
    for_each = var.notification_channel_ids
    content {
      channel_id = destination.value
    }
  }
}
19 changes: 19 additions & 0 deletions examples/modules/golden-signal-alerts-new/outputs.tf
@@ -0,0 +1,19 @@
output "policy_id" {
  value = newrelic_alert_policy.golden_signal_policy.id
}

output "response_time_condition_id" {
  value = newrelic_nrql_alert_condition.response_time_web.id
}

output "throughput_condition_id" {
  value = newrelic_nrql_alert_condition.throughput_web.id
}

output "error_percentage_condition_id" {
  value = newrelic_nrql_alert_condition.error_percentage.id
}

output "cpu_condition_id" {
  value = newrelic_nrql_alert_condition.high_cpu.id
}
7 changes: 7 additions & 0 deletions examples/modules/golden-signal-alerts-new/providers.tf
@@ -0,0 +1,7 @@
terraform {
  required_providers {
    newrelic = {
      source = "newrelic/newrelic"
    }
  }
}
16 changes: 16 additions & 0 deletions examples/modules/golden-signal-alerts-new/variables.tf
@@ -0,0 +1,16 @@
variable "service" {
  description = "The service to create alerts for"
  type = object({
    name                       = string
    threshold_duration         = number
    cpu_threshold              = number
    response_time_threshold    = number
    error_percentage_threshold = number
    throughput_threshold       = number
  })
}

variable "notification_channel_ids" {
  description = "The IDs of notification channels to add to this policy"
  type        = list(string)
}
12 changes: 10 additions & 2 deletions examples/modules/golden-signal-alerts/README.md
@@ -1,4 +1,12 @@
-# [Golden Signal Alerts](modules/golden-signal-alerts)
+# Module: Golden Signal Alerts [Deprecated]:

**⚠ WARNING**:

This module, [golden-signal-alerts](https://github.com/newrelic/terraform-provider-newrelic/tree/main/examples/modules/golden-signal-alerts), relies on several resources in the New Relic Terraform Provider that have been **deprecated** and will be removed in the next major release. These resources include `newrelic_alert_policy_channel`, `newrelic_infra_alert_condition`, and `newrelic_alert_condition`.

To set up golden signal alerts with a similar module built on the newer replacements for these legacy resources, **please use the recently added [golden-signal-alerts-new](https://github.com/newrelic/terraform-provider-newrelic/tree/main/examples/modules/golden-signal-alerts-new) module instead**.
______

This module encapsulates an alerting strategy based on the [Four Golden Signals](https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/#xref_monitoring_golden-signals) introduced in Google’s widely read book on [Site Reliability Engineering](https://landing.google.com/sre/sre-book/toc/index.html).

The signals chosen for this module are:
@@ -17,7 +25,7 @@ The following input variables are accepted by the module:
* `name`: The APM application name as reported to New Relic
* `duration`: The duration to evaluate the alert conditions over, in minutes
* `cpu_threshold`: The critical threshold of the CPU utilization condition, as a percentage
-* `error_percentage_threshold`: The critical threshold of the error rate condition, in errors/min
+* `error_percentage_threshold`: The critical threshold of the error rate condition, as a percentage
* `response_time_threshold`: The critical threshold of the response time condition, in seconds
* `throughput_threshold`: The critical threshold of the throughput condition, in requests/min
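
When moving from this module to `golden-signal-alerts-new`, note that `duration` here is expressed in minutes, whereas the new module's `threshold_duration` is expressed in seconds. The sketch below shows one way the conversion might look. The local `legacy_duration_minutes`, the application name, the threshold values, and the `email_notification_channel` reference are hypothetical; the remote `source` string is the one given in the new module's README. The remaining thresholds should be re-checked against the units listed there.

```terraform
locals {
  # Hypothetical value previously passed as `duration` (minutes).
  legacy_duration_minutes = 5
}

module "webportal_alerts" {
  source = "github.com/newrelic/terraform-provider-newrelic//examples/modules/golden-signal-alerts-new"

  # Placeholder: the ID of a newrelic_notification_channel resource
  # defined elsewhere in the configuration.
  notification_channel_ids = [newrelic_notification_channel.email_notification_channel.id]

  service = {
    name                       = "WebPortal"
    threshold_duration         = local.legacy_duration_minutes * 60 # minutes -> seconds
    cpu_threshold              = 90
    response_time_threshold    = 5
    error_percentage_threshold = 5
    throughput_threshold       = 5
  }
}
```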
