chore(modules): add a new module for golden signal alerts based on newrelic_nrql_alert_condition #2715

Merged: 5 commits, Jul 22, 2024
92 changes: 92 additions & 0 deletions examples/modules/golden-signal-alerts-new/README.md
@@ -0,0 +1,92 @@
# Module: Golden Signal Alerts [New]:
This module encapsulates an alerting strategy based on the [Four Golden Signals](https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/#xref_monitoring_golden-signals) introduced in Google’s widely read book on [Site Reliability Engineering](https://landing.google.com/sre/sre-book/toc/index.html).

The signals chosen for this module are:

* *Latency*: High response time (seconds)
* *Traffic*: Low throughput (requests/minute)
* *Errors*: Error rate (percentage of transactions that report an error)
* *Saturation*: CPU utilization (percentage utilized)

### Requirements
Applications making use of this module need to be reporting data into both APM and Infrastructure.

### Input variables
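The module accepts two input variables: `notification_channel_ids`, a list of notification channel IDs used as destinations of the module's workflow, and `service`, an object with the following attributes: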

* `name`: The APM application name as reported to New Relic
* `threshold_duration`: The duration, in seconds, that the condition must violate the threshold before creating a violation.
* `cpu_threshold`: The critical threshold of the CPU utilization condition, as a percentage
* `error_percentage_threshold`: The critical threshold of the error rate condition, as a percentage
* `response_time_threshold`: The critical threshold of the response time condition, in seconds
* `throughput_threshold`: The critical threshold of the throughput condition, in requests/minute (requests are counted per aggregation window, which is left at the provider default)

### Outputs
The following output values are provided by the module:

* `policy_id`: The ID of the created alert policy
* `cpu_condition_id`: The ID of the created high CPU alert condition
* `error_percentage_condition_id`: The ID of the created error percentage alert condition
* `response_time_condition_id`: The ID of the created response time alert condition
* `throughput_condition_id`: The ID of the created throughput alert condition
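
As a minimal sketch of how these outputs might be consumed, a root configuration could re-export them. This assumes the module is instantiated under the name `webportal_alerts`, as in the example usage in this README; the output names `golden_signal_policy_id` and `golden_signal_condition_ids` are hypothetical and chosen only for illustration.

```terraform
# Assumes a module block named "webportal_alerts" exists in the same configuration.
# The output names below are illustrative, not part of the module.
output "golden_signal_policy_id" {
  value = module.webportal_alerts.policy_id
}

output "golden_signal_condition_ids" {
  value = {
    cpu              = module.webportal_alerts.cpu_condition_id
    error_percentage = module.webportal_alerts.error_percentage_condition_id
    response_time    = module.webportal_alerts.response_time_condition_id
    throughput       = module.webportal_alerts.throughput_condition_id
  }
}
```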


### Example usage
```terraform

data "newrelic_notification_destination" "webhook_destination" {
  name = "Golden Signal Webhook Testing"
}

# Resource
resource "newrelic_notification_channel" "webhook_notification_channel" {
  name           = "webhook-example"
  type           = "WEBHOOK"
  destination_id = data.newrelic_notification_destination.webhook_destination.id
  product        = "IINT"

  property {
    key   = "payload"
    value = "{\n\t\"name\": \"foo\"\n}"
    label = "Payload Template"
  }
}

data "newrelic_notification_destination" "email_destination" {
  name = "golden signals testing mail"
}

resource "newrelic_notification_channel" "email_notification_channel" {
  name           = "email-example"
  type           = "EMAIL"
  destination_id = data.newrelic_notification_destination.email_destination.id
  product        = "IINT"

  property {
    key   = "subject"
    value = "New Subject Title"
  }

  property {
    key   = "customDetailsEmail"
    value = "issue id - {{issueId}}"
  }
}

module "webportal_alerts" {
  // Please specify the path of the source of this module according to the location you've placed the module in.
  // The path specified below assumes you're using this module from a clone of this repo, in the `newrelic.tf` file in the `testing` folder.
  // However, if you'd like to use a remote version of this module (without a cloned version of this), the right value of the argument source would be "github.com/newrelic/terraform-provider-newrelic//examples/modules/golden-signal-alerts-new".
  source                   = "../examples/modules/golden-signal-alerts-new"
  notification_channel_ids = [newrelic_notification_channel.webhook_notification_channel.id, newrelic_notification_channel.email_notification_channel.id]

  service = {
    name                       = "Dummy App Pro Max"
    threshold_duration         = 420
    cpu_threshold              = 90
    response_time_threshold    = 5
    error_percentage_threshold = 10
    throughput_threshold       = 300
  }
}
```
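
For use without a local clone, the comment in the example above gives a remote `source` value. A minimal sketch of that variant follows. The module name `webportal_alerts_remote` is hypothetical, no version is pinned, and the block assumes the `email_notification_channel` resource from the example above exists in the same configuration.

```terraform
module "webportal_alerts_remote" {
  # Remote source as given in the comment of the example above; no ref/version is pinned in this sketch.
  source = "github.com/newrelic/terraform-provider-newrelic//examples/modules/golden-signal-alerts-new"

  # Assumes the email notification channel from the example above is defined alongside this block.
  notification_channel_ids = [newrelic_notification_channel.email_notification_channel.id]

  service = {
    name                       = "Dummy App Pro Max"
    threshold_duration         = 420
    cpu_threshold              = 90
    response_time_threshold    = 5
    error_percentage_threshold = 10
    throughput_threshold       = 300
  }
}
```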
103 changes: 103 additions & 0 deletions examples/modules/golden-signal-alerts-new/main.tf
@@ -0,0 +1,103 @@
data "newrelic_entity" "application" {
  name   = var.service.name
  type   = "APPLICATION"
  domain = "APM"
}

resource "newrelic_alert_policy" "golden_signal_policy" {
  name = "Golden Signals - ${var.service.name}"
}

resource "newrelic_nrql_alert_condition" "response_time_web" {
  policy_id   = newrelic_alert_policy.golden_signal_policy.id
  name        = "High Response Time (web)"
  fill_option = "static"
  fill_value  = 0

  nrql {
    query = "SELECT filter(average(newrelic.timeslice.value), WHERE metricTimesliceName = 'HttpDispatcher') OR 0 FROM Metric WHERE appId IN (${data.newrelic_entity.application.application_id}) AND metricTimesliceName IN ('HttpDispatcher', 'Agent/MetricsReported/count') FACET appId"
  }

  critical {
    operator              = "above"
    threshold             = var.service.response_time_threshold
    threshold_duration    = var.service.threshold_duration
    threshold_occurrences = "all"
  }
}

resource "newrelic_nrql_alert_condition" "throughput_web" {
  policy_id   = newrelic_alert_policy.golden_signal_policy.id
  name        = "Low Throughput (web)"
  fill_option = "static"
  fill_value  = 0

  nrql {
    query = "SELECT filter(count(newrelic.timeslice.value), WHERE metricTimesliceName = 'HttpDispatcher') OR 0 FROM Metric WHERE appId IN (${data.newrelic_entity.application.application_id}) AND metricTimesliceName IN ('HttpDispatcher', 'Agent/MetricsReported/count') FACET appId"
  }

  critical {
    operator              = "below"
    threshold             = var.service.throughput_threshold
    threshold_duration    = var.service.threshold_duration
    threshold_occurrences = "all"
  }
}

resource "newrelic_nrql_alert_condition" "error_percentage" {
  policy_id   = newrelic_alert_policy.golden_signal_policy.id
  name        = "High Error Percentage"
  fill_option = "static"
  fill_value  = 0

  nrql {
    query = "SELECT ((filter(count(newrelic.timeslice.value), where metricTimesliceName = 'Errors/all') / filter(count(newrelic.timeslice.value), WHERE metricTimesliceName IN ('HttpDispatcher', 'OtherTransaction/all'))) OR 0) * 100 FROM Metric WHERE appId IN (${data.newrelic_entity.application.application_id}) AND metricTimesliceName IN ('Errors/all', 'HttpDispatcher', 'OtherTransaction/all', 'Agent/MetricsReported/count') FACET appId"
  }

  critical {
    operator              = "above"
    threshold             = var.service.error_percentage_threshold
    threshold_duration    = var.service.threshold_duration
    threshold_occurrences = "all"
  }
}

resource "newrelic_nrql_alert_condition" "high_cpu" {
  policy_id   = newrelic_alert_policy.golden_signal_policy.id
  name        = "High CPU usage"
  fill_option = "static"
  fill_value  = 0

  nrql {
    query = "SELECT average(cpuPercent) FROM SystemSample WHERE (`applicationId` = '${data.newrelic_entity.application.application_id}') FACET entityId"
  }

  critical {
    operator              = "above"
    threshold             = var.service.cpu_threshold
    threshold_duration    = var.service.threshold_duration
    threshold_occurrences = "all"
  }
}

resource "newrelic_workflow" "golden_signal_workflow" {
  name                  = "Golden Signals Workflow ${var.service.name}"
  muting_rules_handling = "NOTIFY_ALL_ISSUES"

  issues_filter {
    name = "Golden signal policy IDs filter"
    type = "FILTER"

    predicate {
      attribute = "labels.policyIds"
      operator  = "EXACTLY_MATCHES"
      values    = [newrelic_alert_policy.golden_signal_policy.id]
    }
  }

  dynamic "destination" {
    for_each = var.notification_channel_ids
    content {
      channel_id = destination.value
    }
  }
}
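
The conditions in this file do not set an aggregation window, so the throughput query counts requests per window at whatever the provider default is. As a sketch only (not part of this PR), the throughput condition could state the window explicitly so the unit of `throughput_threshold` is unambiguous. The resource name `throughput_web_explicit` is hypothetical, and the aggregation values shown are assumed to mirror the provider defaults; they should be checked against the `newrelic_nrql_alert_condition` documentation.

```terraform
# Illustrative variant of the throughput condition; the resource name is hypothetical.
resource "newrelic_nrql_alert_condition" "throughput_web_explicit" {
  policy_id   = newrelic_alert_policy.golden_signal_policy.id
  name        = "Low Throughput (web)"
  fill_option = "static"
  fill_value  = 0

  # Assumed to match provider defaults: with a 60-second window,
  # count(...) is evaluated as requests per minute.
  aggregation_window = 60
  aggregation_method = "event_flow"
  aggregation_delay  = 120

  nrql {
    query = "SELECT filter(count(newrelic.timeslice.value), WHERE metricTimesliceName = 'HttpDispatcher') OR 0 FROM Metric WHERE appId IN (${data.newrelic_entity.application.application_id}) AND metricTimesliceName IN ('HttpDispatcher', 'Agent/MetricsReported/count') FACET appId"
  }

  critical {
    operator              = "below"
    threshold             = var.service.throughput_threshold
    threshold_duration    = var.service.threshold_duration
    threshold_occurrences = "all"
  }
}
```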
19 changes: 19 additions & 0 deletions examples/modules/golden-signal-alerts-new/outputs.tf
@@ -0,0 +1,19 @@
output "policy_id" {
  value = newrelic_alert_policy.golden_signal_policy.id
}

output "response_time_condition_id" {
  value = newrelic_nrql_alert_condition.response_time_web.id
}

output "throughput_condition_id" {
  value = newrelic_nrql_alert_condition.throughput_web.id
}

output "error_percentage_condition_id" {
  value = newrelic_nrql_alert_condition.error_percentage.id
}

output "cpu_condition_id" {
  value = newrelic_nrql_alert_condition.high_cpu.id
}
7 changes: 7 additions & 0 deletions examples/modules/golden-signal-alerts-new/providers.tf
@@ -0,0 +1,7 @@
terraform {
  required_providers {
    newrelic = {
      source = "newrelic/newrelic"
    }
  }
}
16 changes: 16 additions & 0 deletions examples/modules/golden-signal-alerts-new/variables.tf
@@ -0,0 +1,16 @@
variable "service" {
  description = "The service to create alerts for"
  type = object({
    name                       = string
    threshold_duration         = number
    cpu_threshold              = number
    response_time_threshold    = number
    error_percentage_threshold = number
    throughput_threshold       = number
  })
}

variable "notification_channel_ids" {
  description = "The IDs of notification channels to add to this policy"
  type        = list(string)
}
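
The `service` variable above accepts any numbers. As a hedged sketch (not part of this PR), `validation` blocks could reject values the conditions are unlikely to accept. This assumes that `threshold_duration` must be a multiple of the aggregation window and that the window is left at an assumed default of 60 seconds; both assumptions should be verified against the provider documentation. The block would replace the existing `service` declaration rather than sit alongside it.

```terraform
# Illustrative replacement for the "service" variable with added validation.
variable "service" {
  description = "The service to create alerts for"
  type = object({
    name                       = string
    threshold_duration         = number
    cpu_threshold              = number
    response_time_threshold    = number
    error_percentage_threshold = number
    throughput_threshold       = number
  })

  # Assumption: the aggregation window is 60 seconds and the duration must be a multiple of it.
  validation {
    condition     = var.service.threshold_duration >= 60 && var.service.threshold_duration % 60 == 0
    error_message = "The threshold_duration value must be a positive multiple of 60 seconds."
  }

  # CPU utilization is a percentage, so thresholds outside (0, 100] can never be meaningful.
  validation {
    condition     = var.service.cpu_threshold > 0 && var.service.cpu_threshold <= 100
    error_message = "The cpu_threshold value must be greater than 0 and at most 100."
  }
}
```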
12 changes: 10 additions & 2 deletions examples/modules/golden-signal-alerts/README.md
@@ -1,4 +1,12 @@
-# [Golden Signal Alerts](modules/golden-signal-alerts)
+# Module: Golden Signal Alerts [Deprecated]:
+
+**⚠ WARNING**:
+
+This module, [golden-signal-alerts](https://github.com/newrelic/terraform-provider-newrelic/tree/main/examples/modules/golden-signal-alerts), relies on several resources in the New Relic Terraform Provider that have been **deprecated** and will be removed in the next major release: `newrelic_alert_policy_channel`, `newrelic_infra_alert_condition`, and `newrelic_alert_condition`.
+
+To set up golden signal alerts with the newer replacements for these legacy resources, **please use the recently added successor to this module: [golden-signal-alerts-new](https://github.com/newrelic/terraform-provider-newrelic/tree/main/examples/modules/golden-signal-alerts-new)**.
+______
+
This module encapsulates an alerting strategy based on the [Four Golden Signals](https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/#xref_monitoring_golden-signals) introduced in Google’s widely read book on [Site Reliability Engineering](https://landing.google.com/sre/sre-book/toc/index.html).

The signals chosen for this module are:
@@ -17,7 +25,7 @@ The following input variables are accepted by the module:
* `name`: The APM application name as reported to New Relic
* `duration`: The duration to evaluate the alert conditions over, in minutes
* `cpu_threshold`: The critical threshold of the CPU utilization condition, as a percentage
-* `error_percentage_threshold`: The critical threshold of the error rate condition, in errors/min
+* `error_percentage_threshold`: The critical threshold of the error rate condition, as a percentage
* `response_time_threshold`: The critical threshold of the response time condition, in seconds
* `throughput_threshold`: The critical threshold of the throughput condition, in requests/min
