fix: cleanup the drafts

gregswift committed Aug 4, 2024
1 parent f6b8420 commit 5d0aad9
Showing 4 changed files with 43 additions and 38 deletions.
5 changes: 4 additions & 1 deletion .markdownlint.json
@@ -1,4 +1,7 @@
{
  "line-length": false,
-  "no-inline-html": false
+  "no-inline-html": false,
+  "no-duplicate-heading": {
+    "siblings_only": true
+  }
}
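A note on what the new setting above actually changes: markdownlint's `no-duplicate-heading` rule (MD024) normally flags any two headings with identical text anywhere in a file; with `siblings_only` it only flags duplicates that share the same parent heading. That is presumably why it is enabled here, since posts in this repo repeat a `### Examples` heading per section. A minimal sketch with hypothetical headings:

```markdown
## Workload
### Examples
## Account
### Examples  <!-- allowed now: same text, but a different parent -->
### Examples  <!-- still flagged: duplicates its sibling directly above -->
```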
2 changes: 1 addition & 1 deletion content/posts/environment-aspects.md
@@ -23,14 +23,14 @@ generation

6? That seems like a lot, or maybe not enough? You don't think you need all of this? I mean, fair. I didn't use to either. It's important to note that order in the hierarchy may vary between companies, but variance inside a company is a risk to future maintainers, who will likely curse you.


Before we start, it's important to call out [Greg's Pro-Tip #1 - NONE of these aspects should tie to your business organizational structure.]

## Workload

Each Application may have multiple components, such as an api, web, database, cache layer, or dozens of other types. These are all different Workload types; you may have more, you may have less. You may not even run them on separate servers when you are first starting off, which is not a great idea, but it used to be super common.

### Examples

* api
* web
* db
68 changes: 36 additions & 32 deletions content/posts/getting-more-bang-for-your-buck-with-pagerduty.md
@@ -3,47 +3,51 @@ title = "Getting more bang for your buck with PagerDuty"
date = 2023-05-31T10:49:27-05:00
draft = true
+++
-# Problems or at least statements of reality

+## Problems or at least statements of reality

* Dealing w/ alerting on build/decom seems to be consistently painful and inconsistently handled
* Most companies don’t leverage PagerDuty’s services as services; they do Service = ( Team | Urgency for a Team | Alert Channel for a Team)


I've now worked with PagerDuty for over a decade and talked to engineers at multiple companies that use it. I get why it's still top of the heap, but really, most of them barely leverage PagerDuty except as a scheduler+pager. This is its original purpose, but just for that it has become a very expensive path.

Very little in PagerDuty seems to be terraformized.
Most companies don’t have a lot of encouraged, shareable best practices around using PagerDuty; it’s very much “there is our tool”

-(and it seems like our integration is actually rather weak? i’m talkin
-@bgoleno
-about this)
-High level thoughts
-Manage PagerDuty more programmatically (read: terraform)
-Specifically services and event rules
-These can be exposed as modules but also maybe we can make it so people don’t even need to use the modules directly, its just part of the build
-Arguably even GC or treasuremap could be utilized for this
-Represent our architecture in PagerDuty (and also New Relic) rather than representing teams, specifically by utilizing PagerDuty’s basic Technical and Business Services model.
-App/Tech service = ${cell}_${appstack}_${application} (appstack preferably should not == org/team names but.. someone will do it)
-Business services that define dependencies with other Business and Tech Services to show:
-A tier (test,staging,perf,prod) Test Environment US Production (Which, mixing
-A cell test-odd-wire
-An app stack for a cell: ${tier} Kafka Platform
-A Feature: ${tier} - Feature - ${feature} ie: Test - Feature - Ingest
-In general I’d prefer that tier get left off the last 2, but simplified data model, ya know?
-But now that is a lot of services defined in PagerDuty. How does the data get to the right places?
-Each Appstack can create a Event Ruleset in Pagerduty. This acts as a single “entry” point to pagerduty that they would configure as a notification channel in New Relic. Example in PagerDuty
-Then as part of setting up each new tech service on cell build, a rule gets added to that Event Ruleset with whatever is appropriately usable metadata in the alert to route the notification to the right Tech Service.
+(and it seems like our integration is actually rather weak? i’m talkin to @bgoleno about this)

+## High level thoughts

+* Manage PagerDuty more programmatically (read: terraform)
+  * Specifically services and event rules
+  * These can be exposed as modules, but also maybe we can make it so people don’t even need to use the modules directly, it’s just part of the build
+  * Service catalog could be utilized for this
+* Represent our architecture in PagerDuty rather than representing teams, specifically by utilizing PagerDuty’s basic Technical and Business Services model.
+  * App/Tech service = ${cell}_${appstack}_${application} (appstack preferably should not == org/team names but.. someone will do it)
+  * Business services that define dependencies with other Business and Tech Services to show:
+    * A tier (test,staging,perf,prod): Test Environment, US Production (Which, mixing
+    * A cell: test-odd-wire
+    * An app stack for a cell: ${tier} ${Stack}
+    * A Feature: ${tier} - Feature - ${feature}
+  * In general I’d prefer that tier get left off the last 2, but simplified data model, ya know?
+* But now that is a lot of services defined in PagerDuty. How does the data get to the right places?
+  * Each Appstack can create an Event Ruleset in PagerDuty. This acts as a single “entry” point to PagerDuty that they would configure as a notification channel in New Relic. Example in PagerDuty
+  * Then as part of setting up each new tech service on cell build, a rule gets added to that Event Ruleset with whatever is appropriately usable metadata in the alert to route the notification to the right Tech Service.

This may take some tuning, and also recommending practices to teams, but we can probably come up with a decent rule set that would work for most and “do it for them” like above as well.
But wait.. what about associations w/ escalation policies and such?
Teams can keep managing those as they do; they just need to provide the right name/id of the escalation policy for the service creation steps.
-Challenges
-Adoption, but the more of this that is centrally built and managed the easier that is
-Metadata and tuning for routing event rules - apparently we are all over the place here.. but that is in itself a problem we should be addressing
-nrrdbot and herobot et al rely on Tech services, not escalation policies
-1: can still have a fall through Tech Service (their existing one) that the bots and EventRuleset can use to fall back on
-2: Just change them to support escalation policies instead ?
-Adding dependencies in PagerDuty Business Services ( although, arguably this could be generated from datanerd.yml or grandcentral and managed separately)
-Benefits
-Can disable an entire cell (or just a single service in the cell) from paging without impacting other cells!
-The dependency mapping becomes a thing visible in the Service Graph in PagerDuty, that will also show “status” and support Subscribing to services you care about, but also for when something is having issues in our own stack.
-Better reporting in PagerDuty (although NrAiIncidents being exposed makes this way better already)

+## Challenges

+* Adoption, but the more of this that is centrally built and managed, the easier that is
+* Metadata and tuning for routing event rules - apparently we are all over the place here.. but that is in itself a problem we should be addressing
+  1. Can still have a fall-through Tech Service (their existing one) that the bots and EventRuleset can use to fall back on
+  2. Just change them to support escalation policies instead?
+* Adding dependencies in PagerDuty Business Services (although, arguably, this could be generated from datanerd.yml or grandcentral and managed separately)

+## Benefits

+* Can disable an entire set of services, or just a single service, from paging without impacting other sets
+* The dependency mapping becomes visible in the Service Graph in PagerDuty, which will also show “status” and support subscribing to services you care about, including when something in our own stack is having issues.
+* Better reporting in PagerDuty
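
As a companion to the “High level thoughts” list above, here is a minimal Terraform sketch of the proposed shape, using the PagerDuty provider's `pagerduty_service`, `pagerduty_business_service`, and `pagerduty_service_dependency` resources. Every name, variable default, and the referenced escalation policy is a hypothetical placeholder; this illustrates the idea rather than anything actually in the commit:

```hcl
terraform {
  required_providers {
    pagerduty = {
      source = "PagerDuty/pagerduty"
    }
  }
}

# Hypothetical naming inputs for ${cell}_${appstack}_${application}.
variable "cell" {
  default = "test-odd-wire"
}
variable "appstack" {
  default = "kafka"
}
variable "application" {
  default = "ingest"
}

# Teams keep owning their escalation policies; they only hand over the name/id.
data "pagerduty_escalation_policy" "team" {
  name = "Example Team Escalation Policy" # hypothetical
}

# Technical service: one application in one cell.
resource "pagerduty_service" "app" {
  name              = "${var.cell}_${var.appstack}_${var.application}"
  escalation_policy = data.pagerduty_escalation_policy.team.id
}

# Business service: the app stack for a tier, e.g. "${tier} ${Stack}".
resource "pagerduty_business_service" "stack" {
  name = "Test ${var.appstack}" # hypothetical tier prefix
}

# Dependency edge that surfaces in PagerDuty's Service Graph.
resource "pagerduty_service_dependency" "stack_uses_app" {
  dependency {
    dependent_service {
      id   = pagerduty_business_service.stack.id
      type = "business_service"
    }
    supporting_service {
      id   = pagerduty_service.app.id
      type = "service"
    }
  }
}
```

The design point is the split of ownership: the escalation policy stays team-managed (hence a data source), while the per-cell service and dependency objects get stamped out by the build.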
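
And, continuing from the previous sketch, the per-appstack Event Ruleset routing described in the same list. Again hedged: the ruleset name, the condition path, and the assumption that the application name appears in the alert summary are all made up for illustration:

```hcl
# One Event Ruleset per appstack: the single "entry" point that gets
# configured as the notification channel in New Relic.
resource "pagerduty_ruleset" "appstack" {
  name = "${var.appstack} ingress" # hypothetical
}

# Added on cell build: route events that mention this application to
# its tech service. The match on payload.summary is an assumption.
resource "pagerduty_ruleset_rule" "route_app" {
  ruleset = pagerduty_ruleset.appstack.id

  conditions {
    operator = "and"
    subconditions {
      operator = "contains"
      parameter {
        path  = "payload.summary"
        value = var.application
      }
    }
  }

  actions {
    route {
      value = pagerduty_service.app.id
    }
  }
}
```

Worth noting that PagerDuty has been steering newer setups toward Event Orchestration, so treat the ruleset resources as one way to express the routing idea, not the only one.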
6 changes: 2 additions & 4 deletions content/posts/service-accounts.md
@@ -29,11 +29,9 @@ So, following this duh moment I realized that lower complexity path of 'one serv
* Save the creds and 2fa in a shared password manager (like 1password)
* Never configure the username/password on anything but the `myteam-ci` system

-Cons:
-* pay per service account with systems like Okta


+## Cons

+* pay per service account with systems like Okta

[1] Let's be honest, most people will never rotate it anyway. There's something to be said for setting it up and having the configured copy be the only existing copy. Recreate to fix if it ever breaks. Minimal leakage potential.

