Skip to content

Commit

Permalink
update wording
Browse files Browse the repository at this point in the history
  • Loading branch information
hazelweakly committed Dec 10, 2024
1 parent 032a33c commit 2a213b0
Showing 1 changed file with 56 additions and 40 deletions.
96 changes: 56 additions & 40 deletions src/blog/the-future-of-observability-observability-3-0.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,13 +5,14 @@ draft: true
---

Observability, so hot right now.
Over the years, we've seen observability go from a word nobody used, to a word everyone used.
We've seen projects come and go, and we've seen technologies emerge out of the ashes, born from the tears of SREs long departed.
Over the years, we've seen observability go from an unknown concept to a ubiquitious phrase that everyone is desperate to stamp on their products.
We've seen projects come, evolve, and die.
We've seen technologies emerge out of the ashes, born from the tears of SREs long departed.
Yet, amongst all of this growth, all of this innovation, one question remains: and then _what_?

You see, it turns out observability is pretty useless, because it doesn't _do_ anything.
You see, it turns out observability is pretty useless because it doesn't _do_ anything.
Not by itself, that is.
Which makes sense! Computers don't do anything until you turn them on, bikes don't go forward unless you pedal them, and raw materials won't turn themselves into a building without blueprints and labor.
Which makes sense! Computers don't do anything until you turn them on; bikes don't go forward unless you pedal them; raw materials won't turn themselves into a building without blueprints and labor.
But observability? Somehow it's this thing that we've been able to grow into a multi billion dollar industry composed entirely out of "get a bunch of data, and then..."

That's it.
Expand All @@ -26,125 +27,140 @@ And then **WHAT**?
Draw the rest of the fucking owl already.

If you remember seeing a previous post I wrote on [redefining observability](/redefining-observability), I brought this up as something I specifically wanted to address in the new definition of observability that I proposed.
In fact, this "and then" problem is a huge reason why my definition is:
In fact, this "and then" problem is a huge reason why my definition is

> A process through which one develops the ability to ask meaningful questions, get useful answers, and act effectively on what you learn.
> A process through which one develops the ability to ask meaningful questions, get useful answers, **and act effectively on what you learn.**
See that? The "act effectively" part.
That, that right there, that's the stuff.
Gimme more of that.
I still don't see nearly enough of that anywhere.
I still don't see nearly enough of that.
Honestly, I don't really see it talked about anywhere, period.

Which is wild, right?
Imagine a sales team that had a motto of "we create sales leads!" and then never talked about the "and then" part.
Imagine a marking department whose mission was "we create impactful campaigns to raise awareness!" and then never bothered to think about what comes after that.
Imagine an engineering organization who aligned around the most beautiful strategy of "we design reliable systems" and then forgot to care about writing the fucking code in the first place.

Seriously? Who would hire those jokers?
Seriously? Who would hire those jokers? Who would trust products produced by that malarkey?

Yet I see so much observability out there that's all about data, gathering data, thinking about data, cleaning data, making systems observable, making systems monitorable, ...
Yet I see so much talk about observability out there that's all about data, gathering data, thinking about data, cleaning data, making systems observable, making systems monitorable, ...

And then what?

Alerts! They cry! We'll get SLOs! SLIs! So many fucking configuration files! We'll instrument your deployments! You'll be able to slice and dice! The whole world's your oyster!

**And. Then. What?**

(Crickets chirp. Angels weep.)

## Observability Through the Ages

Well, let's back up. How'd we get here, anyways?
You can't understand "and then what"
Let's go back in time to when we had observability 1.0, which back then we just called "instrumenting your system so you can unfuck it up later" like a buncha heathens.
Let's back up. How'd we get here, anyways?
You can't understand what's next without knowing where you came from; you can't draw the rest of the owl if you don't know what an owl is in the first place.

Okay, we'll go back in time to when we had observability 1.0 and start there.
Back then we just called it "instrumenting your system so you can unfuck it up later" like a buncha heathens.
We had the three pillars of logs, traces, metrics, and all of their derivatives (y'know, RUM, APM, dashboards, ...).

Life was great, right?

As an engineering executive, all you needed to do and worry about was to implement the three pillars...
And then what?
Eh, don't worry about that, just implement the pillars.
They're capabilities! It's a maturity model, just allocate the headcount, implement the thing, watch number go from zero to mature, and sleep tight knowing that High Impact Awesomesauce is happening.
They're capabilities! It's a maturity model! Just allocate the headcount, implement the thing, watch number go from zero to mature, and sleep tight knowing that High Impact Awesomesauce is happening.

Well, okay, it turns out we need to back up a little bit because that isn't really what happens in reality.
Despite implementing the three pillars, you run into a lot of limitations, all stemming from a central foundational choice: multiple sources of truth, with no ability to correlate them, lead to an inability to ask meaningful questions.
Charity calls this ["making decisions at write time about how you and yoru team would use the data in the future"](https://www.honeycomb.io/blog/one-key-difference-observability1dot0-2dot0) in her nice observability 1.0 vs 2.0 article.
Despite implementing the three pillars, you run into a lot of limitations, all stemming from a central foundational choice: multiple sources of truth, with no ability to correlate them, leads to an inability to ask meaningful questions.
Charity referrs to this this when she talks about how observability 1.0 ends up ["making decisions at write time about how you and yoru team would use the data in the future"](https://www.honeycomb.io/blog/one-key-difference-observability1dot0-2dot0) in her nice observability 1.0 vs 2.0 article.
I like that phrasing a lot, but I personally like to emphasize the "inability to ask meaningful questions" part a bit more than the "write-time decisions" and "existance of pillars" parts.
That's just me, though.

However, I do want to note that these are equivalent ideas: I'm not reinventing anything here.
That might get missed, so I wanna really spell it out.
These three concepts are equivalent.
These concepts are equivalent.
If anything, they're duals to each other; different facets of the same shape:

- Observability 1.0 is defined by pillars
- Observability 1.0 defines the questions you can ask at write time
- Observability 1.0 results in an inability to ask meaningful questions

Now, let's go forward in time a little bit, to where observability 2.0 comes in.
Now, let's go forward in time a little bit to where observability 2.0 comes in.
The thing that defines observability 2.0 is a combination of one data format, one storage location, and one source of truth.
Charity would argue that the structured log events are the data format that is required by observability 2.0; I think anything that lets you build relational data structure on top works.
Which implies that a time series, logs, and any other temporal data structure with metadata works fine.
And while that's... _true_... seriously, use structured log events, you'll hate yourself a whole lot less.
Consequently, that implies that a time series, logs, and any other temporal data structure that can be decorated with metadata works fine.
While, yes, that's _true_... seriously, use structured log events, you'll hate yourself a whole lot less.

Regardless! The big shift here from an _implementation_ standpoint is uniformity in data format, storage location, and source of truth.
And from a capabilities standpoint, you get the new ability to correlate information together.
And from a capabilities standpoint, you get the ability to correlate information together.
Now, rather than defining what questions you ask at write time, you define what correlations you can make at write time.
Huge improvement!
I can't overstate what an improvement this is; it's a game changer.

And again, a lot of these different ideas are all equivalent.
Again, a lot of these different ideas are all equivalent and are really more like different ways to articulate the same point.

- Observability 2.0 is defined by structured wide events as a single source of truth
- Observability 2.0 defines the correlations you can make at write time
- Observability 2.0 results in the potential to ask meaningful questions

Oh hell yeah, now we have the potential to ask meaningful questions.
This is awesome.
See that last one? Now we have the potential to ask meaningful questions.
Oh hell yeah, this is awesome.
There's only one teensy tiny problem I have with this, though: "and then what?"

You can ask meaningful questions now...
You can ask meaningful questions now... But there's still that one last question lingering in the back of the mind.

## And Then What?

Firstly, I'm going to make a somewhat controversial claim in that you can get observability 2.0 just fine with "observability 1.0" vendors; the only thing you need is the ability to query correlations.
Firstly, I'm going to make a somewhat controversial claim in that you can get observability 2.0 just fine with "observability 1.0" vendors.
The only thing you need from a UX standpoint is the ability to query correlations, which means any temporal datastructure, decorated with metadata, is sufficient.
Hell if you hate yourself enough, you don't actually need the temporal part to be a real clock, a logical clock works just fine.

Now, is that hard as fuck with observability 1.0 tooling? Yeah, generally; there's a reason you don't really do that.
I mean, you can [implement your entire backend in brainfuck](https://sourceforge.net/projects/brainfix/) too, but also... _Why_?
I mean, you can _also_ [implement your entire backend in brainfuck](https://sourceforge.net/projects/brainfix/) too, but... _Why_?

The point I'm really making is that the tooling and/or vendor choice(s) don't actually restrict or limit the capabilities you get out of them.
The point I'm really making here is that the tooling and/or vendor choice(s) don't actually restrict or limit the capabilities you get out of them from a purely technical standpoint.
Which, naturally, goes both ways.
I can't tell you how many times I've run into people using observability 2.0 tooling, super modern vendors, really excellent tooling, and getting absolutely zero value out of it.
Slicing and dicing autoinstrumented code with zero manual instrumentation, _wrong_ instrumentation, broken service graphs, disconnected distributed tracing, and every other crime under the sun.
Not only is it quite _possible_ to hold the tool wrong, but damn y'all, I remain fairly convinced that "holding it wrong" is the case in by far the vast majority of observability implementations out there.

So, in comes observability 3.0.
Or, rather, my prediction on what observability 3.0 is going to look like.
Which means, there's gotta be something else here; it's not just the ones and zeroes, because those aren't the thing holding us back as an industry.
So in comes observability 3.0.
Or rather, my prediction on what observability 3.0 is going to look like.
There's a technical component to it, sure, but the main one is social.

Are you ready?
Are you ready? Here are my predictions:

- Observability 3.0 is going to look a _lot_ like a data lakehouse architecture
- Observability 3.0 is going to mostly erase the distinction between pay now / pay later and write vs read time for querying capabilities
- Observability 3.0 backends are going to look a _lot_ like a data lakehouse architecture
- Observability 3.0 will expand query capabilities to the point that it mostly erases the distinction between pay now / pay later, or "write time" vs "read time"
- Observability 3.0 will, more than anything else, be measured by the value that _non-engineering functions_ in the business are able to get from it

The critical difference in observability 3.0 vs 2.0 and 1.0, for me, is that the success of rolling out observability 1.0 or 2.0 relies entirely on how valuable it is for the engineering organizations in the business.
For observability 3.0, that'll still be important, but the success will be mostly defined by how non-engineering functions are able to use it.
That last one is the big one, it's the whole point of the damn thing, and it's the entire reason for the 3.0 instead of calling this a 2.5 or something like that.
The critical difference in observability 2.0 and 1.0 vs 3.0, for me, is that the success of rolling out observability 1.0 or 2.0 relies entirely on how valuable it is for the engineering organizations in the business.
For observability 3.0, that'll still be important, but the success of it will be mostly defined by how non-engineering functions are able to use it.

Remember my definition of observability? Here it is, for posterity:

> A process through which one develops the ability to ask meaningful questions, get useful answers, and act effectively on what you learn.
Observability 1.0 gave us lots of useful answers, observability 2.0 gives us the potential to ask meaningful questions, and observability 3.0 is going to give us the ability to act effectively on what we learn.

In the next few years, I'm going to be looking at observability vendors pretty critically, looking for indicators that they're indexing on the "act effectively" part.
I'm especially going to be looking for vendors to think deeply about how they bring the whole rest of the business in on this too.
There's a lot of things moving underneath the surface that make me think this is going to be a much bigger thing soon.
In fact, if anything, I wouldn't be surprised if this type of thinking defined meaningful innovation in observability for the next few years.

I also don't think it's incompatible with existing vendors!
I also don't think observability 3.0 is incompatible with existing vendors!
You don't have to rip out your existing stack to get observability 3.0 and, honestly, please don't.
Or at least, not yet.
But frankly, if your vendors can't help you deliver meaningful value to the entire business then why are they even there?
That extends to engineering leadership as well.

Overall, I'm excited to see how the observability product offerings get refined over the next few years to become increasingly valuable to those outside of engineering.
But, today, I don't think we do a very good job as an industry of bringing the rest of the business along for the ride, which is why I wrote this post.
If the business can't effectively learn, then there _is_ no "and then what."
Naturally, that extends to engineering leadership as well; if they can't figure out how to turn headcount into business results, exit them.
We've had multiple decades as an industry to figure out how to deliver meaningful business value in a transparent manner, and if engineering leaders can't catch up to other C-suites in that department soon, I don't expect them to stick around another decade.

Back to observability: Overall, I'm excited to see how the observability product offerings get refined over the next few years to become increasingly valuable to those outside of engineering.
Today, I don't think we do a very good job as an industry of bringing the rest of the business along for the ride, which is why I wrote this post, but I don't think there's anything stopping us from doing this once we start treating it as important.
After all, if the business can't effectively learn, then there _is_ no "and then what."

So what is the "and then what", you ask? Effective and collaborative action. Organizational learning. [Collective social learning.](https://osf.io/preprints/psyarxiv/tfjyw) That's what.

Always has been.

0 comments on commit 2a213b0

Please sign in to comment.