Is it time to version observability? (Signs point to yes) | Charity Majors
The DevOps revolution is winding down, and we’re entering the post-DevOps era. We’re at the precipice of a massive generational shift in how we build and understand our software, and engineering teams need to prepare. In the past, telemetry focused on basic metrics: reliability, uptime, MTTR, and MTTD. Observability 1.0. Companies that settle for these basic data points will struggle to keep up with modern development patterns while hemorrhaging money. As engineering best practices around separating deploys from releases, testing in production, and observability-driven development have gone mainstream, the metrics-driven approach to telemetry has stalled, and it’s time for a new version: Observability 2.0. Learn what this new version means for your engineers, and how to embrace this breaking change to:

- Save them from drowning in symptom-based alerting
- Help fewer people work together to build better software
- Create fast feedback loops throughout the entire organization through highly granular visibility into all their systems

About the speaker: Charity is the co-founder and CTO of honeycomb.io, which is bringing observability tooling into the era of distributed systems. She is the co-author of Observability Engineering and Database Reliability Engineering (O'Reilly) and has worked at companies like Parse, Facebook, and Linden Lab. She loves free speech, free software, and single malt scotch.
Transcript
I am so excited to be here. This is my first time in Denmark. I had wanted to come last year, and then I got sick right beforehand, and so my dear friend Abby came instead, which is quite a treat. I love getting to hear Abby speak. But I'm really excited that I could make it this year.

For those of you who don't know me, my name is Charity. I am a co-founder of Honeycomb.io, co-author of Database Reliability Engineering and Observability Engineering. Anybody here happen to have a copy of the Observability Engineering book? You should see me after class, because I brought these sheets of stickers where you can decorate the little wolfy dog with Hello Kitty ears and bows and stuff.

Anyway, all right, down to business. Every company is now a technology company, as Marc Andreessen famously said. Software is eating the world. It means a lot of things, but one of the things it means is that things like engineering efficiency, productivity, and how well we do our jobs are no longer niche concerns. I really loved that lightning talk by the guy from Microsoft. So good! I wrote down a bunch of notes. I can't wait to show my co-founder the recording. We have been talking lately about how observability is explaining technology in the language of the business. I'm old enough that when I was coming up through the ranks, it was not very cool to care about the business. It was all about tech for the sake of tech. There are a lot of reasons to be depressed in the world right now, but one of the reasons not to be depressed is that I think we're really starting to realize that tech is in the service of greater things, right?

Another consequence of this is, whenever I'm giving someone career advice, I tell them: the best thing you can do for your career is find a high-performing team and get on it. High-performing teams are ones that get to spend most of their time working on interesting, novel problems that move the business materially forward. Those are numbers from the DORA report. Not this year's. I should update them.

Part of why this matters is that individuals don't own software. Teams own software. What do we call an individual who owns a service? We call them a single point of failure, right? Teams own software. The team is the smallest viable unit of software ownership.

And how do we build high-performing teams? Obviously, it's by hiring all the ex-Googlers and ex-Facebook people we can get our hands on. No, this is not how we build high-performing teams. The question of how well your team performs is not the same question as how good you are at engineering. Being good at data structures and algorithms is a necessary precursor, but it doesn't mean that you're great at developing. Because your ability to ship code swiftly and safely has less to do with what's up here and everything to do with the socio-technical system that you exist in. You can be the best engineer in the world. You could be Kent Beck. If you join a team that takes two months to ship a single line of code to production, how long is it going to take you? It's going to take you two months. Because the socio-technical system is what defines how quickly, how swiftly, how reliably we can ship our code.

Which is why I think if technical leaders have one job, it's this: constructing and tightening the feedback loops at the heart of your system. When I say technical leaders, I don't just mean managers and directors and VPs.
If you're a baby junior engineer, you're not a technical leader yet. Your job is to become a senior engineer. Once you're a senior engineer, you are a technical leader, or you should be, right? You are someone who is helping. Engineering is not a craft for followers. Engineering is a craft for people who care about building things and fixing things and understanding things and making them better.

When I think of modern software development practices, I think about things like this. Engineers owning their code in production, which includes being on call, but is not synonymous with being on call. Practicing observability-driven development, where you're writing your code, you instrument as you go, you ship your code, and you look at your code through the lens of the instrumentation you just wrote, and you ask yourself: is it doing what I expect it to do? Does anything else look weird?

Testing in production. For those of you who can't see my shirt, it says: "Test in prod or live a lie." I also have stickers that say that. People ask me, I can't sell my manager on letting us test in production, how do I do this? Number one trick: don't use the words "test in production." Call it validating your code in reality or something like that. But the truth is, if you have code, you test it in production. The only question is whether or not you admit it. And the first step towards doing it well is admitting it and investing in the tooling around it, so you can do it safely instead of by the seat of your pants because you don't have any other options.

Separating deploys from releases using feature flags. Continuous deployment. And what all these things have in common is that they're all about fast feedback loops. All about getting your code out in front of users as quickly as possible after you've written it. Because when it comes to software, speed is safety. As embodied human beings, we have this tendency when we get scared or anxious to freeze up and slow down. But that's not good with software. Software is more like ice skating or riding a bicycle, where if you want to do it safely, you have to have speed and regularity and consistency.

Getting code into production fast is the kind of key feedback loop that everything else proceeds from. Because the cost of finding and fixing bugs goes up exponentially from the moment you write them. You type a bug, you backspace, good for you. That's as fast as you can fix it. You find it with your tests, great. Your next best shot at finding the bug is right after you've shipped it to production. Because right after you've shipped it, you've got all of the context right here. You know what you built, why you built it, what you tried that worked, what you tried that didn't work, what the functions are called, what APIs are called. Everything's right up here. And how long does that last? Not very long. I have a bad memory. But even still, I feel like hours maybe, a couple of days tops. For me at least, it does not survive paging another project in and out.

So looking at your code in production is how you close that loop, that feedback loop of: here's what I'm trying to do, here's what it's doing. If you don't find it, it's probably not going to be you that finds the bug. It's probably going to be one of your customers or your users, maybe another engineer. Who's going to fix it? Probably not you. Some other poor sap, weeks, months, years into the future. It's just going to recede into the background of, this is normal, right? There are so many bugs right now that we just think of as normal. Distributed systems all exist in a state of permanent degradation; there are so many things broken in your system right now. And this is a good thing. This is fine. You sleep great at night for the most part. This is actually a phenomenal accomplishment, that our systems can be resilient to so many things.
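Going back to the practice of separating deploys from releases for a moment, here is a minimal sketch of the pattern in Python. Everything in it is hypothetical: the flag name, the environment-variable-backed flag_enabled helper (standing in for a real feature-flag service), and the pricing functions are invented for illustration, not taken from the talk or from any particular product.

```python
import os

def flag_enabled(name: str, user_id: str) -> bool:
    """Hypothetical flag lookup. A real system would ask a feature-flag service;
    a comma-separated environment variable stands in here so the sketch runs."""
    value = os.environ.get(f"FLAG_{name.upper()}", "")
    return value == "all" or user_id in value.split(",")

def legacy_pricing(cart: list[dict]) -> float:
    # The known-good path that everyone gets until the flag is flipped.
    return sum(item["price"] for item in cart)

def new_pricing(cart: list[dict]) -> float:
    # The code you just deployed. It is live in production, but dark.
    return sum(item["price"] for item in cart) * 0.9  # placeholder behaviour

def price_cart(cart: list[dict], user_id: str) -> float:
    # Deploying put new_pricing() on the servers; releasing it is a flag change.
    # Setting FLAG_NEW_PRICING to "all" (or a list of user ids) turns it on
    # without another deploy.
    if flag_enabled("new_pricing", user_id):
        return new_pricing(cart)
    return legacy_pricing(cart)

if __name__ == "__main__":
    cart = [{"sku": "book", "price": 40.0}]
    print(price_cart(cart, user_id="user-123"))
```

Rolling a release forward, widening it, or yanking it back is then a flag change that takes seconds rather than another deploy, which is part of what keeps the loop between writing code and watching it run so short.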
Your ability to move swiftly with confidence is grounded in the quality of your observability. What does this even mean these days? That's a great question.

So, when Christine and I started Honeycomb back in 2016, observability was not a term that was in use in the industry, and we were trying to figure out how to talk about what we were building. That was when I first Googled the term, and I read this definition on Wikipedia. Observability traditionally comes from control system theory, where it's the mathematical dual of controllability, and it means: how well can you understand the inner state of your system just by observing its outputs? I read that, and I just went... I had one of those moments where the world stops for a minute and you're just processing it. It was so cool.

So after that, we spent a couple of years trying to... We made this laundry list: well, this is what it is, and this is how it's different from monitoring. And in 2019, Peter Bourgon came out with this definition that observability has three pillars: metrics, logs, and traces. And a lot of vendors really glommed onto that. They coincidentally had at least three products to sell: metrics, logs, and tracing products. Got added to the Magic Quadrant, blah, blah, blah.

So at this point, there are so many definitions of observability out there that I've kind of declared bankruptcy, and I think of it this way. I think that observability is a property of complex systems, just like reliability or performance or scalability. Which I like, because it puts the emphasis on the system instead of the tooling. But I also don't like it, because then it sort of leaves us back with the whole question we were trying to figure out in 2016, which is: how do we differentiate and distinguish the modern generation of tools that we're trying to build versus the generation of metrics, logs, and traces?

Which is why in the last few months I've started thinking about it in terms of versions. Semantic versioning tells us what? That whenever you're doing a major version bump, it's for breaking, backwards-incompatible changes. I think that's what we see here.

The 1.0 tool universe is one that's grounded in metrics, logs, and traces. For every request that enters your system, it's actually more than three. You're probably capturing some in structured logs, some in unstructured logs, some in metrics, some in traces, but then you've also got your APM tool, you've got your RUM tool, you've got your profiling tools, you've got your product analytics tools, you've got your... The multiplier can get quite large sometimes. And what connects all those tools? Nothing. Not much. Usually it's you, the engineer, who's sitting in the middle going: well, that shape looks like that shape, probably they're the same thing. Or copy-pasting: let's take the request ID from the log tool, paste it into the tracing tool, and cross our fingers that we happened to sample this one in. There's a lot of guessing.
There's a lot of jumping around. So how is this different from 2.0? I think that we're seeing an emerging generation of tools with a single source of truth: arbitrarily wide, structured log events. These are also sometimes called canonical logs. You can also visualize them over time as a trace if you've added span IDs and trace IDs, and because you're packaging all that context in, you can derive metrics. You can derive SLOs. You can derive your dashboards. You can slice and dice on any of those dimensions. You can zoom in. You can zoom out. The reason 1.0 requires you to capture it so many different times is that it's not collecting enough context for you to be able to connect the dots from one to the other.

And the magic of the 2.0 model... I think Honeycomb was the pioneer here, but we're seeing people start to do this using ClickHouse-based tools. We're seeing people do it using DuckDB. A lot of other columnar databases are now available. I think there are a lot of folks out there who are maybe not using this terminology, but I think it so clearly is the future. So that's kind of what I want to talk about here: the old generation of tools and the new generation, why it matters, how to recognize them, how to build them, and so forth. Just a quick overview. These slides are posted online. I'm not going to go into this deeply here, because we're going to go into each one of these things, but at a high level, that's your 1.0, and that's your 2.0.

Now, when we started trying to define observability back eight-ish years ago, we ended up with this laundry list. We were like: there's monitoring, the old way, and observability is the new way, and you only have observability if you have high cardinality, high dimensionality, arbitrarily wide structured log events, the ability to trace. Which, in retrospect, is very obvious why this didn't take off, because who has time to memorize this shit? Nobody. I think it's much easier to think of it this way: either you have many tools, in a 1.0 world, or you have one source of truth, in a 2.0 world.

So let's start first by talking about how the data gets stored. Most of this is pretty self-explanatory, right? You've got your 1.0 world, where every single one of those tools is basically deciding at write time what questions you're going to be able to ask in the future. And one of the beautiful things about Observability 2.0 is that you're declaring bankruptcy. You're like: I'm never going to be able to predict in advance all the questions I'm going to need to ask, so I'm going to store these raw events so that I can make those decisions at read time. Read-time aggregation instead of write-time aggregation.

Ultimately, there are only really three types of telemetry data. There's the metric, unstructured logs, and structured logs. If you look at what people are using out in the industry, 80 percent of all tools are built on top of the metric. RUM tools are built on top of metrics to understand browsers and user sessions. APM tools are built on top of metrics to understand application performance. And the reason for this is that metrics are... I have that sticker too. They're tiny, right? A metric is not a data structure; it's a number. A metric is a number with some tags appended, and you store it in a time series database. You store absolutely no connective tissue. You can't do high cardinality. There's no context. They're extremely limited.
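To make that contrast concrete, here is a minimal sketch of the 2.0 shape described above: one arbitrarily wide, structured event per request, appended raw as a canonical log line, with the aggregates derived at read time. The field names, the NDJSON file, and the choice of DuckDB (one of the columnar engines mentioned above) are all illustrative assumptions, not Honeycomb's implementation; the point is the pattern, not the particular tools.

```python
import json
import random
import time
import uuid

import duckdb  # one columnar engine mentioned above; ClickHouse would work similarly

EVENTS_FILE = "events.ndjson"  # hypothetical raw-event store for this sketch

def handle_request(user_id: str, endpoint: str, cart_size: int) -> None:
    """Do the work, then emit ONE wide, structured event carrying all the context we have."""
    start = time.time()
    status = 200 if random.random() > 0.05 else 500  # stand-in for the real work
    event = {
        "timestamp": start,
        "trace_id": uuid.uuid4().hex,   # with trace/span IDs, the same events can be drawn as a trace
        "span_id": uuid.uuid4().hex,
        "service": "checkout",
        "endpoint": endpoint,
        "user_id": user_id,             # high-cardinality fields are fine: nothing is pre-aggregated
        "cart_size": cart_size,
        "build_id": "2024-05-01.3",
        "feature_flag_new_pricing": True,
        "status_code": status,
        "duration_ms": (time.time() - start) * 1000,
    }
    with open(EVENTS_FILE, "a") as f:
        f.write(json.dumps(event) + "\n")  # append raw; no write-time rollups, no separate metric stream

for i in range(200):
    handle_request(f"user-{i}", random.choice(["/checkout", "/cart"]), cart_size=i % 7)

# Read-time aggregation: decide what to ask after the fact, then slice and dice the raw
# events on any dimension (endpoint here, but it could be user_id, build_id, cart_size, ...).
con = duckdb.connect()
print(con.execute("""
    SELECT endpoint,
           count(*)                                   AS requests,
           quantile_cont(duration_ms, 0.95)           AS p95_ms,
           count(*) FILTER (WHERE status_code >= 500) AS errors
    FROM read_json_auto('events.ndjson')
    GROUP BY endpoint
    ORDER BY requests DESC
""").fetchall())
```

In this model, metrics, SLOs, and dashboards become saved queries over the same raw events rather than separate streams you have to pre-declare at write time.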
And I have been talking shit about metrics for years. But I want to make it clear that it's not because I hate them, and it's not because they don't have any use cases. They do. There are some use cases that metrics are appropriate for. Anytime you're trying to summarize a vast amount of data, metrics are your tool; that's what they're for.

In fact, going all the way back to... how many of you are old enough to remember RRDtool? Yes, a couple of people. So RRD tools were the original metrics tools, where you'd capture your data and then it would age out over time. You would retain the shape of it, but you would say, I'm going to spend five megabytes on this, and then it would be five megabytes forever, even as the data aged out. So metrics are the right tool for summarizing vast quantities of data, aggregating them so that they can cheaply age out. At a certain scale, they're also the only thing you can use, although that scale is typically a lot larger than most people think. Metrics will let you do things like counters that are harder to do with structured data. And they're pretty good for infrastructure. What they are not good at is helping you introspect and ask questions and iterate on your software.

Which is why, to understand our systems and our software, we need to turn to logs. Even unstructured logs are more powerful than metrics, because they preserve some context, but they tend to be very messy and expensive. You really kind of have to know what you're looking for in order to find it. The only thing you can really do is string search, which gets slow.

Next, let's look at who uses this and how. In Observability 1.0, I think one of the most interesting implications is that it's historically been an ops job. Software engineers would write their code and then heave it over the wall for ops teams to instrument it, monitor it, and understand it. I think the idea that half of your engineers will write your code and the other half will understand your code has always been a pretty flawed proposition. But Observability 1.0 has always been very much about MTTR, MTTD, and errors, and downtime, and crashes. In other words, it's about how you operate your code.

And Observability 2.0 includes those things, but Observability 2.0 is much more about how you develop your code. It's about hooking up those fast feedback loops. It's about more than just errors and bugs and downtime and outages. It's about understanding what you've built. It's about understanding how users are interacting with what you've built. It's about understanding what they're doing that you never thought they would do. What are they leaning into? What are they getting creative about in how they use it? There's so much about understanding our software in production that goes beyond just: oh, it's broken, I need to understand it. I think we should want to understand it, not just when it's paging us in the middle of the night. You don't know what good looks like if you're only looking at your software when it's broken. Observability 1.0 is how you operate your code, and 2.0 is about how you develop your code. How you interact with your code in production.

So, in 1.0 land, we would often deploy our code and wait to get paged. Software engineers historically have this idea that they can merge their code and then their job is done. If the tests pass, my job is done, right? No, definitely not. Your job is not done until you know that it's working in production, which means closing that loop by looking at it.
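Continuing the hypothetical wide-event sketch from earlier, closing that loop can be as simple as asking your own instrumentation about the build you just shipped instead of waiting for a page. The events file, the build_id field, and DuckDB are the same illustrative assumptions as before, not a prescribed toolchain.

```python
import duckdb

# Right after deploying, compare the build that just shipped against the one before it.
con = duckdb.connect()
rows = con.execute("""
    SELECT build_id,
           count(*)                                             AS requests,
           quantile_cont(duration_ms, 0.95)                     AS p95_ms,
           count(*) FILTER (WHERE status_code >= 500) * 1.0
               / count(*)                                       AS error_rate
    FROM read_json_auto('events.ndjson')
    WHERE endpoint = '/checkout'
    GROUP BY build_id
    ORDER BY build_id DESC
    LIMIT 2          -- the build just deployed vs. the one before it
""").fetchall()

for build, requests, p95_ms, error_rate in rows:
    print(f"{build}: {requests} requests, p95 {p95_ms:.1f} ms, error rate {error_rate:.1%}")
```

Nothing here waits for an alert; the point is to look on purpose, right after shipping, while the context is still in your head.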
So one of the interesting things, I think, is that... I'm old enough to remember that we used to take pride in not having to look at our code all day. We used to take pride in the idea that your code would let you know when you need to go look at it. And this makes sense when you think about what we were reacting to, because we were reacting to the days of the NOC, the Network Operations Center, where you had poor saps sitting in data centers all day just staring at walls of dashboards. And we were like: well, that sucks. We should be able to automate this. Nobody should have to sit there looking at their telemetry all day. And so we went all the way over to the side of: you shouldn't have to look at your dashboards or your code unless it's paging you. And this, I think, is flawed.

Because part of this whole putting developers on call for their code, making engineers support their code in production, part of the agreement that we make is that it's not just: we've suffered for 20 years and it's your turn to suffer now too. That's not the goal of it, right? The idea is that this is actually the only way that we can make software better: by closing and compressing those feedback loops, so that the person who has the context about what they're building and why they've built it in their head is looking at it in production and asking themselves, is it doing what I expect? You can't break that feedback loop up into two different people. That has to be in one brain.

But part of the contract of making it so that people support their own code in production is this: it was never a good idea to just be page-bombing people, but now that developers are on call, it's really not okay, right? We need to make it so that you're not getting woken up unless users are in pain. You don't page people unless your SLOs are being violated. You don't try and predict, ooh, we think our users are about to be in pain, because you can't. You page someone and wake them up only when users are in pain, according to your SLOs. So we're raising the bar on what is going to alert us, which is a good thing. But most problems, most issues with your code, are not