Why should businesses embrace DevOps? Dan Garfield, technology leader, full-stack engineer and Chief Technology Evangelist at Codefresh, explains.
My name is Lauri and I am the Chief Marketing Officer of Eficode. Not too long ago, we held a hugely popular two-day DevOps 2020 event. We had awesome speakers from around the world, telling stories about DevOps tools and culture to over a thousand people online. Since then, we have made these recordings available on the Eficode website and, due to the popularity of these speeches, we have now also made them available in our podcast.
You can find the links to the video recording and to the materials referred to in the speeches in the show notes. We're also keen to feature topics that you find interesting in the area of DevOps in this podcast. Do let us hear from you on our Twitter, Facebook or LinkedIn pages. Today, we are going to hear from Dan Garfield of Codefresh. Dan is a technology leader and full-stack engineer, specializing in evangelizing containers, Kubernetes, Helm, Istio and related technologies. As Chief Technology Evangelist at Codefresh, Dan leads communication marketing and forward-thinking technology initiatives. Let's get going.
My name is Dan Garfield and I am the Chief Technology Evangelist for Codefresh. We are a company focused on CI/CD, so basically the code delivery loop from the time that you write code to the time you get it deployed. And so, we're super laser focused on that loop and how we make that as fast and as efficient as possible. And it has, obviously, enormous benefits when you get it right. That's what we're going to talk about. Kind of, because of the recent changes in the world, we used to be thinking about how we could just move faster. It was, "Wow, we've got a lot of competition, we've got to move as fast as we can, we've got to be as effective as we can." And now, the questions are really transitioning to, "Well, how do we reduce costs?" And so, that's a slightly different conversation.
I actually want to arm you for both of those conversations and hopefully it puts you in a great position for success, where you can actually make the business case very clearly. There's a common problem with our industry, which is that engineering and the business org at large often don't really speak the same language. They tend to not understand each other, because the engineers want to talk about problems and solutions and kind of the technical challenges and architectures and things like that, and the business really understands just dollars. The fact that you have a very elegant architecture doesn't mean a whole lot to them. They don't really care. They care if you can translate it into dollars, so that's what we're going to try to do. We're actually going to take a bunch of principles of DevOps, talk about different cases and then figure out how we can estimate actual dollar amounts associated with those things.
Now, just a disclaimer with all these things, the calculations that I'm going to share here should be fairly safe. Very thrilled to get a pull request if you think that you've found a little bit better way to do it. I would love to have your input to tell us that maybe we made a mistake. So far no one's found any, so, so far, it seems to be doing pretty well. So take that as a challenge. If you can find one, I'd love to have it.
All right, so let's get into it. We're going to go through four different ways that DevOps provides you value and calculate the dollar values. And we'll talk a little bit about some tools that can help, and I do have a link to an actual page where you can actually put in your own numbers to get your own calculations for all these metrics.
So first off, DevOps is really split into two disciplines. You have productivity engineering and productivity engineers are super focused on how they can make their engineers as effective as possible. Make it so that they can spend as little time as possible on the stuff that isn't just writing code, right? And then you've got your site reliability engineers, who are super focused on making sure that the application stays up, runs correctly. And these are really kind of the two focuses of DevOps, and so those are the two areas that we'll have a focus on.
From a metrics perspective, for productivity engineers, your biggest and most important metrics are things like change lead time. So how long does it take you from the moment that somebody starts committing code, to the moment it's deployed? That window, that's your critical loop that you want to shorten as much as possible. Then you've also got things like deploy frequency, which actually has a value all of its own that we're going to go into.
And then for site reliability engineering, it's all about change failure rates, mean time to recovery, those are a little more straightforward for most people. But from a business standpoint, productivity engineering is really about delivering value faster, which makes the business more money, right? So that's the way that they should understand the value of what's going on there. If they don't understand that value and you're heading into uncertain financial times, that's your job on the line, right? So you want to make sure that you make this business case to make sure everybody understands the value of what you're doing.
So, first off, we're going to talk about avoiding downtime. Now the cost of downtime is incredibly high. Costco had an outage, just last Black Friday, and it cost them about $11 million in losses, but it was like an hour or two, I think. Let's see, and then Amazon, if you think about Amazon, this is based on last year's numbers, they make about $200,000 per minute on amazon.com. So when they went down for 13 minutes, it cost them $2.6 million, which is no joke. So, that's incredibly costly. And Southwest Airlines actually, famously, had an outage a few years ago that cost them $82 million, because it involved people actually missing flights, things like that. So this is incredibly important to get right. So how do you calculate it?
Well, what you basically do is you take the annual revenue of your company and divide it by the minutes in a year, and that's the dollars per minute, and then you just multiply it by the minutes of downtime. Now, some of you might be thinking, "Well, should we really give all of the company revenue to the downtime?" Well, if you have a specific service that's only servicing one section of the business, you can get a little bit more granular. You can say, "Okay, well this product line makes this much money. This is what it costs per minute." So you can get a little bit more granular.
But in terms of estimating the cost of downtime, because you're thinking, "Well, engineering does contribute to revenue, but so does sales, so does marketing." Well, unfortunately it's a three legged stool. So if the engineering part stops working, the whole stool falls over. And then if you're saying, "Well, what percentage of this, what's the cost of this?" Well, the cost is a hundred percent, right? So you don't need to split it up by the three different areas. You just give it a hundred percent cost, because that actually makes the most sense.
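The downtime arithmetic described here can be sketched in a few lines of Python. This is only a rough illustration, not the Codefresh calculator itself; the function names and the 365-day year are assumptions made for the example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def revenue_per_minute(annual_revenue: float) -> float:
    """Average revenue per minute, spread evenly across the year."""
    return annual_revenue / MINUTES_PER_YEAR


def downtime_cost(annual_revenue: float, downtime_minutes: float) -> float:
    """Estimated revenue lost during an outage of the given length.

    Per the talk, the full revenue is attributed to the outage
    (the "three-legged stool" argument), not just engineering's share.
    """
    return revenue_per_minute(annual_revenue) * downtime_minutes


if __name__ == "__main__":
    # Hypothetical $20 million company with a 30-minute outage.
    print(f"Estimated loss: ${downtime_cost(20_000_000, 30):,.2f}")
```

To get more granular, pass the annual revenue of a single product line instead of the whole company.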
And the link I'll give you here at the end is codefresh.io/webinars/making-business-case-devops. I'll put it in the chat, but we actually have these little calculators here, so if you just put in the annual revenue of your company, it will spit out the value per minute of downtime. So if you're a $20 million company or a $2 million company, you can see what the cost per minute of downtime is. I guess we need to support some higher value companies in here. Now, everybody in the business will be like, "Oh, we do five nines of uptime, five nines, five nines." People don't really understand what five nines means. If they did, they would understand how hard it is. Five nines of uptime means you get about five minutes of downtime per year. It's very difficult to achieve that. It's a bit like running the 24 Hours of Le Mans, every day. You have to have a very robust engineering organization to achieve something like that.
And so, if people are saying, "Well, we should have five nines of uptime..." Yeah, I agree, that's great. We need to invest in it just like we would invest in winning a 24 hour race, where everything has to work perfectly all day, every day. They should be making the commitment like that. And this means that if you're Amazon, for example, and we're talking about five minutes a year of downtime, then we're talking about that $200,000 per minute, right? So we're talking about a million dollars a year, if you're doing five nines of uptime for Amazon, right?
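For anyone who wants to check the "five nines" figure, here is a small Python sketch that converts a number of nines into a yearly downtime budget. The function names are made up for this example, and the $200,000 per minute is the Amazon estimate quoted in the talk.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for N nines of uptime (5 -> 99.999%)."""
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


def downtime_budget_cost(nines: int, cost_per_minute: float) -> float:
    """What the yearly downtime budget costs at a given cost per minute."""
    return allowed_downtime_minutes(nines) * cost_per_minute


if __name__ == "__main__":
    for n in range(2, 6):
        minutes = allowed_downtime_minutes(n)
        print(f"{n} nines: {minutes:,.2f} minutes of downtime per year")
    # The talk's Amazon estimate of $200,000 per minute, at five nines.
    print(f"Budget cost: ${downtime_budget_cost(5, 200_000):,.0f}")
```

Five nines works out to roughly 5.26 minutes per year, which is where the "five minutes" and "about a million dollars a year" figures come from.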
So for your company, it's going to be a little bit different, but if you set a goal and you say, "Okay, listen, we're going to revamp the way that our architecture works. We're going to move to Kubernetes and we're going to have some better failover in case services go down, better hot restarts and things like that, we think we can cut 10 hours of downtime per year." Well, now you can just multiply that by the downtime cost and you actually have a pretty good estimate of what the return is going to be. So if it's going to cost you a million bucks of work to do that, but you're going to save 10 million bucks a year in downtime, the business is going to be very clear like, "Oh, okay. So I give you a million dollars and I get nine back? Deal. I get 10 back, right? Deal." So that's easy for them to understand, and you really need everything to be working amazingly well together to get to five nines of uptime.
That means it's not just about the architecture or the services, your continuous integration needs to be brilliant. Your integration testing needs to be brilliant. Your continuous delivery needs to be brilliant. And then your infrastructure needs to be brilliant. It's not just the infrastructure, it's actually your entire code delivery process, so that you can avoid downtime during changes. Most downtime actually isn't because of hard drive failure. Most downtime is because somebody isn't following a change practice where everything's written as code, and they fat-finger a configuration and apply it without having to go through a CI or CD process, and then this causes downtime, right? In terms of the push and pull of DevOps, you really are pulling between that code delivery side and the uptime, and the whole discipline of DevOps is to balance out those two things.
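The pitch above boils down to comparing avoided downtime cost against project cost. A minimal Python sketch, with hypothetical function names and a made-up cost per minute in the first usage line:

```python
def downtime_savings(hours_avoided_per_year: float,
                     cost_per_minute: float) -> float:
    """Yearly savings from avoiding the given hours of downtime."""
    return hours_avoided_per_year * 60 * cost_per_minute


def net_first_year_benefit(yearly_savings: float,
                           project_cost: float) -> float:
    """Savings in the first year minus the one-off project cost."""
    return yearly_savings - project_cost


if __name__ == "__main__":
    # Hypothetical: 10 hours avoided at $1,000 per minute of downtime.
    savings = downtime_savings(10, 1_000)
    print(f"Savings: ${savings:,.0f}")
    # The talk's round numbers: spend $1 million, save $10 million.
    net = net_first_year_benefit(10_000_000, 1_000_000)
    print(f"Net benefit: ${net:,.0f}")
```

The cost per minute would come from the revenue-per-minute estimate for your own company or product line.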
So this actually is going to get us into the second value, which is deploying more often. When you're doing quarterly releases, on the right-hand side here, you have huge high risk changes. So the QA on these things is enormous. And many of you remember the way that this used to be, and maybe many of you still work in organizations where it's still like this, where you do releases once a quarter or once a month. And those releases have to have everybody's changes packed in, they all have to be working together. Now, if there's something wrong with that release, the surface over which things could be broken is very large, because there are so many changes, so those changes actually should be considered high risk. Now, if you're making small changes all the time, it actually reduces your risk.
And there's a good old saying, which is if you want to be good at something, do it more often. So that's the idea behind daily releases. You can actually reduce your risk of downtime by releasing more often. And of course, in order to release daily, you have to have good processes. When you only release once a quarter, you don't have to have good processes, you can have crap processes and just do everything manually. Daily releases will enforce that discipline, so it's something you want to get to.
All right, delivering faster. And I'll deliver this talk faster, because I know time is of the essence. I have this quote from someone that I think a lot of, who said, "Undelivered code has the same value as unwritten code." Now, you might think that, well, if it's written, I can always deploy it. And that's true. But every piece of code that you write is meant to deliver business value, right? Your company is going to make more money because you deploy this code change. If you didn't think it was that way, you wouldn't write that code in the first place. There's no way your product organization is going to give you a project for something that they don't think has value. And if you're fixing bugs, that's going to help your company make more money, because the software's more and more reliable, people are going to be able to transact more, things are going to work better, right?
So we actually did a case study with a company called Steelcase. And I don't know if any of you know who Steelcase is, but Steelcase made $3.4 billion last year. So every commit that they make and that you make, you usually don't know the dollar amount, but you could actually put it in. And sometimes you do know. There was a case study I worked on with someone else where they had a single change that was going to make the company an extra $10 million. And they knew that it was going to make an extra $10 million. Now you don't necessarily know the value of each individual one, but you can kind of make an estimate of it. And if you can take time off of how long it takes for you to deliver that code, that means you actually gain the value of that change faster, which means the company actually is going to make more money, because you can deliver it faster.
So think of it like this. If you have a new component that makes the company an extra $10 million a year, if you deliver it three days faster, that means you actually start getting that value three days faster and you actually make more money total, right? Because it's in effect for a longer period of time of the year. The other comparison I like to make is that undelivered software, it's like inventory that's sitting on a shelf in a store. It's not making you any money. Once it sells, then it makes you money. If it's sitting on the shelf, that actually costs you money. So the longer a code change takes to get out, usually it actually costs money because people are reviewing it, checking in on it, that kind of thing.
So what Steelcase did, they make about $3.4 billion a year in revenue, and they took their code delivery cycle from three days to three hours. So we actually could cut days off of the code delivery cycle here. And in this case, if about $300 million of value for the company was delivered per year, then cutting just three days means that the company actually makes 2.5 million extra dollars because they deliver the code faster. And this is the way the math works. Basically you take the dollar amount of new business value from engineering per year. You multiply that by the days or hours saved, and divide by the days or hours in a year, whichever unit you're using. And that's how you get it.
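The Steelcase figure can be reproduced with a short Python sketch. The function name is an assumption for this example. The formula (yearly value times time saved, divided by the same unit of time in a year) is the reading that matches the quoted $2.5 million.

```python
DAYS_PER_YEAR = 365
HOURS_PER_YEAR = 365 * 24


def value_of_faster_delivery(annual_new_value: float,
                             time_saved: float,
                             units_per_year: float = DAYS_PER_YEAR) -> float:
    """Extra value captured by shipping `time_saved` units sooner.

    `time_saved` and `units_per_year` must use the same unit
    (days with DAYS_PER_YEAR, hours with HOURS_PER_YEAR).
    """
    return annual_new_value * time_saved / units_per_year


if __name__ == "__main__":
    # Figures quoted in the talk: $300 million of yearly value, 3 days saved.
    gain = value_of_faster_delivery(300_000_000, 3)
    print(f"Extra value: ${gain:,.0f}")
```

With those inputs the result is a little under $2.5 million, which lines up with the number given in the talk.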
Now, if you're not sure how much value you delivered as an organization, there's a very rough rule of thumb: you can take the total cost of all your engineering salaries and multiply that by eight. Now, that multiplier varies widely. There are plenty of companies where that multiplier is only like two or like 1.5. And there are companies where it's more like 15. So it does vary widely. This is kind of a rough estimate, so if you plug this in and you're like, "Hey, we deliver more business value than the company makes in a year." Well, okay, your multiplier is probably not that high then, right?
But that's a good kind of rule of thumb. And that's kind of what you're shooting for as a business anyway. Easy to put in, you just put in the annual business value and then the time saved delivering software. And most often I see companies, when they're adopting DevOps, the big change that they're making is actually in days and not in hours, is how they're measuring it. I mentioned that we were working on this code delivery loop. So this change lead time is what we're actually trying to shorten. And if you can do that over time, that has enormous value to the business. It also has a huge psychological value to your engineering organization, which also has a value all of its own that you could probably translate to dollars if you make the effort.
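As a rough sketch of that rule of thumb in Python: the default multiplier of eight and the sanity check against revenue both come from the talk, while the function names and sample figures are made up for illustration.

```python
def estimate_business_value(total_engineering_salaries: float,
                            multiplier: float = 8.0) -> float:
    """Rule-of-thumb yearly business value delivered by engineering."""
    return total_engineering_salaries * multiplier


def multiplier_is_plausible(estimated_value: float,
                            annual_revenue: float) -> bool:
    """Sanity check: the estimate should not exceed company revenue."""
    return estimated_value <= annual_revenue


if __name__ == "__main__":
    # Hypothetical company: $5 million in salaries, $30 million in revenue.
    value = estimate_business_value(5_000_000)
    print(f"Estimated value: ${value:,.0f}")
    if not multiplier_is_plausible(value, 30_000_000):
        print("Estimate exceeds revenue; try a lower multiplier.")
```

Since the multiplier can range from about 1.5 to 15 according to the talk, it is worth running the estimate with a few values rather than trusting a single number.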
All right, three, saving engineering time. So if you can cut build times, and we'll go into a case study on this one here. If you can cut build times, builds often derail engineering, right? The automation process, "Okay, I got to start a build," and then they go and play ping pong or they go take a break or whatever. And sometimes they would have taken those breaks anyway, but oftentimes because builds take a long time, they end up taking longer than they would have. So to calculate this, all we do is we take the number of engineers, the average amount that engineering makes per minute, and we do usually want to include all-in costs. So lots of times, you'll just take salaries, but you actually want to include, usually there's about a 20% overhead to employ somebody, like to pay for healthcare. That's a US joke for you. To pay for the building costs, their desk, their equipment, whatever. It's usually about 20% over their salary.
So you want to include all of that and you want to take the average number of builds that are done per person per day, and then multiply that by the number of minutes that you're going to save by making a change. And if you have a thousand engineers and you cut five minutes off of your builds on average, that's $8 million in time that you just gained, so it's incredibly valuable. And in fact, oftentimes you can actually make much larger gains than five minutes. We did a case study with Hover, and Hover actually, Ethan here gave a talk at the last CodeCon and in their talk they actually showed how they went from two and a half hours in their builds to 12 minutes. Now they don't have a thousand engineers in their organization, but they do have, I think, about 30 or 40. So making that calculation for them, definitely a no brainer that they were saving a lot of money.
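Here is one way the build-time calculation could be sketched in Python. The 20% overhead is from the talk; the working days per year, hours per day, and all sample inputs are assumptions. The talk does not give the salary or build counts behind its $8 million figure, so this sketch does not try to reproduce it.

```python
def cost_per_minute(avg_salary: float,
                    overhead_rate: float = 0.20,
                    work_days: int = 260,
                    hours_per_day: int = 8) -> float:
    """All-in cost of one engineer per working minute."""
    all_in_cost = avg_salary * (1 + overhead_rate)
    return all_in_cost / (work_days * hours_per_day * 60)


def annual_build_time_savings(engineers: int,
                              avg_salary: float,
                              builds_per_day: float,
                              minutes_saved_per_build: float,
                              overhead_rate: float = 0.20,
                              work_days: int = 260,
                              hours_per_day: int = 8) -> float:
    """Yearly value of engineering time recovered from faster builds."""
    per_minute = cost_per_minute(avg_salary, overhead_rate,
                                 work_days, hours_per_day)
    minutes_per_year = (engineers * builds_per_day
                        * minutes_saved_per_build * work_days)
    return minutes_per_year * per_minute


if __name__ == "__main__":
    # Hypothetical: 35 engineers, $120k salary, 4 builds/day, 10 min saved.
    saved = annual_build_time_savings(35, 120_000, 4, 10)
    print(f"Time recovered: ${saved:,.0f} per year")
```

The estimate is sensitive to builds per person per day, so it helps to pull that number from your CI system rather than guess it.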
And so, just a reminder, here's the math on this. And I think the slides are going to be available after this, so you don't need to worry too much about it. And again, I'll link to the calculator on the site after this. But just to show you what it looks like, it gives you an overhead option and you can put in your average salaries and get the total amount of savings expected.
All right, number four, reducing overhead costs. If you go cloud native, if you're adopting something like Kubernetes, it's going to allow you to be a lot more efficient with your infrastructure. Most infrastructure on average is spending about 30% of the cost on idle machines. And that's something you can really reduce with something like Kubernetes. You can use spot instances much more easily. You can do auto scaling. You can do tighter packing on the nodes that you do have. So cloud native is incredibly valuable. And a lot of times people are saying to me, "Hey, in these uncertain financial times, I'm getting pushback about adopting something like Kubernetes." So you really need to pitch it not in terms of how cool the infrastructure is, you need to say, "Hey, this is going to actually help us reduce our infrastructure cost, because we can do these things."
Now, one of the big questions with this is build versus buy. So we talked about how you can make different calculations so that you can justify different projects that you want to do, or maybe just justify the team size that you have. There is a classic question about build versus buy. And right now people are saying, "Well, maybe I should just build this thing, because it's cheaper." Sometimes that's true. Sometimes it's more expensive. This calculator is at baremetrics.com/build-vs-buy. It's my favorite one, so that's what I recommend to use.
This is another case study where this customer actually was using some hand-rolled deployment tools built on top of Jenkins. They had two engineers that had worked on creating their hand-rolled tooling, and then they needed to maintain it. And they were trying to decide if they were going to spend $47,000, as a good-sized enterprise company, on a Codefresh license. Now their cost to build it was only about $68,000, but the cost to maintain it was about $200,000 a year, because of the percentage of the engineers' time that they were going to have to spend keeping that up and running. So in this case, they basically saved almost a quarter million dollars by deciding to buy a tool off the shelf to help them accomplish their objectives, rather than hand rolling it. So check out that calculator, it's very useful.
So in summary, don't be afraid to try to put cost to paper. As engineers, lots of times we get caught up in the idea that like, "Oh, is this the right number? Is it like this? Is it like that?" It's okay, we're going to put an estimate on there and we can have a conversation around that, because if you don't have an estimate, it's very hard to have a conversation with the business org at large that's going to make any sense to them.
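The build-versus-buy comparison in this case study can be sketched in Python as a first-year cost comparison. This is not the Baremetrics calculator. The function names are hypothetical, and reading "almost a quarter million" as first-year build plus maintenance cost minus the license cost is an assumption, since the talk does not spell out the arithmetic.

```python
def first_year_cost_to_build(build_cost: float,
                             annual_maintenance: float) -> float:
    """One-off build cost plus the first year of maintenance."""
    return build_cost + annual_maintenance


def first_year_savings_from_buying(build_cost: float,
                                   annual_maintenance: float,
                                   annual_license: float) -> float:
    """How much cheaper buying is than building, over the first year."""
    build_total = first_year_cost_to_build(build_cost, annual_maintenance)
    return build_total - annual_license


if __name__ == "__main__":
    # Figures quoted in the talk: $68k build, $200k/yr upkeep, $47k license.
    savings = first_year_savings_from_buying(68_000, 200_000, 47_000)
    print(f"First-year savings from buying: ${savings:,.0f}")
```

If the build cost is already spent, only the yearly maintenance versus the yearly license matters, and the gap narrows accordingly.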
DevOps is just incredibly valuable as an investment for your business. So you want to speak that language that they're going to understand. Use the equations that we've shared. I'd love to have your feedback. Again, that's the link to the calculator there. And just some related resources that I'd recommend there. Some talks about CI/CD pipelines for microservices. These are on codefresh.io/events. We have a ton of these.
So with that, I will close and we can open questions and play the video and I'll watch the questions over chat. Thank you.
That was Dan. Be sure to check out his slides and calculators from the DevOps 2020 videos and the show notes, they are really worth it.
Next time we are going to hear from none other than Kohsuke Kawaguchi, creator of Jenkins and co-CEO at Launchable. Kohsuke is passionate about developer productivity. Early on, as a developer, he built numerous open source projects, most notably the award-winning Jenkins. We surely will see you next time. Until then, don't be afraid to put costs to paper.