Not too long ago we held a hugely popular two-day DevOps 2020 event. Due to the popularity of the speeches, we have now made them available in our podcast as well. Kohsuke Kawaguchi is the creator of Jenkins and co-CEO of Launchable. He is passionate about developer productivity. Early on, he helped this cause as a developer, building numerous open-source projects, most notably the award-winning Jenkins.
Welcome to the DevOps Sauna. My name is Lauri and I am the chief marketing officer of Eficode. Not too long ago, we held a hugely popular two-day DevOps 2020 event. We had awesome speakers from around the world telling stories about DevOps tools and culture to over a thousand people online. Since then, we have made these recordings available on the Eficode website, and due to the popularity of these speeches we have now made them available in our podcast as well.
You can find the links to the video recording and to the materials referred to in the speeches in the show notes. We are also keen to feature topics that you find interesting in the area of DevOps in this podcast. Do let us hear from you on our Twitter, Facebook, or LinkedIn pages.
Today is a big day. We are going to hear Kohsuke Kawaguchi's speech from the DevOps 2020 event. Kohsuke is passionate about developer productivity. Early on, he helped this cause as a developer, building numerous open-source projects, most notably the award-winning Jenkins. Kohsuke's topic today is data-driven DevOps: the key to improving speed and scale.
I'm probably best known as the guy who did Jenkins, because it is an open source project that a lot of you have probably heard of. It's got this massive adoption around the world, and I'd like to think it's been helping teams everywhere. I also used that to start CloudBees, which was a business around Jenkins, at least at one point. Now it's a much bigger company than that.
But at CloudBees, I've been a part of helping enterprises everywhere do DevOps and digital transformation better. And then recently I switched gears once again and started working on this new small company called Launchable. It's around smarter testing and faster DevOps, and I'll touch a little bit on what I'm doing later, because naturally it's aligned with what my passion and interest have been, and that will be the topic of today.
I think if you look back even just a few years, automation has come a really long way. In fact, when I started engineering, a nightly build was still a new thing and builds were done by actual people, the engineers. So it started with something simple like that, then just the idea of running tests nightly, then continuous integration, continuous delivery, and so on and so forth, and all this automation got a lot more sophisticated and much bigger. So I'm pretty sure many of you today have some form of automation, perhaps like this: the build and the tests and the deployment and so on. And I hope you see yourself there, but I'd actually contend that you might think this is what you have, when in reality, what you have is more like this. It just doesn't fit into this nice picture with clean lines; there are so many processes here and there. Some of them you probably haven't even noticed, because they're run by other teams.
When we talk to people doing software development, this is the kind of world they live in. In some sense, all of these complicated pieces are automated, or at least scripted, so only a small part is driven by humans. But because it's so widespread and spread across so many different parts, it's often difficult to make sense of what's going on, right? It's got this feeling of trying to watch how individual bees behave in a beehive and use that to understand the bigger picture, the hive behavior.
Well, that's simply not possible. I often feel like that when I show up at some workplace and people show me their Jenkins systems and all of these jobs. Each of them is clearly doing something; I just don't get the bigger picture of what's truly going on. So after seeing so many of these companies, I started feeling that there are two kinds of companies, or teams, here. One is donkeys, the other is unicorns. They kind of look alike, but they are very different. And I started wondering, naturally: what makes the difference?
One of the hypotheses that I'm coming down to is that one of the critical differences seems to be how they use data to drive the process itself. And what do I mean by data? In some sense, if you think about it, all these automations, all these individual boxes, are producing lots of data, but we are not really using it, aside from the times when something fails and you have to look into it. And it's starting to feel like a wasted gold mine.
I don't know if this is true or not, but I've read somewhere that e-waste, like old computers thrown away, contains by volume much more gold and precious metals than gold ore. So the argument was that it might actually make sense to mine the e-waste as opposed to the actual gold ore. I feel like the data we produce out of this automation has a similar potential value that's currently not utilized, basically just wasting space in the trash.
Let me look at some concrete example cases. So imagine a company that has hundreds of engineers working on tons of projects, and this company has one central Jenkins infrastructure that's doing everything from builds on up, and they are spending hundreds of thousands of dollars in cloud costs to get all this build and test execution going. And that is fine in the Silicon Valley startup world; cost is not necessarily the top concern.
They can always come back and put that under control later, or so the thinking goes. But when this company got closer to IPO, the CFO had to get the spending in order. And he started noticing: all this money we're spending, exactly what is it used for? How valuable is it? Is there any way to cut back costs? Those are all natural questions, and given what's going on in the world today, which is forcing me to give this talk from my home, I have a feeling that many of you will start hearing these kinds of questions in the coming days, if you haven't already.
In this case, well, it turns out that the company really was clueless. If you are involved in this kind of build infrastructure and you have so many different workloads coming in, it's hard to know which of them is costing all the money, or what they're being used for, let alone whether they need to be contained or not. So what should have happened was maybe to provide some level of visibility into the cost at the project level. In this Jenkins instance farm, the central DevOps team provided three kinds of VM instances as templates, simply small, medium, and large. So the instinct of the developers on a project was natural: they might start with small, but as soon as something doesn't work out, like a flaky test or a build taking too long and getting frustrating, they bump it up to a bigger instance, and then there's simply no incentive to bring it back down to the smaller instance or look at the problem, because the economic signal is lost.
So what should have happened is, if Jenkins had a way of making the workload owners aware of the trade-off that they're making in terms of time gained versus cost, that would help them pick the right thing. If it's a PR validation, people might choose to value a faster build, because like the previous speaker said, people waiting has a lot of cost. But if it's something like a nightly execution, maybe it's okay to go for the smaller instance and take a longer time. And depending on the activity level of the project, these choices could also evolve over time.
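As a rough sketch of surfacing that economic signal, a trade-off report could look something like this. The hourly rates, speed-up factors, and build time below are hypothetical placeholders, not numbers from the talk:

```python
# Hypothetical per-size hourly rates and relative build speed-ups.
RATES = {"small": 0.10, "medium": 0.40, "large": 1.60}   # $/hour (made up)
SPEEDUP = {"small": 1.0, "medium": 2.0, "large": 4.0}    # relative to small

def trade_off(base_build_hours: float) -> dict:
    """For each instance size, report the build time and the dollar cost,
    so a workload owner can see what they trade when they bump sizes."""
    report = {}
    for size, rate in RATES.items():
        hours = base_build_hours / SPEEDUP[size]
        report[size] = {"hours": round(hours, 2), "cost": round(hours * rate, 2)}
    return report

# A build that takes 2 hours on the small instance:
for size, row in trade_off(2.0).items():
    print(size, row)
```

Making a table like this visible next to the instance-size picker is the whole point: the developer sees that "large" is faster but not free.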
So this is a very simple example of providing data that the system should already have, driving the right behavior, and making your organization just a little bit smarter. Right? And it's not that hard. That's what I mean: there's lots of low-hanging fruit like that. Here's another example where I felt the data could be put to good use. Now imagine a little bigger company: thousands of engineers working on a massive piece of embedded software, and here also they have one DevOps team that runs all the infrastructure. What happens here is, because they have such a massively parallelized build and test farm going, when a build fails, there can be two reasons.
One is a failure in the application or the tests, in which case the product engineers need to be notified. But sometimes these failures happen because of an infrastructure problem, whether it's a Gerrit server dying, or the Git repository being inaccessible, or the database or the test environment acting up. In those cases, it tends to create this massive failure across the board, because every build and every test dies on it. And as an infrastructure engineer on the DevOps team, you don't want app developers to be bothered by these failures that they can't act on.
So there's this desire to send the notification to, quote unquote, the right place. How can you do that? In fact, multiple teams that I talked to solved this problem independently, in their own ways. One team deployed the simplest tool known to software engineers: the regular expression. What this team has done is to look for common failure patterns, like in the stack traces or the log messages, and if the failure matches one of those patterns, the notification goes to, say, the DevOps team.
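A minimal sketch of that regular-expression approach could look like this. The patterns and team names are hypothetical examples, not the team's actual rules:

```python
import re

# Hypothetical infrastructure-failure signatures: if a build log matches
# one, route the notification to the DevOps team instead of the app team.
INFRA_PATTERNS = [
    re.compile(r"Connection refused", re.IGNORECASE),
    re.compile(r"No space left on device"),
    re.compile(r"java\.net\.UnknownHostException"),
]

def route(build_log: str) -> str:
    """Return which team should be notified for this failure."""
    for pattern in INFRA_PATTERNS:
        if pattern.search(build_log):
            return "devops-team"
    return "app-team"

print(route("ERROR: connection refused by gerrit:29418"))  # devops-team
print(route("AssertionError: expected 3, got 4"))          # app-team
```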
Another, more sophisticated team, and I believe this is actually a team, not a whole company, deployed a Bayesian filter, a statistical tool originally used for spam filters. The idea is they train this Bayesian filter by saying: this failure should go to the DevOps team, this failure should go to our team. And if you do that enough times, the filter itself starts to pick up the cues and starts doing the routing. And the beauty of it is, when they deliver these email notifications based on the guesses the Bayesian filter made, there was this button that says Not My Problem. So if an app developer incorrectly got emailed, they can press the Not My Problem button, and that teaches the filter that it has misclassified. So it was a very nice feedback system.
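In the spirit of that spam-filter idea, here is a toy naive Bayes router with a Not My Problem feedback hook. Everything here, class names, labels, and training logs, is made up for illustration; it is not the team's system:

```python
from collections import defaultdict
import math

class FailureRouter:
    """Tiny naive Bayes classifier in the spirit of a spam filter:
    route each failure log to 'devops' or 'app', and learn from
    'Not My Problem' feedback by retraining on the corrected label."""

    def __init__(self):
        self.word_counts = {"devops": defaultdict(int), "app": defaultdict(int)}
        self.label_counts = {"devops": 0, "app": 0}

    def train(self, log: str, label: str) -> None:
        self.label_counts[label] += 1
        for word in log.lower().split():
            self.word_counts[label][word] += 1

    def classify(self, log: str) -> str:
        total = sum(self.label_counts.values())
        scores = {}
        for label in ("devops", "app"):
            # log prior + log likelihoods, smoothed over an assumed vocab size
            score = math.log((self.label_counts[label] + 1) / (total + 2))
            seen = sum(self.word_counts[label].values())
            for word in log.lower().split():
                score += math.log((self.word_counts[label][word] + 1) / (seen + 1000))
            scores[label] = score
        return max(scores, key=scores.get)

    def not_my_problem(self, log: str, wrong_label: str) -> None:
        # The "Not My Problem" button: retrain with the other label.
        other = "app" if wrong_label == "devops" else "devops"
        self.train(log, other)

router = FailureRouter()
router.train("connection refused to gerrit server", "devops")
router.train("assertion failed in checkout test", "app")
print(router.classify("gerrit server connection timed out"))
```

The feedback button is what makes this work in practice: each misrouted email becomes a labeled training example.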
I'm sure it doesn't sound like a big deal, right? I mean, this isn't building a massive system. But even something simple like this has a substantial impact on the productivity and the credibility of the DevOps team. And to me, that's another example of data being used to improve software development productivity. So as I was thinking about these examples, and this is also around the time that I became CTO at CloudBees, what I realized was that using data effectively is incredibly important at the organizational level, too. I'm sure many of you have this frustration: you know the right things that need to be done to improve your software delivery process, but you're struggling to get that communicated to your boss, or rather to the organization around you, and it just does not get prioritized.
As a leader, I think data makes this job easier: convincing the organization that this is the right effort for its benefit. Data and the associated story really help your boss and the stakeholders see the problem that you see. I think we software engineers often have this tendency to rely on belief and a common value system and sort of skip the argument, right? Many of the things we do, whether in DevOps or in testing, are often difficult to quantify. So we tend to say, oh, those are good things, self-evidently good things, and we tend to stop there. But of course, people who don't come from the same background and don't see the reality as clearly as you do are not going to get it.
And data, I think, has been the common language of business for the longest time, and it can bridge that gap. Second, data also helps you direct the effort to the right place. I mean, we all need to be humble: where we think we make a difference could also be wrong. So in some sense, it's homework to verify that what you think needs to be done actually needs to be done. And finally, many of these efforts, especially in DevOps, tend to take a long time to start making an impact. So data can also help you show the impact of your work before it starts turning into more tangible results. And that helps stakeholders feel better that their investment is getting a return.
That in turn enables the continuity of the investment and increases your credibility, so that when you need to take the next step and ask for the next investment, it comes a little more easily. So I think in many ways data plays a key role, and I think we cannot underestimate its importance. If you think of software development as a kind of factory, and I know this has all been done before, essentially we turn ideas and thoughts into functioning zeros and ones that run on the computer, and the in-between is what we can consider the software factory. So the picture that we need to get to is really the notion that we observe what's happening in this software factory, and then use that information to feed continuous improvement and learning. In the manufacturing domain this is often called kaizen, or lean. That's one of the proud traditions that come out of the country I come from.
So that's kind of it, right? These aren't groundbreaking, earth-shattering ideas. It's been said and done in many places before. But if we look at software development from this perspective of continuous learning and improvement, it makes sense that some people started applying machine learning, or statistical approaches like the Bayesian filter I mentioned, to this kind of problem. And this is where I see the somewhat more cutting-edge efforts happening. The one thing that immediately blew my mind was smarter testing. For this discussion, imagine a big company. You're in the DevOps team. The business has been hugely successful, so you have a massive, but modularized, code base.
So it's not complete spaghetti; it's been well taken care of. It's just so big. And more importantly, because this project has been worked on by so many people over such a long time, it has amassed a lot of big, time-consuming tests, collectively in the range of millions. At this scale, the challenge becomes the time and the cost it takes to build and test this software. So the team was trying to cut down the cost and reduce the time it takes for developers to get feedback on their change. The first step of their work was somewhat easier, at least for me to think about: it's dependency analysis. Imagine you have files at the bottom of this diagram, with multiple files getting grouped into modules, which are the squares in the middle, and then these modules are covered by tests.
The orange circles represent the tests. Most build tools have this dependency information within them, so you can analyze it and run inference on top: you can statically determine that when these files have changed, these are the modules that need to be built, and these are the tests that should be run because they might be impacted. So as the first step, they did this kind of work. As far as I know, when I go see teams, many teams aren't even doing that. People are generally just running whole fresh builds from the start, redoing everything from scratch, and that's adding a lot of dead time. So the fact that this big company went through step one is great, but what's truly amazing is they didn't stop there, because the scale is so big that this simply wasn't enough.
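Step one, the static selection, can be sketched roughly like this. The file, module, and test names are invented for illustration:

```python
# Hypothetical dependency graph: files map to modules, modules map to tests.
FILE_TO_MODULE = {
    "src/auth/login.c": "auth",
    "src/auth/token.c": "auth",
    "src/net/http.c": "net",
}
MODULE_TO_TESTS = {
    "auth": {"test_login", "test_token"},
    "net": {"test_http", "test_retry"},
}

def affected_tests(changed_files):
    """Walk changed files -> modules -> tests that might be impacted,
    so only those tests need to run for this change."""
    modules = {FILE_TO_MODULE[f] for f in changed_files if f in FILE_TO_MODULE}
    tests = set()
    for m in modules:
        tests |= MODULE_TO_TESTS[m]
    return tests

print(sorted(affected_tests(["src/auth/login.c"])))  # ['test_login', 'test_token']
```

In a real build tool the two maps come from the build graph itself rather than hand-written dictionaries.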
As the next step, they did what they call predictive test selection. The idea is they train a machine learning model that predicts a useful subset of the tests to run. Based on the information about the changes that came in, the model predicts: okay, let's run this subset, because from the historical behavior, we think these are more likely to catch regressions. And they can get that kind of historical behavior because their system is processing on the order of ten to the power of five, so that's hundreds of thousands, of changes per month. Is that right? Did I get that math right? On a sample, about 1% of that, they run the full build and the full test suite to train the model. And this had a remarkable impact. They reported that the model was able to select about a third of the tests and only miss about 0.1% of the broken changes.
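A toy stand-in for the idea, emphatically not their actual model, is to score each test by how often it historically failed when a given file changed, and keep the top-scoring subset. The history data below is made up:

```python
from collections import defaultdict

# history: list of (changed_files, failed_tests) pairs from past changes
# (hypothetical data standing in for the 1% sample used for training).
history = [
    ({"parser.c"}, {"test_parse"}),
    ({"parser.c", "lexer.c"}, {"test_parse", "test_lex"}),
    ({"net.c"}, {"test_http"}),
]

# Count how often each test failed when each file was touched.
fail_given_file = defaultdict(lambda: defaultdict(int))
for files, failed in history:
    for f in files:
        for t in failed:
            fail_given_file[f][t] += 1

def select_tests(changed_files, budget):
    """Pick the `budget` tests most likely to fail for this change,
    based on historical co-occurrence of file changes and test failures."""
    scores = defaultdict(int)
    for f in changed_files:
        for t, n in fail_given_file[f].items():
            scores[t] += n
    return sorted(scores, key=scores.get, reverse=True)[:budget]

print(select_tests({"parser.c"}, budget=1))  # ['test_parse']
```

The real systems use richer features and a trained model, but the shape is the same: history in, ranked test subset out.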
So with a minimal loss of effectiveness, they were able to cut the test cost in half. And you can imagine, in a big company like this, that's serious money savings, not to mention the feedback time reduction. So I started thinking, wow, this is amazing. And this clearly is useful beyond big companies, because after all, how many companies in the world are working at that scale? I think predicting the probability of a test failure has many uses. In my own project, I personally had this experience where we had to wait a full hour for the CI to clear a pull request before the code review could even start. And once the code review was finished, it would take another hour of the whole test cycle before that change could get in.
Or I know many of you work in places where there's a massive nightly integration test suite that takes multiple hours, so long that it can only be run every night. These are the unsexy realities that we hate to talk about at conferences, but I hear about it every other week, and I've been one of those people. So, if you could predict the failures of individual test cases, then what you can do is essentially sort things so that you run the high-risk tests first, and that results in a massive reduction in time to the first failure. As a developer, as soon as you get the first failure, you can get to work on the fix. So that'd be really helpful. Or you could also just extract the high-value portion, like a subset, and run only that. And that creates a more meaningful, adaptive subset of tests for the changes.
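The reordering idea is simple to sketch: given predicted failure probabilities (hypothetical numbers here, standing in for a model's output), run the riskiest tests first so the first failure arrives as early as possible:

```python
# Hypothetical model output: test name -> predicted probability of failure.
tests = {"test_a": 0.02, "test_b": 0.45, "test_c": 0.10}

# Run the riskiest tests first to shrink time-to-first-failure.
prioritized = sorted(tests, key=tests.get, reverse=True)
print(prioritized)  # ['test_b', 'test_c', 'test_a']
```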
And that can have so many different uses. This is, to me, the insight that drove me down this path to Launchable, so I've been working on this problem more head-on now. If this problem is interesting to you, I'd love to, for instance, swap notes and share ideas, so please drop us a note. Okay. I also need to hurry up here. So here's another example of a company using machine learning to make an impact. This one is more on the operations side. In this company, they are pretty far along in the continuous delivery process: they have hundreds of apps deploying, on average, one deployment per day each. So that's pretty awesome. On the other hand, with so many things landing in production, they are the first ones to observe the failures and firefight them.
So how can you do that better? Can we flag the risky deployments beforehand? Here, they also used similar techniques to train a model based on past deployment records. They had about one year's worth of deployments, which is 40,000 deployment records, of which only hundreds were failures. So I was thinking, okay, that's actually an amazing deployment success rate, and here they are trying to get even better. And if I got their numbers right, which, to be honest, felt almost too good to be true, but if it is to be believed, they said the model was able to predict 99% of the failing deployments, at a rate of only 5% false alarms. So that's pretty darn good. And they were also able to extract insights out of the system once they trained this model.
Before training, they had been asking the developers what they thought were low-risk changes versus high-risk changes, and they used that information as a signal. The model was able to quantitatively show that this developer perception was actually not a valuable signal; that most outages happen when the code change approval has a short time window, in other words, when the change is rushed, a problem is more likely; and that long-maintained code, the code base that's been around longer, tends to be riskier. Now you might think, well, duh, none of these things sounds particularly surprising; we all knew them instinctively, right? But I think that's where you're underestimating the power of data. In this case, this company was able to put an actual number on it.
So not only can they say long-maintained code is more risky; they can say that for every month the code base lives longer, a deployment failure gets, say, 10% more likely. And that translates into X dollars per year of lost sales opportunity or something like that. Right? That message doesn't need the receiving end to already know the value of refactoring or of keeping code current; they can simply put the dollars per year against it to justify the necessary effort, which might be, again, rewriting these old services. It doesn't take any leap of faith. That, I think, is the power of data. Seeing some of these success stories got me thinking. I used to think that a small, nimble team of elite developers can do a whole lot.
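A back-of-the-envelope sketch of that translation into dollars, with entirely hypothetical rates and costs (the 10% figure is the talk's illustrative number; everything else is made up):

```python
def yearly_failure_cost(base_fail_rate, months_old, deploys_per_year, cost_per_failure):
    """Project the yearly cost of deployment failures if each extra month
    of code age multiplies the failure odds by 1.10 (the +10%/month claim)."""
    rate = base_fail_rate * (1.10 ** months_old)
    expected_failures = rate * deploys_per_year
    return expected_failures * cost_per_failure

# e.g. a 24-month-old service deployed 300 times a year,
# with a 0.5% base failure rate and $20,000 per failure:
print(round(yearly_failure_cost(0.005, 24, 300, 20_000)))
```

Numbers like this are what turn "refactoring is good" into a budget line a CFO can compare against the cost of doing the work.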
And these large companies are slower, like Goliath; they can't really move fast, constrained by their processes compared to these small elite teams. It's like David versus Goliath. And that resonated with me, being part of a startup, a baby kind of company. But one of the common traits between those last two stories is that in both cases, they utilized the power of data at scale. Right? So I was no longer sure which are the donkeys and which are the unicorns. These smaller teams may never be able to amass the kind of data that drove the productivity gains these larger companies got. That really shook my confidence. So those are truly inspiring examples, and not many companies are quite there yet. But we're all trying to get there from being donkeys.
I know you don't want to admit that's who we are, but that's the reality. I know some traits common to the donkeys: every team doing things differently, et cetera. And I think one of the key traits of the unicorns seems to be that everyone is doing things in one way, which allows the DevOps team to be more autonomous and take control of decisions. And the other teams feel like these lower-level concerns, the quote-unquote bullshit, are taken care of by the DevOps team. That seems to create a positive feedback cycle. So, on that note: I think automation is table stakes that everyone is doing, but the next step of the journey is to use the data from that automation to drive progress. I'm looking forward to hearing more of these stories from people. I can't wait. Okay. With that, back to you.
That was Kohsuke. I have to say, what you really miss in this podcast is Kohsuke's awesome unicorn graphics. Check them out in the DevOps 2020 videos; the link can be found in the show notes. That's all for today. What I'll say to you is: don't be a donkey.