Continuous integration (CI) is an excellent source of fast feedback for teams. When development check-ins are made, the automated tests are able to verify that the system is still working as expected. However, for many teams, one or more of their tests ultimately fail, causing the entire build to break. If this happens frequently enough, the team begins to lose faith in the tests and no longer regards the red build as an indicator that something is broken (other than the tests themselves).
Enjoy a replay of The DEVOPS Conference sessions conveniently through the DevOps Sauna podcast. This talk is hosted by Angie Jones, Java Champion and Master Inventor, and Sr. Director Developer Relations at Applitools and Test Automation University.
Hi there and welcome to DevOps Sauna podcast. Eficode has organized The DEVOPS Conference or its predecessors a number of times. Last time, in April of 2021, over 10,000 people registered for the event. The DevOps Conference is a community event where attendees do not have to pay for participation. And we are happy to organize it again on March 8th and 9th, 2022. You are all wholeheartedly welcome to join the event and register at thedevopsconference.com. If you are a DevOps, Agile, or cloud practitioner or a decision-maker, I also encourage you to review our call for papers. We are introducing a whole new community track to give more stage time for people with their personal experiences and insights. All of the speeches from the previous time are available online to watch. But to make it easier for people to enjoy these great speeches, we are also sharing them through our podcast. I hope you like them as much as I do.
Our final talk for today comes from Angie Jones, a Java Champion, a Master Inventor, and Senior Director for developer relations at Applitools and Test Automation University. And she will sum up why continuous integration is not only a technical issue but also a human issue. I love how all of this comes together for us and also how to best prepare your team for a culture of continuous integration. Welcome Angie!
Hello everyone. I'm Angie Jones and I've come today to tell you a story about the build that cried broken. But first, let's talk a little bit about continuous integration just so that we're all on the same page. The process starts when a developer checks their code into a source control repository, then that code is integrated with the existing code base and built into what would be the new version of this product. And then from here, the automated tests are run as part of this build process to ensure that whatever new changes came in, they don't break anything. And finally, a report is generated, which informs the developer if her check-in is good or not. As I'm sure you all know, all of these pieces must work together in perfect harmony for this to be a healthy and reliable process.
However, there's one piece here that gives most teams trouble. Can you guess what it is? Yeah, it's the test. Fortunately, there's an ancient Greek storyteller by the name of Aesop who seemed to know a little bit about continuous integration, even all those centuries ago. Aesop left us with a few classic fables that we can apply to our CI tests, such as The Shepherd Boy and the Wolf, also known as The Boy Who Cried Wolf. As the story goes, there was a young shepherd boy who was responsible for looking after the sheep in his village. And the thing about this kid is that he was a prankster. So just for kicks, he'd often yell out, "Wolf, wolf." And all of the village people would come running concerned about their sheep, only to find the boy laughing hysterically at this little joke of his. Now, this went on for a little while and the villagers fell for the prank the first three or four times that he did this. But eventually the villagers became hip to his game and they disregarded his future alarms.
Well, one day a wolf really was coming and the boy, he screamed at the top of his lungs, "Wolf, wolf." However, no one paid him any attention and the sheep, of course, perished. For many teams, their continuous integration builds have become just like this young shepherd boy, they're crying, "Broken, broken." And in a state of panic, team members, they run to assess the build. Yet time and time again, they find that the application is in fact working but the tests are faulty and giving false alarms. So what happens here? I'm sure many of you have been in this situation. Eventually, no one pays attention to the alerts anymore because they've lost faith in what was supposed to be a very important indicator.
This used to be my life. I was working as one of many automation engineers on a development team and our tests were in pretty bad shape. So much so that just like the villagers, we stopped paying attention when the build cried, "Broken, broken." And when the build was red, we'd automatically assume it was a problem with the tests and we'd just ignore it. So one day our product folks, they needed to demo the application to a potential customer, and the demo failed. Gasp, I know. So our manager questioned us about this and asked why we hadn't been testing these important features. And we looked through the 4,000 tests that we had and there were tests for this. And yeah, they did indeed fail and tried to let us know that this was broken but we missed it because we stopped listening to these broken builds a long time ago. So we had completely lost trust in our CI tests, and that day, my manager demanded that we fix this. So take this journey with me as I share just how we did this.
There was a Fox walking along a trail one hot summer day, and the Fox became thirsty and he noticed this well of water nearby. So he hopped into the well to get a sip. And once he quenched his thirst, he tried to climb out, but he was stuck. And he tried and he tried with no luck and he eventually gave up. But don't worry, he didn't drown. A bit later, a goat was strolling along this same path and heard some noises from that well. So he saw the Fox in the well and he looked down and said, "Hey there, Mr. Fox, what are you doing down there?" And the Fox responded, "Oh, I was thirsty so I came down for a drink, the water is delicious. You should have some, join me." So the goat jumped in the well, the Fox jumped on the goat's back and was able to escape, leaving the goat down there to struggle.
The moral of this story is to look before you leap. Our team wanted to do continuous delivery and everyone told us, "Yeah, it's great down here, jump in. You'll have to automate your tests and ideally within the sprint, but yeah, come on in." And so we jumped in. And automating the tests became our focus. This was driven by the metric of 100% of our tests must be automated before we could close out the sprint. But this shouldn't have been our end goal. Right? We're doing in sprint automation, but the builds are red. And yet we're steadily adding more and more features and more and more tests. Why? We have 4,000 tests that we ignore. Why would we continue to add more? Right? Information is worthless within an ignored build. Looking at it in hindsight, our true end goal was to obtain fast feedback, but our judgment was clouded by these arbitrary metrics. We had lost our way.
One beautiful afternoon a lioness and a vixen took their kids outside to play. As the children were off playing, the mothers began to chit chat, and you know how moms are. So they can't help but brag on their kids. And as they were doing so, it became a bit of a competition where they were trying to one up each other with the brags. So things got a little bit heated and the vixen took a verbal jab by saying how great her fox was and telling the lioness, "Too bad that you only have one, I have a whole litter." And the lioness, she talked about how strong her cub was. And she thought about it and she had a rebuttal for this vixen. She paused, she held her head up high and she said, "But he's a lion."
The moral of this story is quality over quantity. We were adding all of these tests and not taking a moment to think about, what were the right tests to add? Which ones would provide valuable information? Again, if our goal is fast feedback, having thousands of low value tests running isn't exactly in line with that. So the questions we should have been asking ourselves are, based on risk and value, which scenarios would cause us to stop an integration or deployment? Which ones exercise core critical functionality? Which ones cover areas of the app that have historically been known to fail? Which ones are providing information that's not already covered in the other 4,000 tests that are in the pipeline? And once we've narrowed the quantity down, we can truly focus on the quality. Because if there's one lesson that I absolutely learned the hard way, it's that the tests that are running as part of CI/CD must be of high quality because flaky tests are the death of a continuous integration process, they're a sure way to lose all trust from your team members.
And I want to spend a little bit of time on this because so many people preach that we should avoid flaky tests, but very few will tell you how to. So I'm going to share some of the techniques that we used to strengthen our tests. Surprisingly to probably no one, many of the problematic tests were the UI ones. So as we were cleaning up the tests, some of the most common issues were things like no identifiers on some of the web elements. So needing to use CSS or XPath selectors instead. When using CSS and XPath selectors, relying on page structure was something that I saw that got us into trouble. For example, scripting your test to click the second button, that could be a problem when the structure of the page inevitably changes. Right?
The use of identifiers that are most likely to change such as a label text or something like that. And then in modern web development, the DOM is dynamic, so our identifiers needed to be smarter. Now, with this new fancy DOM and asynchronous applications, it also became pretty important to incorporate intelligent waiting in our tests. So we removed all of the hard-coded sleeps and waits. And let me tell you, if you're still sleeping in your test, wake up. Trust me, this is not what you want to do. With our 4,000 tests, even adding blind one second sleeps would drastically slow down our build. And remember the end goal is fast feedback. So we changed the tests to wait conditionally for whatever it is that we were expecting to have. For example, waiting for loading indicators to disappear or waiting for Ajax to finish executing. Right?
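As a rough illustration of waiting on a condition instead of sleeping blindly, here is a minimal Python sketch. The `wait_until` helper and the `FakePage` class are hypothetical names made up for this example, not part of any tool mentioned in the talk; real UI testing tools ship their own conditional-wait APIs.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.05):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


class FakePage:
    """Hypothetical stand-in for a page object; the spinner vanishes after a few checks."""

    def __init__(self, checks_until_loaded=3):
        self._remaining = checks_until_loaded

    def spinner_visible(self):
        self._remaining -= 1
        return self._remaining > 0


page = FakePage()
# Returns as soon as the loading indicator is gone, rather than after a fixed sleep.
wait_until(lambda: not page.spinner_visible(), timeout=2.0)
print("page finished loading")
```

The test only pays for as much waiting as the application actually needs, and a generous timeout costs nothing on the runs where the condition is met quickly.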
Sometimes we needed to wait for things that weren't happening at the UI level. So for example, API responses or database transactions. And many of the testing tools have, like sophisticated APIs, that will allow you to wait on a condition to be met versus just blindly waiting for a specific amount of time. So we used those. And we also realized how difficult it is to manage test data, especially across thousands of tests. For example, let's say that our test searches some movie database for a movie that has two parts, so there's a part one and there's a part two.
So we write our test with this expectation and then eventually a part three is released and added to the data. And now all of a sudden our tests are failing. No one wants to chase the test in this kind of way and constantly update them based on test data. So we had to determine, what's our source of truth? Is there an API or a database call that we can make that gives us this answer? And if so, we can utilize that in the test assertion as opposed to hard coding expectations that could very well change. So instead of hard coding that we're expecting two releases of the movie, we can just consult the source of truth.
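The movie example can be sketched in Python as follows. Everything here is an assumption made for illustration: the in-memory `CATALOG`, the `search_movies` function standing in for the feature under test, and `titles_in_franchise` standing in for whatever API or database call serves as the source of truth.

```python
# Hypothetical in-memory stand-in for the movie data behind the application.
CATALOG = [
    {"title": "Space Saga: Part One", "franchise": "space-saga"},
    {"title": "Space Saga: Part Two", "franchise": "space-saga"},
    {"title": "Unrelated Film", "franchise": "other"},
]


def search_movies(catalog, query):
    """The feature under test: a case-insensitive title search."""
    needle = query.lower()
    return sorted(m["title"] for m in catalog if needle in m["title"].lower())


def titles_in_franchise(catalog, franchise):
    """The source of truth: a direct lookup the test trusts (for example an API or DB query)."""
    return sorted(m["title"] for m in catalog if m["franchise"] == franchise)


def check_search_matches_source_of_truth(catalog):
    """Assert against the source of truth instead of a hard-coded count of two."""
    expected = titles_in_franchise(catalog, "space-saga")
    actual = search_movies(catalog, "space saga")
    assert actual == expected, f"expected {expected}, got {actual}"
    return len(actual)


print(check_search_matches_source_of_truth(CATALOG))

# A part three is released; the same check still passes without any edits.
CATALOG.append({"title": "Space Saga: Part Three", "franchise": "space-saga"})
print(check_search_matches_source_of_truth(CATALOG))
```

The check keeps passing when the data grows because the expectation is looked up at run time rather than written into the test.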
If I'm interested in testing anything on the UI as an end user, doing that flow, trying to essentially take those steps on the UI, not a good idea, especially if there's a lot of setup and stuff like this. For example, let's say that I wanted to write a test that is adding a product to a cart. Right? All right. So if we think about the steps of this scenario, I need to go on the UI, I have to search for the product, I have to find the specific product within the search results, I have to click on that product to be taken to the product page. From there, I'm going to have to click the add to cart button and then go to the cart and then finally I can test what it is I wanted to test. So we've added it to the cart, finally we get to the cart and we're able to test what we wanted to do.
Well, there's a lot of steps involved before that. Right? And if our end goal is fast feedback, we're violating that again. Right? So we can use things like code seams and shortcuts within our application. Maybe instead of trying to do all those steps on the UI, I can call an API that'll add this item to the cart and then just use the UI to test that. OK? All right. I'll make sure I remember where I was. So we've eliminated the dependencies on the other features, we've eliminated the flakiness with the UI testing. So all of this is helping our end goal. And I'm going to say that a couple of times just so that we all remember it because it's really easy to forget, what's our end goal. So if we know what that is, fast feedback. And yours may be different, but knowing what that is, you can align your actions to that. All right. Story time.
So there was an astrologer and this guy he was really deep, he was so into the stars and the moon. And he didn't bother talking to fellow humans much. Well, the astrologer, he was out one evening for a walk and he was gazing up at the stars in an attempt to read what was coming in the future. And just as he thought he was making sense of the constellations, he fell into a big puddle of muddy water. The moral of this story is to pay attention to what's in front of you. So we were so gung ho on planning for the future that we neglected to pay attention to what was right in front of us. Everyone on the team was focused on new feature development, adding new tests, and we didn't spend any time trying to monitor the existing tests or maintain them. We didn't take the time to listen to what the tests were trying to tell us, what patterns had revealed themselves. And we had come to value new code over working code.
With this realization, we understood we had to take a step back because, you see, tests are living, breathing creatures. They must be monitored daily if possible. So our immediate goal was to ensure that the builds are green. Well, we had hundreds of tests that were in bad shape, so this couldn't be a little, I'll fix the test when I have a spare moment type of deal, no, we needed to focus. And to accomplish this, we decided to create a task every sprint for monitoring the tests. And the rule was, anyone can take it, but of course that meant just the folks who were responsible for tests. But in our two week sprints we'd have two tasks, one for each week. And someone was basically on duty each week to monitor and maintain the tests. And this was their sole responsibility for the week, so no development, no writing tests. Your sole responsibility is to fix broken tests this week.
Now this got us some great momentum going, however, with so many tests failing on every single build, it became really difficult to even know which ones do I go after. We were in bad shape. So we decided to keep our test results in a database, nothing major. It was one table, five columns. But this helped us to triage better. So anytime a test was executed, the test would write its status to this database. And what this now gave us was a way to gather statistics around the tests. So we could do things like sort the results to see which one is more likely to fail. Right? This can indicate flakiness or problematic areas in the application. So this was a great starting point. Or we could see which ones have failed for the very first time. That's interesting, right? Because these are more likely to be the real bugs in the product. This test has been working great before, now all of a sudden it fails. That might be a cause of concern. So this was extremely helpful in figuring out what we should be focusing on.
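A results table like the one described can be sketched with Python's built-in `sqlite3` module. The talk only says it was one table with five columns, so the column names below, the sample rows, and the two queries are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One table, five columns; the actual columns used by the team are not stated.
conn.execute(
    """
    CREATE TABLE test_results (
        test_name   TEXT NOT NULL,
        build_id    INTEGER NOT NULL,
        status      TEXT NOT NULL,      -- 'pass' or 'fail'
        duration_ms INTEGER,
        run_at      TEXT
    )
    """
)


def record_result(test_name, build_id, status, duration_ms=0, run_at=""):
    """Called after each test execution to store its outcome."""
    conn.execute(
        "INSERT INTO test_results VALUES (?, ?, ?, ?, ?)",
        (test_name, build_id, status, duration_ms, run_at),
    )


def failure_rates():
    """Tests sorted by how often they fail, most failure-prone first."""
    return conn.execute(
        """
        SELECT test_name, AVG(status = 'fail') AS failure_rate
        FROM test_results
        GROUP BY test_name
        ORDER BY failure_rate DESC, test_name
        """
    ).fetchall()


def first_time_failures(build_id):
    """Tests that failed in `build_id` and never failed in an earlier build."""
    rows = conn.execute(
        """
        SELECT test_name FROM test_results
        WHERE build_id = ? AND status = 'fail'
          AND test_name NOT IN (
              SELECT test_name FROM test_results
              WHERE build_id < ? AND status = 'fail'
          )
        ORDER BY test_name
        """,
        (build_id, build_id),
    ).fetchall()
    return [row[0] for row in rows]


# Made-up history for three builds.
history = {
    "test_login": ["pass", "pass", "fail"],
    "test_search": ["fail", "pass", "fail"],
    "test_cart": ["pass", "pass", "pass"],
}
for name, statuses in history.items():
    for build_id, status in enumerate(statuses, start=1):
        record_result(name, build_id, status)

print(failure_rates())
print(first_time_failures(3))
```

In this made-up history, `test_search` sorts to the top as the likeliest flaky test, while `test_login` is the only first-time failure in build 3 and so the likeliest candidate for a real bug.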
All right, let's talk about the farmer and the stork. The farmer, he was having a problem with a flock of crane birds that would come and they would dig up all of his planted seeds. And after this happened a couple of times, he decided to set a trap for them. Well, the crane birds were on their way to the farm when they ran into the stork. And they invited their new friend to come with them to get some food. And everybody loves free food, so the stork followed along without much hesitation. When the birds got to the farm and they began to pick at the seeds, they were captured by the net that the farmer had cast for them. So the stork immediately began to plead his innocence, insisting that he had just met the other birds and he had no idea that these were stolen goods and was begging for mercy. And the farmer frowned and said to the stork, "I don't care, birds of a feather flock together." Or in other words, one bad apple spoils the whole bunch.
We had been making great progress cleaning up our tests, right? Slowly but surely the number of failed tests was decreasing each and every day, but no one cared. Right? So we would report that as our status at standups and no one cared, why? Because the build is still red. All it takes is one flaky test to make the entire build red, right? So even if most of the tests are okay and most of them are green, it doesn't matter, no one trusted the build as a whole. So we had the great idea to separate the tests into two builds. One for the reliable trustworthy tests, one for the unreliable tests. And we ran both builds but the trustworthy build was the one for the team to use for CI.
Now, there were a couple of reasons why we kept both of the buckets of tests running. One, we wanted to continue to track the intermittency of the tests in our nice little database. And then the other reason is, out of sight out of mind. If we just took the bad tests out, there would be no visibility into just how much debt we were accruing, right? But with this approach, we could easily show a build and say, "Hey, listen, we need time to stabilize this, it's getting out of control." So our goal was to have less than 1.5% of the tests as unstable, which was about 60 tests. And we felt pretty good with this 1.5 number because Google's number is 3%, so we were doing better than them. But seriously, if we went over this number, then new development stopped and it was all hands on deck to get the test failures back down.
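The arithmetic behind that budget is small enough to sketch in Python. The function names are hypothetical, and the limit is expressed in tenths of a percent so the calculation stays in whole numbers.

```python
def max_unstable_allowed(total_tests, limit_tenths_of_percent=15):
    """Largest number of unstable tests allowed; 15 tenths of a percent is 1.5%."""
    return total_tests * limit_tenths_of_percent // 1000


def must_stop_new_development(total_tests, unstable_tests, limit_tenths_of_percent=15):
    """True when the unstable bucket has grown past the agreed budget."""
    return unstable_tests > max_unstable_allowed(total_tests, limit_tenths_of_percent)


print(max_unstable_allowed(4000))  # 60
print(must_stop_new_development(4000, 61))  # True
```

With 4,000 tests the budget works out to the 60 tests mentioned above, and the 61st unstable test is the one that triggers all hands on deck.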
The approach we used to split the tests into buckets was test annotations. So this typically comes in test runners that we use. And we use the unstable test group for tests that needed to be updated. And we use bug for tests that are failing because of a known bug. And also when we think we fixed the test, we didn't immediately move it back to the reliable bunch. We let it run maybe a good 10 times successfully in the other build before we would deem it trustworthy again. And for anything that we marked as unstable, we opened a task for it and that goes on the backlog. And what I loved about having this in the backlog is that it gets prioritized right along with all of the other tickets, right? So we base the priority on how important the test is. So we would pull what we could manage into the sprint and in addition to whoever was on duty that week, other team members could also pitch in and grab a task if they had time.
Now, one lesson that I did learn while fixing up these tests is that if you find yourself patching and patching, it may be better to just delete the test and start over. So I heard a talk from this guy named Professor Atif Memon, and he teaches at the University of Maryland. Dr. Memon, he spent a summer working at Google and actually studying the fragility of tests. And he found that after three edits of a test, the test is exponentially more likely to become flaky, did you know that? So as I'm wrestling with a particularly troublesome test for about an hour, I couldn't get it to work, I remembered Dr. Memon's words and I thought to myself, "I bet this is about like the fourth edit or something." So instead of fighting the test, I killed it, I deleted it. I coded it from scratch and then it worked.
So it took me like 25 minutes or so to write the entire test over versus the hour of struggling with no success. Also that test went from 113 lines of spaghetti code to 38 lines of clean code. So I say all that to say, it's okay to delete tests sometimes altogether, without even writing them over if you find they are no longer providing value to you. Or if you want to keep them around and maybe rewrite them, we refactor our development code all the time, so it's okay to refactor tests as well.
All right, one more story. A dog and a hare got into a heated debate about which one was faster. And after a few minutes of squabbling, they decided to settle it once and for all with a race. The hare was smoking the dog, and the dog eventually caught up, but not enough to pass the hare, right? So frustrated that he was losing, the dog grabbed hold of the hare and bit him, yes. So the hare screams out like, "Bro, what are you doing?" And the dog apologizes, he licks the hare's wounds, and he begs for a rematch. The hare being the kind, forgiving soul that he is, he agreed to another match. So they race again and just like last time, the hare is leaving the dog in the dust. And also just like last time, the dog is able to grab him and yet again bites him. So the hare yells out, the dog apologizes, and again, he begs for a rematch. The hare says, "No, bro! One minute you're biting me, the next you're licking my wounds. I can't trust you. I don't know if you're friend or foe."
Now, once we got our failed tests down to less than 1.5% and they were in a healthy state, we had to make sure that the new tests that we introduced didn't become a problem. It took a lot of effort to get our builds to a place where we trusted them. And all you need is one new bad apple coming in and messing up the program, and then everyone's rolling their eyes again. So if you have a lot of people contributing or if you see that you're introducing a lot of flakiness with the new tests, I recommend putting the tests in a private build that only you are monitoring. And after maybe like 10 successful runs, then you can add them to the big stage, the CI build.
The thing is, we sometimes are asking people to trust tests that we don't yet trust ourselves. And this is especially true when you have new feature areas or you have new people joining the team. I do realize that there's a lot of overhead because you're managing yet another build. But the upside of this is people are lazy, we don't want to manage yet another build, so people get in the habit of coding responsibly so that their tests can go to the big stage right away. So once we got to a good system where the builds are only red when there's something wrong with the application, we were able to use CI as it's supposed to be used, where devs are watching the builds upon check in and making sure that they're okay.
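To make the grouping idea concrete, here is a minimal Python sketch that tags tests with group names and selects them into two builds. The `tagged` decorator, the registry, and the example test names are all hypothetical; real test runners provide their own annotation or grouping mechanism. Sending the `bug` group to the unstable build alongside `unstable` is also an assumption, as is treating "a good 10 times" as ten consecutive passes.

```python
REGISTRY = []
QUARANTINE_GROUPS = {"unstable", "bug"}


def tagged(*groups):
    """Register a test function and tag it with zero or more group names."""
    def decorator(func):
        func.groups = frozenset(groups)
        REGISTRY.append(func)
        return func
    return decorator


@tagged()
def checkout_flow():
    pass  # a trusted test


@tagged("unstable")
def search_suggestions():
    pass  # needs to be updated before it can be trusted again


@tagged("bug")
def discount_rounding():
    pass  # fails because of a known product bug


def select_build(build):
    """Return the names of the tests that belong to the given build."""
    if build == "reliable":
        return [t.__name__ for t in REGISTRY if not (t.groups & QUARANTINE_GROUPS)]
    if build == "unstable":
        return [t.__name__ for t in REGISTRY if t.groups & QUARANTINE_GROUPS]
    raise ValueError(f"unknown build: {build}")


def ready_to_promote(recent_statuses, required_passes=10):
    """True when the most recent `required_passes` runs all passed."""
    tail = recent_statuses[-required_passes:]
    return len(tail) == required_passes and all(s == "pass" for s in tail)


print(select_build("reliable"))
print(select_build("unstable"))
print(ready_to_promote(["fail"] + ["pass"] * 10))
```

Moving a test between builds is then a one-line change to its tag, and `ready_to_promote` can be fed from recorded run history so the decision to move a test back does not rest on memory.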
So we had our 4,000 and counting tests and we achieved the goal where there were no more than 1% of the tests that were unstable. And we honored that rule, that if we went above this number, that means we are out of control and stabilizing the existing tests became a higher priority than writing new features. The reliable build was green unless there was a change with the app. This is how it's supposed to work, right? So that change could be a bug or it could be an intended change, either case the test did what it was supposed to do, by letting us know this. And most importantly, the team began to trust the results again.
Thank you for listening. To register for The DEVOPS Conference, go to thedevopsconference.com. If you have not already, please subscribe to our podcast and give us a rating on your platform. It means the world to us! Also check out our other episodes for interesting and exciting talks. I say now, take care of yourself and see you at The DEVOPS Conference.
Watch all recordings from The DEVOPS Conference 2021 on the event's page: https://www.thedevopsconference.com/speakers