It's time to transform the effectiveness of your incident response strategy, because cyber threats are inevitable and becoming increasingly sophisticated. In this episode, Marc and Darren highlight the importance of disconnecting rather than shutting down during incidents, and more. Join the conversation at The DEVOPS Conference Global and experience a fantastic group of speakers alongside a worldwide live and online audience of practitioners and decision-makers.

Darren (00:00): I think if there's one thing to learn from that is if you suspect something, disconnect, do not shut down.

Marc (00:17): Welcome to DevOps Sauna season four, the podcast where technology meets culture and security is the bridge that connects them. Welcome back to the DevOps Sauna. I'm here with Darren again. Hi, Darren. 

Darren (00:37): Hey, Marc.

Marc (00:38): It's always nice to have a conversation, especially in your area of expertise about security. 

Darren (00:44): Yeah, I always feel a little bit more at home discussing security than I do the DevOps topics. So let's see what we can get up to today.

Marc (00:52): Yeah, I heard that you are one of what, three people that could get tickets to Disobey.

Darren (00:57): Yeah, the tickets for Disobey sold out extremely quickly this year. I think it was two minutes, and then they were gone. But I was lucky enough to get some of them. So it was an extremely cool event.

Marc (01:08): Pretty high demand. How many people were there?

Darren (01:11): I think 1,800 attendees in total over the two days.

Marc (01:17): That's like everybody's just sitting there refreshing, waiting for the tickets to open and then boom, it's all gone.

Darren (01:23): Almost instantly, yeah. But there were a lot of cool talks. There was one talk given by the police, which was entertaining, if slightly intimidating, to have three people in police uniform stood on stage.

Marc (01:34): Help the police please, ladies and gentlemen, and those who have not yet decided. So any big topics there? Is there something that we could talk about today that came up at Disobey?

Darren (01:43): It's actually a topic that I think was kind of missing from Disobey, because there were a lot of technical things. But one topic I think we should discuss, that everyone needs to be thinking about right now, is incident response. There was actually very little about it there, though maybe I was busy with Capture the Flag while the speakers were covering that particular subject. But maybe you and I can handle it here.

Marc (02:06): I think we can. I like this topic a great deal for many, many reasons. But I also understand it's not the sexiest topic in the room for a security conference. I think this is one of the areas where companies can do an awful lot and get a lot of bang for the buck just by taking appropriate actions to have a proper incident response policy in place. What is incident response?

Darren (02:30): It's exactly what it says on the tin: it's responding to security events as they happen. It's usually broken down into phases. You have the discovery phase, which is the moment where you realize something's wrong and say, okay, that is a security incident; then an analysis phase where you analyze what's happening; then an action phase where you don't just talk, you actually implement the fixes, whether they're temporary or permanent; and then a postmortem phase, root cause analysis, that kind of thing. All four of these together make up incident response.
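
A minimal sketch, not from the episode, of the four phases Darren lists, modeled as a simple ordered structure a team's own tooling might use to track where an incident currently stands. Names and fields are assumptions for illustration only.

```python
# Illustrative sketch (not from the episode): the four incident response
# phases Darren lists, as an ordered enum a runbook tool could use to track
# which phase an incident is in. Structure and names are assumptions.
from enum import Enum, auto
from dataclasses import dataclass, field
from datetime import datetime

class Phase(Enum):
    DISCOVERY = auto()   # we realize something is wrong
    ANALYSIS = auto()    # we work out what is actually happening
    ACTION = auto()      # we implement fixes, temporary or permanent
    POSTMORTEM = auto()  # blameless review, contributing-factor analysis

@dataclass
class Incident:
    title: str
    phase: Phase = Phase.DISCOVERY
    history: list = field(default_factory=list)

    def advance(self) -> None:
        """Move to the next phase and record when the transition happened."""
        order = list(Phase)
        idx = order.index(self.phase)
        if idx < len(order) - 1:
            self.phase = order[idx + 1]
            self.history.append((self.phase.name, datetime.now().isoformat()))

incident = Incident("Suspicious outbound traffic from a build server")
incident.advance()  # DISCOVERY -> ANALYSIS
print(incident.phase, incident.history)
```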

Marc (03:09): I think the neat thing about this is that you can do it whether you have tools in place or not, can't you?

Darren (03:16): Yeah, I'd say this comes mostly down to policy and mindset, because if you asked me about good incident response tools, I'd have to say there probably aren't any: you have to be able to respond to any kind of incident, and if you have a tool for it, it's going to be super focused on a specific area. It's more about how people are trained, how people go through annual security trainings, any simulations, any testing you have. You build up this resilience to bad situations happening: basically, how to not make them worse, how to prevent the spread, how to mitigate them and limit them as much as possible.

Marc (04:04): I talk a lot in my work and mentoring about the art of practice. And one of the things that you highlighted for me when we've been talking about incident response is practicing: going through drills, fire-drill types of things, going through how you respond to various incidents, and simulations was a word that you had used. I think it's certainly not about the tools, and it's not only about the policies; it's also about practicing and exercising how you will handle different types of incident responses, and anticipating those that may come in from a place you weren't expecting at all. I guess COVID was one of those.

Darren (04:47): Precisely, yeah. And the importance of drilling is one of the things that's so understated. If you have a perfect incident response plan that you've never practiced, it's essentially useless; it's just information on a bit of paper. And then when you start having an incident, if that's the first time you sit down and read your incident response plan, that is just time lost as you try to understand what is written there. So you need to drill. Not only do you need to drill the people, but you need to drill the plan itself: you need to identify which parts of the plan make sense, which parts are valid, which parts don't make sense and should be removed or refined. So you end up with this circular set of testing where you generate a test and you test in both directions. You test people's ability and readiness to handle incidents, and you test the plan's ability to support people in handling those incidents. Then you run through that a handful of times and you get something that should work, but you can't know if it does work until something actually happens and you have to do one of these things for real.

Marc (06:01): This is not a drill. So what kinds of things then go into an incident response plan? Maybe we should start with the identification phase? How do we identify different types of incidents? And maybe what are the different types of incidents?

Darren (06:18): Okay, so an incident response plan isn't going to have categorizations of incidents, because an incident response plan is something you actually pull out when you know you have an incident. It's there for specific incidents. But an incident can be anything. One example I like to talk about that actually happened at Eficode is someone losing their phone by dropping it in the Baltic Sea. That's a security incident, but it's not one we can do anything about. We're not going to get scuba gear and start diving for that phone; we just have to assume the tides will take care of it, remote lock it as much as we can, and move on. But these kinds of things happen. I'm wondering if this is going to backfire at some point and someone's now going to go diving for an Eficode phone in the Baltic Sea. If that happens, I'm going to be in a lot of trouble, but I think we'll be safe. So that's an incident, but it wouldn't necessarily trigger the incident response plan. Incident response plans are for active incidents of a certain threshold: things like active attacks, malware being found on systems, or traffic from suspicious IP addresses coming in within a rapid time frame. So the incident response plan needs to be as flexible as the types of incidents, because as soon as we narrow it down to "here are the incidents we might face", the moment we face an incident outside that list, the question becomes: does this trigger the plan at all? Does this become a problem? So we have to keep it general. We have categorizations of incidents in various places in documentation, and then an incident response plan that says: if something is considered by the security team to be a large enough incident, the plan is triggered. Having that flexibility, and the judgment of the security team, is what starts the plan.
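
As a rough illustration of that "large enough to trigger the plan" judgment, here is a hedged sketch; the severity scale and threshold are invented for the example and are not Eficode's actual policy.

```python
# Illustrative only: one way to encode "the plan triggers when the security
# team judges an event large enough", rather than enumerating every possible
# incident type up front. Severity levels and the threshold are assumptions.
from dataclasses import dataclass

SEVERITY = {"low": 1, "medium": 2, "high": 3, "critical": 4}
TRIGGER_THRESHOLD = SEVERITY["high"]  # assumed threshold

@dataclass
class SecurityEvent:
    description: str
    severity: str  # assigned by the security team's judgment, not a fixed taxonomy

def triggers_incident_response(event: SecurityEvent) -> bool:
    return SEVERITY[event.severity] >= TRIGGER_THRESHOLD

lost_phone = SecurityEvent("Company phone dropped in the Baltic Sea", "low")
active_malware = SecurityEvent("Malware found on a production host", "critical")

for event in (lost_phone, active_malware):
    action = "trigger IR plan" if triggers_incident_response(event) else "log and handle routinely"
    print(f"{event.description} -> {action}")
```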

Marc (08:19): Good. And we don't need to go too deeply into it, but there are also all types of human incidents that require response. I saw somebody come in through a door that I had opened with my tag, just as my elevator door was closing. And by the time I got back down, I couldn't find them. So that is a security incident.

Darren (08:42): So we're going to find out who's in more trouble later today when we find out who that person is, and whether the phone gets lifted.

Marc (08:48): This is an old one. And what I learned from this is that many security officers are quite eager for you to over-report incidents, and would rather you file incidents than not. So in a case like that, I filed it with our security officer, and they cleared the situation fairly quickly. They found that a key had been used within essentially moments of my key being used, and that it seemed to have been a legitimate entry in that case. But these types of things can come up.

Darren (09:20): Yeah, definitely.

Marc (09:20): So there's a simple type of analysis that I just described, but let's talk about some perhaps more technical ones. So what other types of analysis are there in phase two of an incident response?

Darren (09:31): What we will typically see is that if you are ISO compliant, you will have a centralized logging system. If you are compliant with the 2022 version of ISO 27001, you will have some threat analysis happening against that logging. Maybe it's AI, maybe it's standard algorithms, but it will be analyzing traffic and deciding whether it's suspicious. This is something security teams have been doing for 20 or more years. And what we're looking for, basically, is strange metrics. That's what triggers our suspicion most of the time: an odd metric, like seeing odd CPU usage where we don't expect it, or processes running under users we don't expect to be running those processes. We might track recent changes to files. For example, back in 2007, when PHP was still kicking, you could perform these quite sophisticated attacks by injecting obfuscated code into the first line of PHP files: you'd add a load of spaces and then put your code after them, so when you opened the file in a text editor that didn't wrap lines, it would seem empty, with just a little marker at the right side showing that the line continued. So what we're looking for in the discovery phase, and let's say discovery and analysis kind of merge together here, is anything out of the ordinary. And then once we know there's something out of the ordinary, we want to isolate it. That means pulling it off the network, not switching it off, and that's extremely important. I think if there's one thing to learn from this, it's: if you suspect something, disconnect, do not shut down. Because if you shut down, you can lose things that are loaded into memory, you can lose temporary files, which might be important. So keep in mind: disconnecting to prevent any lateral movement, but not shutting down, is ideal.
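
A rough sketch, purely illustrative, of the kind of "strange metric" checks Darren mentions: processes running under unexpected users, and recently modified files in a web root. The paths, the user baseline, and the one-hour window are assumptions; real detection would live in your centralized logging or SIEM, not an ad hoc script.

```python
# Illustrative sketch of two simple anomaly indicators: unexpected process
# owners and recently modified files (e.g. PHP files with injected code).
# EXPECTED_USERS, WEB_ROOT, and RECENT_SECONDS are assumptions for the example.
import os
import subprocess
import time

EXPECTED_USERS = {"root", "www-data", "postgres"}  # assumed baseline
WEB_ROOT = "/var/www"                              # assumed path
RECENT_SECONDS = 3600

def unexpected_process_owners() -> list[str]:
    """Return 'user command' lines for processes owned by users outside the baseline."""
    out = subprocess.run(["ps", "-eo", "user,comm"], capture_output=True, text=True).stdout
    findings = []
    for line in out.splitlines()[1:]:
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] not in EXPECTED_USERS:
            findings.append(line.strip())
    return findings

def recently_modified_files(root: str) -> list[str]:
    """Files changed within the last hour under the given directory."""
    cutoff = time.time() - RECENT_SECONDS
    hits = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) > cutoff:
                    hits.append(path)
            except OSError:
                pass
    return hits

if __name__ == "__main__":
    print("Unexpected process owners:", unexpected_process_owners())
    print("Recently modified files:", recently_modified_files(WEB_ROOT))
```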

Marc (11:42): Nice. So we do some analysis in order to identify incidents. And then we have another phase called analysis where we try to figure out what to do about it.

Darren (11:55): Yep, that's true. And this is actually one of the more boring phases, because there is this mentality in cybersecurity, I don't know if you've run into it, that says there is no surefire way to make sure you've cleaned up everything. And that's why we live in an era of backups; it's why backups are so vital. The correct thing to do if you have a security incident is to nuke everything and restore from backups. That's the safe play.

Marc (12:21): I'm not sure exactly how many of our customers and listeners are going to like that as a solution, but can you open it up a little bit?

Darren (12:30): Maybe you can tell me why you don't think they'd like it?

Marc (12:34): Yeah, loss of business is a concern. And of course, I understand: we are compromised. But we haven't discussed things like attack surface yet. There was an interesting case that came up last week: in a public NuGet repository, a vulnerability was identified, and then it was retracted. What happened is that things were failing in some places, and then when it was retracted, people were saying, hey, we're not even necessarily a delivering-to-production type of system. We are an in-house system that didn't have literally any attack surface in this area. So why would we allow our in-house systems to stop for a little while just because a vulnerability may have been identified, when we didn't even analyze whether it had a valid attack surface? In this case it very likely would not have, because it was all inside a firewall. And it wasn't a malware thing; it was more like a potential vulnerability in a package that wasn't even revoked.

Darren (13:31): But you raise a great point there, because what you're actually talking about is not a security incident. In security, we have two types of events. We have security notifications, which are a vendor saying a package is potentially vulnerable in some way. And then we have an incident, which is an instance of a vulnerability or something else actually occurring. Security notifications that just say patch this, patch that, upgrade GitLab, what have you: these are not security incidents. And this distinction is important, because if you treat these as security incidents, you will have a security incident every other day at your company, depending on how large your tech footprint is. This isn't a security incident. It is an event, a notification. And all that needs to happen is patching. It's that simple: you just need to patch. However, an incident is a specific instance of an exploit, a specific instance of malware, a specific event that happens, not the possibility of an event. And it's important to draw that line, because incident response only deals with specifics, not possibilities. It requires an actuality.
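
To make that distinction concrete, here is a small sketch of routing a notification versus an incident; the fields and wording are illustrative assumptions, not a formal taxonomy.

```python
# Sketch of the distinction Darren draws: a notification (a vendor advisory
# saying "this package may be vulnerable") goes to routine patching, while an
# incident (an observed, concrete event) triggers the incident response plan.
# The field names and examples are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class SecuritySignal:
    summary: str
    observed_occurrence: bool  # an actual event, not just a possibility?

def route(signal: SecuritySignal) -> str:
    if signal.observed_occurrence:
        return "INCIDENT: trigger the incident response plan"
    return "NOTIFICATION: schedule patching / upgrade through the normal change process"

advisory = SecuritySignal("NuGet package flagged as potentially vulnerable", observed_occurrence=False)
breach = SecuritySignal("Obfuscated code injected into PHP files on the web server", observed_occurrence=True)

for s in (advisory, breach):
    print(f"{s.summary} -> {route(s)}")
```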

Marc (14:51): Very important definition. Thank you. I knew this, but I didn't exactly put two and two together that until we have actually identified that there's a malicious actor, it's not an incident.

Darren (15:03): Precisely. That's when the nuke-everything-and-reinstall type of response makes sense: when there is a specific threat actor, when there is a specific instance. When it comes to notifications, obviously at Eficode, on the ROOT side, we don't blow up the platform and rebuild every time we get a security notification; that would be a considerable waste of time.

Marc (15:25): Yes, I understand. Good. There's been a change over the years, as I understand it, where there's been perhaps more emphasis on incident response, and I don't want to say that people aren't paying as much attention to actually keeping the system secure. But can you even have a secure system, really?

Darren (15:46): I don't think so. Not while you want it to be online and functional, right? You can have a system that is as close to secure as you can get it. But let me ask you, how familiar are you with defense in depth?

Marc (15:58): I have some familiarity with certain areas.

Darren (16:02): Okay. So defense in depth, if I give a 30-second rundown, is this idea of breaking security into different layers. You end up with this image that's like an onion. You have all these layers of security, and the outer one is your outermost protection. If you're talking about an office, that might be the building: walls, fences. Then it goes all the way down through network security and endpoint security, so your devices and their antivirus, right to the core of protected assets, things like your code base and your databases, and these different layers of security apply to each of them. And this is the switchover I think you're describing: from prevention, which was perimeter-based security, where you keep the outside out and assume that anyone on the inside is supposed to be there, to a lower-trust model, not zero trust, but a lower-trust model of protection at every level. And that change is going on now, I believe.
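
A compact sketch of the onion Darren describes, with each layer carrying its own controls; the specific controls listed are generic examples, not a checklist from the episode.

```python
# Illustrative sketch of defense in depth: layers from the outermost
# perimeter down to the protected assets, each with example controls.
DEFENSE_IN_DEPTH = [
    ("Physical / perimeter", ["building access tags", "fences", "visitor policy"]),
    ("Network",              ["firewalls", "network segregation", "internal ACLs"]),
    ("Endpoint",             ["device encryption", "antivirus / EDR", "patching"]),
    ("Application",          ["authentication", "least privilege", "input validation"]),
    ("Data / core assets",   ["code base and database access control", "backups", "encryption at rest"]),
]

for layer, controls in DEFENSE_IN_DEPTH:
    print(f"{layer:<22} {', '.join(controls)}")
```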

Marc (17:05): Least privilege and things like this.

Darren (17:10): Yeah, the principle of least privilege, but also things like network segregation and internal access control lists to ensure that your least privileged people can't access highly sensitive data. These are all controls that go into that defense in depth. And this is the mentality that's shifting, because we've stopped talking about "if" when it comes to cybersecurity and started talking about "when". Well, let's say I would like to believe people are talking about "when" and not "if". And if you are still talking about "if" you have a breach, that's a problematic mentality that will lead to focusing on the perimeter and on keeping people out, instead of having accurate and adequate responses.
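
As a toy illustration of least privilege and internal access control, here is a default-deny check where a role only reaches a resource it is explicitly granted; the roles and resources are invented for the example.

```python
# Toy access control list in the spirit of least privilege: default-deny,
# with access only when a role is explicitly granted. Roles and resources
# are assumptions for illustration.
ACL = {
    "customer-database": {"dba", "security"},
    "build-server":      {"developer", "ops"},
    "hr-records":        {"hr"},
}

def can_access(role: str, resource: str) -> bool:
    """Default-deny: access only when the role is explicitly listed."""
    return role in ACL.get(resource, set())

print(can_access("developer", "build-server"))       # True
print(can_access("developer", "customer-database"))  # False: least privilege
```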

Marc (17:59): Good. I thought when I asked whether we can have a secure system, you would just say "if we switch it off."

Darren (18:07): Pretty much. I mean, that's the old joke: if you want a secure system, switch it off, pull the network cable, and lock it in a cabinet.

Marc (18:14): You can also see, I believe it's called no code by our old friend Kelsey Hightower, if you would like to. When you have no code, it is absolutely secure.

Darren (18:24): I see. I'm not familiar with that. How does that work?

Marc (18:27): It's a satire on software development that also goes into the only secure system is no system, if I remember correctly.

Darren (18:36): So if you don't code it, it can't be hacked. I like it.

Marc (18:39): That's absolutely true. All right. So this has turned really interesting: it's not so much about the tools, it's not so much about the technology, and it's not even just having the policies. It's about continuing to practice with these, encouraging the understanding of what incident responses truly are, and working towards analyzing them. Then we get to the point where we have to start fixing things, and you mentioned nuking everything and starting from scratch. Are there any other fixes that we need to think about, or that someone might not have thought of?

Darren (19:18): You can always try, but it's always going to be second best compared to restoring from a backup. So many security incidents of many kinds come back to having a robust backup policy: ensuring that you have those backups in case of malware, in case of ransomware, in case of corruption. Obviously, if someone gets into your network and you are unhappy with that, you can block them out, you can remove their backdoors, and if there's no persistence there, you can monitor the server. However, it's difficult to know that you have ever cleared 100% of their traces to prevent them from getting back in. It's like trying to save one of your fingers. It's a weird position to be in, where you have someone who wants to save a specific server and they don't understand. They're taking the approach of pets instead of cattle, because this is their server and they want to keep that server, and they don't understand that the server is dying and potentially spreading whatever killed it to the rest of their network. So it's a difficult mentality to face off against.
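
One small, assumed piece of such a backup policy, sketched for illustration: verifying a backup against the checksum recorded when it was taken, before you nuke a compromised server and restore from it. The manifest path and format are assumptions, not a real tool.

```python
# Illustrative sketch: verify backup files against a recorded checksum
# manifest before restoring. The manifest path and its JSON format
# ({"filename": "<sha256>"}) are assumptions for the example.
import hashlib
import json
from pathlib import Path

MANIFEST = Path("/backups/manifest.json")  # assumed location

def sha256_of(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_backups(manifest_path: Path) -> bool:
    manifest = json.loads(manifest_path.read_text())
    ok = True
    for name, expected in manifest.items():
        if sha256_of(manifest_path.parent / name) != expected:
            print(f"MISMATCH: {name} has been altered or corrupted")
            ok = False
    return ok

if __name__ == "__main__":
    print("Backups verified" if verify_backups(MANIFEST) else "Do not restore blindly")
```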

Marc (20:29): Of course, in IT in general we have practiced for years testing that you can actually restore from backups. I guess the infrastructure as code aspect is also an interesting thing, to make sure that people are able to redeploy systems from backups.

Darren (20:46): Definitely. That's part of simply being able to get back up as fast as possible. When we do these disaster recovery and backup testing exercises, we're looking at two metrics: the objective of returning to normal business, basically, and the objective of, let's say, having everything resolved. So we have these two metrics: one is the sooner metric of "okay, things are okay now", and that slightly leads the metric of "things are restored to standard business". And it's all about getting the time to those metrics as short as possible. So anything you can do to speed that up is great, and infrastructure as code is one of the most powerful tools you can have: the ability to redeploy immediately, or I should say, make small changes and then redeploy immediately, because if you redeploy exactly the same vulnerable software that someone just got into, they're going to get into it again. So you patch, redeploy, and restore.
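
A sketch of the two recovery measurements Darren gestures at, with made-up timestamps; the point is only that drills, backups, and infrastructure as code exist to shrink both numbers.

```python
# Illustrative only: the two recovery times described in the episode,
# "things are okay now" (service restored) and "back to standard business"
# (everything fully resolved). Timestamps are invented for the example.
from datetime import datetime

detected         = datetime(2024, 3, 1, 9, 0)
service_restored = datetime(2024, 3, 1, 13, 30)  # patched, redeployed from code, restored from backup
fully_resolved   = datetime(2024, 3, 2, 10, 0)   # postmortem done, everything back to normal business

print(f"Time to restore service: {service_restored - detected}")
print(f"Time to full resolution: {fully_resolved - detected}")
```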

Marc (21:49): One thing I'd like to try and challenge, and it's difficult to challenge you in your own garden, Darren, but the term root cause analysis has annoyed me for years. I converted to an alternative, which is contributing factors analysis. And we still talk about root cause analysis all the time; it's even in a lot of specifications and standards. But it always feels like a little bit of the wrong idea to me, because it's never one reason that things fail. There's always, I think, more than one. And that's why I always try to at least talk about contributing factors. But what do you expect from root cause analysis in incident response plans?

Darren (22:25): I actually think you're spot on there. We talk about root cause analysis, but then there's also this term that people use, which is the blameless postmortem, where they want to look at it and decide what happened and why. And if you do root cause analysis, we already know that 90% of security incidents are caused by a person who made a mistake. So root cause analysis is the opposite of a blameless postmortem, because you're quite literally pointing at a person and saying: okay, this incident was your fault, because you are the root cause. And that actually helps no one, because it doesn't matter that we know 90% of incidents are caused by people. We know this. It's a known factor, and it's not going to change. Well, I wish it could, but it's not going to, because people are people and people make mistakes. That's why so much of incident response is around processes and people rather than tools and technologies: it has to be about understanding people. So this contributing factors idea is actually where we should be with incident response, but it's still called root cause analysis. Actually, if we look at a root cause and think about a tree, it makes sense, because there isn't a single root for the tree; there is this expanding network underground of different tendrils reaching into different places. If we examine that as the root cause, and then this causes the tree, then it makes sense, but I'm sure that's not how people were thinking about it. So, 100%, contributing factors analysis is where you need to go, because even if a person makes a mistake, it means there wasn't a safeguard there for them, there wasn't a guardrail for them, there was nothing to help them, maybe not even awareness training. So yes, contributing factors analysis is definitely the way to go. And if someone like me says root cause analysis, challenge me and tell me I'm wrong.

Marc (24:26): You brought this back to me really, really well, because I have thought about contributing factors for a long time, but the way that you placed the blameless postmortem next to it actually gave it a new light. Because, yeah, an awful lot of the time it does come down to a person, and if we can make the event blameless, then it helps the psychological safety of everybody involved to understand that we're all going to make mistakes. Software is one job where everybody should understand that, because the best people make mistakes and write bugs every day.

Darren (24:58): Yep. And that idea of blamelessness is actually something that should be taken out of the final phase and extended across the whole incident, because if you start by telling someone that they have made a mistake, the first thing they are going to do is clam up and try to cover their mistake. It's human nature: they're just going to say, well, I'm sorry, I didn't mean to do this, and then shut down, and you don't get the information. So blameless postmortem, yes, but the whole process should be blameless across the board. Because people make mistakes, and people will continue to make mistakes. And the more responsibility people get, the more access they have, the larger their mistakes are going to be. We can hope that they will be less frequent, but they will still occur.

Marc (25:47): All right, Darren. Let's try this again. So what is incident response?

Darren (25:53): Incident response is exactly what it says it is. It's how to respond to security incidents, and not security notifications. So when there is a specific occurrence of an attack, or a vulnerability is exploited, that is an incident, and incident response is how we respond to it, how we fix it.

Marc (26:12): Excellent. When should we think about our incident response plans?

Darren (26:16): Yesterday. It's 2024. If you don't have an incident response plan in place at this point, what have you been doing for 10 years?

Marc (26:25): Excellent. So what are the trends in incident response?

Darren (26:29): There's a movement from preventing to responding. So it's kind of switching around to: we know someone will eventually get in, and we want to mitigate the amount of damage that is done.

Marc (26:41): Excellent. Two more questions. The four phases of incident response are...?

Darren (26:47): Well, they're going to be discovery, where we find out something is happening; analysis, where we find out what it is; the action phase, where we fix it; and then the postmortem phase.

Marc (27:00): Yeah, the blameless postmortem.

Darren (27:02): Blameless postmortem. Yes, blameless. We have to make sure it's blameless.

Marc (27:05): Excellent. All right. Can we have a secure system?

Darren (27:10): Not perfectly, no. But we can do our best. And that's all we're ever doing in security, is what we can.

Marc (27:17): All right, cool. I always learn from you, Darren, and it's always nice to have a conversation about these things. And I think it's neat how we're able to look at different parts of it in different ways and still be able to understand. Thank you a lot for today.

Darren (27:33): Thank you. I hope someday we'll get to compare, and I'll get to grill you on one of your specialist subjects, because it's starting to feel a bit one-sided. Let's do something like that next time.

Marc (27:43): All right. We'll talk about jazz. Okay. Thank you, everybody. Thank you, Darren.

Darren (27:48): Thank you, Marc.

Marc (27:49): And we'll see you at the DevOps Sauna next time. Goodbye. We'll now tell you a little bit about who we are. Hi, I'm Marc Dillon, lead consultant at Eficode in the advisory and coaching team and I specialize in enterprise transformations.

Darren (28:07): Hey, I'm Darren Richardson, security architect at Eficode. And I work to ensure the security of our managed services offerings.

Marc (28:14): If you like what you hear, please like, rate, and subscribe on your favorite podcast platform. It means the world to us.