Learn how to foster a healthy and balanced DevOps culture with the help of blameless postmortems.
Psychological safety has been identified as the topmost feature of a successful and innovative organization. At the same time, we need to learn from failure and prevent recurrence of mistakes. These two practices seem to contradict each other, but is there a way to achieve them both?
The postmortem philosophy
Whenever an incident occurs, typically the first responder’s priorities are to:
- Mitigate the incident
- Fix the underlying issue
- Ensure services return to their normal operating conditions
Next: ensure the system doesn’t break in the same way again. Unless there’s a formalized process of learning from these incidents in place, they may reoccur. Left unchecked, incidents can multiply in complexity or even cascade. This can overwhelm a system and eventually affect users.
Postmortem culture is applicable beyond large-scale or technical problems. As long as you understand the basic framework, you can apply it to a variety of failures and use it for:
- Personal growth
- The growth of others
- Encouraging accountability
- Sharing best practices
- Documenting facts for the long term
It can also be a very powerful tool for leadership to be transparent about unpopular decisions and document the rationale behind their choices. Introducing such a culture can seem intimidating at first, but you can implement the change incrementally and gradually fine-tune the process based on your organization’s needs.
Tip to get the most out of this blog post: think of a high-impact failure you experienced recently and how the situation was handled. Then apply the principles and best practices of a blameless postmortem culture to consider what could have been done differently.
What is a postmortem?
A postmortem is a written record of an incident and:
- Its impact
- The actions taken to mitigate or resolve it
- The root cause(s)
- (Very importantly) the follow-up actions to prevent the incident from reoccurring
Here’s a great example for inspiration.
Your team needs to define postmortem criteria before an incident occurs so that everyone knows when a postmortem is necessary. In addition to these objective triggers, you may also request a postmortem from another team if you think it is warranted under the same criteria. Over the years, postmortems have become such an innate component of the Google SRE culture that they are organically expected and enthusiastically anticipated after any significant undesirable event. They are also seen as an opportunity to learn from each other.
When should you write a postmortem?
It depends on your needs and circumstances. A one-criteria-fits-all approach doesn’t really work here. For example, a team whose primary responsibility is ensuring the reliability of a website’s e-commerce infrastructure will have distinctly different success and failure metrics than a team whose primary responsibility is product development. The two teams will also have a distinct set of personalities, which changes the nuances of incident management.
Here are some example scenarios which could warrant a postmortem:
- Business reasons: more than x users impacted, $y lost in revenue, loss of user trust due to an error, etc.
- Product reasons: Latest canary reveals z% regression in a metric, a risky change pushed during production freeze, an unusually complex remediation, etc.
- Opportunity reasons: An outage revealed some repeated problem or an opportunity for systemic improvement, where there's a high value in sharing what was learned.
- People reasons: Disruptive reorganization impacted people’s careers, improper project management overloaded people, etc.
What matters is that these criteria are defined in the first place and periodically revisited.
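One way to make such criteria explicit is to write them down as data that can be reviewed and revised. The following is a minimal sketch, not a prescribed process: the `Incident` and `PostmortemCriteria` types, their field names, and every threshold value are hypothetical placeholders that a team would replace with its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Facts about an incident (hypothetical fields for illustration)."""
    users_impacted: int
    revenue_lost_usd: float
    metric_regression_pct: float
    pushed_during_freeze: bool


@dataclass(frozen=True)
class PostmortemCriteria:
    """Placeholder thresholds; each team defines and periodically revisits its own."""
    users_impacted_threshold: int = 1000
    revenue_lost_usd_threshold: float = 10_000.0
    metric_regression_pct_threshold: float = 5.0


def postmortem_triggers(incident: Incident, criteria: PostmortemCriteria) -> list[str]:
    """Return the criteria the incident meets; an empty list means none were met."""
    reasons = []
    if incident.users_impacted >= criteria.users_impacted_threshold:
        reasons.append("users impacted")
    if incident.revenue_lost_usd >= criteria.revenue_lost_usd_threshold:
        reasons.append("revenue lost")
    if incident.metric_regression_pct >= criteria.metric_regression_pct_threshold:
        reasons.append("metric regression")
    if incident.pushed_during_freeze:
        reasons.append("change pushed during production freeze")
    return reasons


incident = Incident(
    users_impacted=2500,
    revenue_lost_usd=300.0,
    metric_regression_pct=1.2,
    pushed_during_freeze=True,
)
print(postmortem_triggers(incident, PostmortemCriteria()))
# → ['users impacted', 'change pushed during production freeze']
```

Criteria such as loss of user trust or overloaded teams do not reduce to a number, so a check like this can only complement, not replace, the judgment of the people involved.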
Another idea is to group multiple small incidents of a similar nature into one postmortem for resolution, rather than writing one for each smaller incident.
Writing postmortems should be looked at as a learning opportunity for the entire company and not as a punishment. It’s definitely worth acknowledging that a good postmortem process does present an inherent cost in terms of time and/or effort, so you need to be deliberate in choosing when to write one.
Writing a postmortem just for the sake of documenting is not enough. In fact, if not written well, it can be counterproductive to your culture. It could lead to an atmosphere in which incidents and problems are swept under the rug, leading to greater risk for the organization. So to make the most of postmortem culture, it’s absolutely imperative to keep postmortems blameless.
A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. It focuses on 'what' went wrong (i.e. the systems, processes, etc.) instead of 'who' was wrong (i.e. the people).
As humans, we often find accepting failure to be very difficult. Sometimes our mistakes can cost our company a lot of money, and accepting that something we did caused it can be very embarrassing and distressing. In a very toxic organization, publicly acknowledging mistakes can also cost people their careers. All of these fears will make it almost impossible for a postmortem to be valuable, fact-based and objective.
The best practice to combat this fear is to keep your postmortems blameless. But why?
We are humans - we fear public humiliation
Speaking up under normal circumstances is hard enough. Speaking up under immense pressure (as when there’s been a major incident) is much harder, and people generally tend to avoid the spotlight for fear of being ridiculed. Ironically, a crisis demands that someone step up, think differently and then speak up.
Postmortems are artefacts
So now we’ve increased the complexity of the problem. Not only does one have to risk isolating themselves in difficult circumstances, they also fear being documented in history! What if I make a mistake? Everyone will know! Will I be fired? Will I not get promoted? Will my future employees make fun of me? These are all valid concerns. Fundamentally, the culture should encourage postmortems as valued contributions by individuals immediately and in the long term.
You will curb innovation and autonomy
Dreaming big and working on revolutionary ideas bring a certain amount of associated risk, which may result in failures. If a person doesn’t feel psychologically safe taking calculated risks in their environment, they may never act on those ideas, and innovation will stagnate. A healthy postmortem culture, by contrast, helps individuals anticipate and manage that calculated risk. Without it, individuals lose autonomy: they tend to just follow instructions from their managers or someone else (another form of safety) instead of being creative or using their own good judgment.
It may seem like a good idea to highlight individuals while describing an outage in a postmortem. Instinctively, it feels like assigning ownership to someone, which may then motivate the individual to take responsibility. But the big risk of doing so is that individuals become risk-averse because they fear public humiliation. This can lead to people covering up facts and undermining the transparency that could be critical to understanding an issue and preventing it from recurring.
When mistakes are hidden, fixes to systemic issues are harder, and the problems are more likely to recur.
Also, blaming humans tends to result in "fire human" as an action item. Even if we overlook for a moment that this may not be the right thing to do, it still does not prevent recurrence. If the system was set up in a way that enabled the first human to make the mistake, there’s a higher probability of a less experienced human repeating it. Blameful behavior is also detrimental from a business perspective.
Blameless postmortem in action - an example
Here’s an example that was noted as the root cause of a huge hypothetical incident.
“dylanone@ did not bother to set up alerting for our storage cluster or check our hard drives manually in case of doubt. Of course we ran out of space and this disaster ensued! It took hours to fix the service because annatwo@ did not know how to recognize storage exhaustion and restarted the wrong systems.”
What's wrong with the way this is written?
Not only is this example highly blameful, with individuals obviously being scapegoated, it is also unhelpfully dramatic. Phrases such as ‘disaster’ or ‘of course this happened’ add absolutely no value to the postmortem document. Analyses of the actions of individuals also need to be placed in context: the state of the response effort, what information folks had at the time, what playbooks said to expect, etc.
Blameful behavior erodes psychological safety and makes an organization’s culture toxic. You can be sure that dylanone@ is not going to speak up when another outage happens, and the company may lose out on valuable information. They may end up at odds with their team or with annatwo@, who in turn may be either upset at dylanone@ for creating the situation or anticipating blame themselves for being slow to fix it. Worse, imagine if one of these individuals is the tech lead or manager of the other. Blamelessness makes it possible for leaders to admit fault and protects individual contributors (ICs) from being blamed for the assumed infallibility of their leaders. Both individuals should feel safe discussing what they did and didn’t know that led to the situation.
A better way to express the gravity of the situation would be to gather facts and document them objectively, with metrics or reflection on the context of the surrounding systems/procedures.
Here is one example of a more objective, blameless version:
“Standard storage cluster management does not generate free space monitoring by default. The hard drives in one of the trading storage clusters ran out of space and this was not noticed due to a lack of alerting. This led to trades from that cluster being redirected to other trading clusters which were also almost full; standard error-code mapping, dashboards and alert playbooks made this harder to assess, and required folk knowledge of trading service failovers to diagnose.”
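A blameless write-up like this points at a system-level follow-up action, such as adding the missing free-space alert, rather than at a person. As a rough illustration only, here is a minimal sketch of such a check. The function name `free_space_alert` and the 10% threshold are assumptions made for this example, not part of the incident described, and a real cluster would rely on its own monitoring stack rather than a local script.

```python
import shutil


def free_space_alert(total_bytes, free_bytes, min_free_fraction=0.10):
    """Return an alert message if free space is below the threshold, else None.

    The 10% default is a hypothetical threshold chosen for illustration.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_fraction = free_bytes / total_bytes
    if free_fraction < min_free_fraction:
        return (
            f"free space at {free_fraction:.0%}, "
            f"below the {min_free_fraction:.0%} threshold"
        )
    return None


# Check the volume that holds the current working directory.
usage = shutil.disk_usage(".")
print(free_space_alert(usage.total, usage.free) or "free space OK")
```

Keeping the threshold logic separate from the call that reads disk usage makes the rule itself easy to review and test, which fits the focus on systems and processes over individuals.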
- The cost of failure is education.
- Keep your postmortems blameless: Concentrate on the system, not the people.
- When written well, acted upon, and widely shared, blameless postmortems can be a very effective tool for driving positive cultural changes and preventing recurring errors.
Published: May 7, 2021