Every so often an unexpected event comes along that stretches us in ways we’d never prepared for or even thought of. For us, that event is the covid-19 pandemic. It is testing nation states, the global economy, and businesses of all shapes and sizes. Software development organizations are no different.
With everyone currently mandated to work from home, our systems and processes are under a new type of stress. Overnight we have had to transition from working side by side in offices to working remotely and alone. Systemic points of failure in our ways of working, previously hiding in plain sight, are now exposed.
In the office everything was readily accessible. Suddenly, we have issues accessing critical information or systems including project documentation, source code, test cases, and the test environment. Stop and think for a moment about how these are documented and shared in your organization. If you had to access these things right now - how would you go about it? And what about everyone else on the team? Do they have access?
Antifragility v. robustness
If you don’t have systems in place for dealing with disruptions, an unexpected event like the covid-19 pandemic can grind your software development to a halt. This brings us to the question of antifragility and robustness.
A system that is poorly documented and maintained is fragile in the same way as a wine glass - both can tolerate only a small amount of stress. Organizations that might have been able to deal with the stress of having one or two workers out of action are now having to deal with losing many more than that. The fragile system breaks.
This can be true even for seemingly “robust” systems, because antifragility and robustness are not synonymous. While a robust system can tolerate more stress than a fragile one, both have a fixed upper limit for stress. An antifragile system, on the other hand, becomes stronger as it is exposed to increasing amounts of stress. That doesn't mean that we don't want a robust system - of course we do! - but what we really want is antifragility.
DevOps prepares your organization to deal with stressful events, which is a vital aspect of building antifragility. Every time you successfully work your way through a crisis and use the lessons learned to improve your organization or systems, your level of antifragility increases. Practicing DevOps helps you to build a system that is both robust and able to grow stronger by responding to crises.
But what does this look like in practice? How do we make an existing development organization more antifragile with DevOps? There are three key aspects to consider: the software production line, the people who operate it, and the monitoring of its performance.
1. Production Line
A centralized platform where all of the essential tools are known to everyone on the team is a prerequisite to antifragility. If an individual team member becomes unavailable it is easy to pick up their work because everyone is familiar with the tooling. This makes it very easy to reallocate resources.
Your organization needs to have a software production line that enables document generation as a necessary by-product. All of the requirements and workflow should be contained in documentation and ticket management solutions. The source code, environments, and test cases should all be in version control and the binaries should be in separate repositories.
If every team is using its own tools and following its own practices, it is really hard to build antifragility across the whole business. A consistent, well-built assembly line is antifragile by design, and it’s the first thing you should build. When every team member has the ability to change and improve the production line any time there is a disruption, each change makes the system better able to withstand stress when future disruptions occur.
2. People
Once you have a centralized assembly line you need to make sure everyone is able to access everything they need. You also want to make sure that it is optimized to make the lives of the developers as easy as possible.
First off, this means that the requirements and workflow for any project have to make use of the respective tools available in the assembly line. There’s no sense in setting up a centralized platform only to then allow individual projects to introduce new tools that aren’t visible to the rest of the organization. All of a sudden you’ve undone a lot of good work, and decreased your antifragility. Now, in some cases, e.g. test and deployment automation, flexibility in tooling is necessary. The challenge is to find the right balance between shared solutions and those specific to certain teams.
Consistency in tools has to be maintained, and a high level of automation needs to be applied to how they are used. You should aim to have all critical project documentation stored “as code” in version control and openly available to the whole team. Easily scalable cloud environments should be available, along with test automation, to facilitate continuous delivery. The automation code increases robustness by serving as its own documentation, reducing the need for separate written documentation, which is often neglected.
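To illustrate how automation code can double as documentation, here is a minimal Python sketch. It assumes a hypothetical three-step pipeline; the `Step` class, the step names, and the `make` commands are invented for this example and do not refer to any particular tool. The point is only that a description generated from the pipeline definition cannot drift out of date the way a separately written document can.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One stage of a hypothetical delivery pipeline."""
    name: str
    command: str
    purpose: str


# Hypothetical pipeline definition; names and commands are illustrative only.
PIPELINE = [
    Step("build", "make build", "Compile the application and package the binaries."),
    Step("test", "make test", "Run the automated test suite."),
    Step("deploy", "make deploy", "Roll the release out to the cloud environment."),
]


def render_docs(steps):
    """Generate plain-text documentation from the pipeline definition."""
    lines = ["Delivery pipeline"]
    for number, step in enumerate(steps, start=1):
        lines.append(f"{number}. {step.name}: {step.purpose} (runs `{step.command}`)")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_docs(PIPELINE))
```

Because the definition lives in version control alongside the source code, anyone picking up the project can read both the steps and the reasons behind them in one place.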
High visibility should also be a feature - not just of your own development, but of everyone else’s too. Status information should be automatically collected from the assembly line without the need for manual reporting. Many widely used CI servers and issue trackers offer this out of the box.
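As a rough sketch of what automatic status collection might look like, the Python snippet below summarizes build results per project. The `BUILDS` records and their fields are assumptions made for the example; in practice the data would come from whatever your own build tooling exposes.

```python
from collections import Counter

# Hypothetical build records, shaped the way a build server might report them.
BUILDS = [
    {"project": "billing", "status": "passed"},
    {"project": "billing", "status": "failed"},
    {"project": "search", "status": "passed"},
    {"project": "search", "status": "passed"},
]


def summarize(builds):
    """Return a {project: Counter of statuses} mapping for the build records."""
    summary = {}
    for build in builds:
        summary.setdefault(build["project"], Counter())[build["status"]] += 1
    return summary


def format_report(summary):
    """Render one line per project, sorted by project name."""
    lines = []
    for project in sorted(summary):
        counts = summary[project]
        total = sum(counts.values())
        lines.append(f"{project}: {counts['passed']}/{total} builds passed")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_report(summarize(BUILDS)))
```

Nobody has to write a status report by hand: the report is a by-product of work that was happening anyway.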
By making life as easy as possible for every member on the team you are increasing your antifragility. Making your assembly line easily accessible while automating as many manual tasks as possible makes it easy to adapt when a crisis appears. And whenever disruptions do happen the organization can consider course correction and further improve their stress tolerance.
3. Monitoring
Building antifragility requires continuous improvement in the way you produce software. But where should you direct your improvement efforts? How do you know where the potential weak points are? This is why you need monitoring, the third aspect of DevOps you should use to make your system more resilient.
It’s important to monitor how the software is being built, but it’s just as essential to monitor how it is deployed and how it is actually used in production. Monitoring development allows you to see which projects and individuals are adhering to the agreed ways of working. It will also give you insights into emerging good habits and best practices that can be applied to other projects. When every project adopts the strengths of all the others, they all become equally strong. That addresses a lot of individual weaknesses across the entire organization and boosts antifragility.
Monitoring in Ops works as an advance warning system by identifying problems before they get out of control. It is obviously much better to spot problems yourself than to have them pointed out by users. Your deployments should be integrated with monitoring tools and log analysis to give you extensive information on the state of the product. With adequate monitoring you can often pinpoint the root cause of a series of problems instead of merely treating the symptoms of a single problem. Again, this contributes greatly to your antifragility.
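One simple way to look for a shared root cause is to group error messages and see which ones are reported by several components at once. The Python sketch below assumes a made-up log format (`<LEVEL> <component>: <message>`) and invented sample lines; real log analysis tooling does far more, so treat this purely as an illustration of the idea.

```python
import re
from collections import Counter

# Hypothetical log lines in an assumed "<LEVEL> <component>: <message>" format.
LOG_LINES = [
    "ERROR checkout: timeout calling payments-db",
    "INFO search: query served in 12ms",
    "ERROR invoices: timeout calling payments-db",
    "ERROR checkout: timeout calling payments-db",
    "ERROR search: index shard unavailable",
]

ERROR_PATTERN = re.compile(r"^ERROR (?P<component>[\w-]+): (?P<message>.+)$")


def group_errors(lines):
    """Count error messages and record which components reported each one."""
    counts = Counter()
    components = {}
    for line in lines:
        match = ERROR_PATTERN.match(line)
        if match is None:
            continue  # Skip non-error lines and anything in an unexpected format.
        message = match["message"]
        counts[message] += 1
        components.setdefault(message, set()).add(match["component"])
    return counts, components


def likely_shared_causes(lines, min_components=2):
    """Return errors reported by at least `min_components` different components."""
    counts, components = group_errors(lines)
    return [
        (message, count, sorted(components[message]))
        for message, count in counts.most_common()
        if len(components[message]) >= min_components
    ]


if __name__ == "__main__":
    for message, count, sources in likely_shared_causes(LOG_LINES):
        print(f"{message}: {count} occurrences across {', '.join(sources)}")
```

Here the same timeout shows up in two different components, which points to the dependency they share, not to either component on its own.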
In post-production you should monitor the behavior of users when they receive new features. Does the new feature increase traffic, conversions, and ultimately revenue? Data gathered in post-production can be used alongside design-led thinking, A/B testing, and behavioral psychology to optimize future releases.
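For the A/B testing part, a common way to judge whether a new feature really changed the conversion rate is a two-proportion z-test. The Python sketch below is one possible approach, not a prescribed method, and the visitor and conversion figures in the usage example are invented.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a: conversions and visitors for the control version (A).
    conv_b / n_b: conversions and visitors for the new feature (B).
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the assumption that A and B convert equally well.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Invented figures: 200 of 5000 visitors converted on A, 250 of 5000 on B.
    z, p_value = two_proportion_z_test(200, 5000, 250, 5000)
    print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference is unlikely to be noise, but it says nothing about why users behaved differently. That is where the design and behavioral insights come in.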
How antifragile is your organization?
The unexpected nature of the current covid-19 outbreak means many organizations are currently discovering just how antifragile they are in real time. And those who come through the crisis will be well placed to respond to future events. In many ways this is the ultimate stress test and an excellent - if unwelcome - opportunity for learning.
Building antifragility is about how we respond to crises. By centralizing the core tools in your software assembly line and using an accessible, universal set of tools you are already on the way to becoming antifragile. However, it is a mistake to think that a centralized system on its own solves every problem. Without dedicated maintenance a centralized system will contribute negatively to antifragility because the entire organization is dependent on a single, poorly maintained system. It is also the case that exceptions need to be made for some tool selections. The key is to be flexible when the occasion demands.
Culture is also vitally important. Empowering your team members and encouraging continuous improvement habits means your whole system gains a little more antifragility every time there’s even a minor issue.