Don't count your incidents, make your incidents count
incident.io
Posted on May 3, 2022
"We can't have more than two major incidents per quarter."
It happens all the time: senior folks at your company feel like things are out of control, and they attempt to improve the situation by counting how many incidents you're having.
And it's not an unreasonable approach --- on the surface, the number of incidents seems like a great measure for how well things are going.
Whilst setting targets might work in some organisations, it's worth considering whether they provide the signal you expect and whether the implications of doing so have been properly considered. We've had this conversation more times than we can count, so here are a few tips on how to navigate the situation.
Fewer incidents doesn't mean things are better
The absence of incidents doesn't mean your systems are reliable or things are safe. I've worked in teams where we've had months of smooth sailing, followed by intense periods of seemingly everything being on fire. Nothing materially changed between the two periods. A deeper analysis showed that the many contributing factors were present throughout. We just got lucky: the perfect storm of latent errors and enabling conditions didn't occur in the first period.
More incidents is no bad thing
Incidents aren't an evil we need to stamp out. In many cases, they're the cost of doing business. We shouldn't encourage failure, but despite our best efforts to maintain high levels of service, surprises will catch us out. When done right, a healthy culture of declaring incidents can be a superpower. I want my teams to feel comfortable sharing when things may be going wrong, be excellent at responding when they do, and democratise knowledge and expertise after the fact --- this is exactly why we build incident.io.
Targets can drive the wrong behaviour
I've seen people arguing about whether something is or isn't an incident because they don't want to reset the "days since incident" counter. Equally, I've seen engineers waste time in an incident trying to justify a minor severity rating, rather than major, because they don't want to breach the company target.
As stated in Goodhart's Law, "when a measure becomes a target, it ceases to be a good measure". If you set a low target with severe consequences, you'll probably meet it, whether that means suppressing reporting, arguing over labels, or some other counterproductive measure.
Targeted or not, you're not in control
The vast majority of incidents are outside of our control. At best, a "no incident" goal is un-actionable and ignored. At worst, it can alter behaviour to the detriment of the organisation.
If you were set a target of not spilling a drink for a year, what would you do differently? Nobody sets out to spill a drink, and when it happens it's not because you're careless, it's just random chance sprinkled with misfortune. Pick a better target, like not running with drinks.
There are better alternatives to counting incidents
So you've convinced your leadership team it might be a bad idea, but to seal the deal they're after an alternative. What can you offer in return?
The best advice is to understand their motivations for the goal. For example, is there a lack of trust between leadership and engineering? Is that fuelled by them seeing incidents, but not seeing the analysis and follow-up that happens afterwards? Perhaps a target around the number of incidents which didn't have a debrief would help.
Whatever the motivation, here are a few options you might want to consider.
Measure what you actually care about
You don't really care about the number of incidents. You care about what that means; whether it's lost revenue, customer satisfaction, or the service you provide --- incidents are just a useful proxy.
Instead, measure the thing you actually care about, like service uptime, the number of times PII was shared, or the number of failed payments. These are tangible measures that can be targeted and improved.
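To make that concrete, here's a minimal sketch of computing two of those measures directly rather than inferring them from an incident count. The function names, the 30-day window and the figures in the usage example are all illustrative assumptions, not something from a real system.

```python
from datetime import timedelta


def availability(period: timedelta, outages: list[timedelta]) -> float:
    """Fraction of the period the service was up (hypothetical helper)."""
    total_down = sum(outages, timedelta())
    return 1 - total_down / period


def failure_rate(failed: int, attempted: int) -> float:
    """Share of attempts that failed, e.g. failed payments (hypothetical helper)."""
    return failed / attempted if attempted else 0.0


if __name__ == "__main__":
    # Illustrative numbers only: two outages in a 30-day window.
    month = timedelta(days=30)
    outages = [timedelta(minutes=30), timedelta(minutes=13, seconds=12)]
    print(f"Uptime: {availability(month, outages):.3%}")
    print(f"Failed payments: {failure_rate(42, 120_000):.3%}")
```

Two incidents with very different durations count the same in an incident tally, but they move a number like this very differently, which is closer to what customers actually experience.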
Measure the value you get from incidents
If you can accept that incidents are unavoidable surprises, why not measure how well your org is using them to improve?
We suggest writing debrief documents that are used to educate, holding sessions to discuss them, and ensuring you're seeing follow-up actions through to completion. If you do all of the above, you're likely getting your money's worth. (Pro tip: you can generate incident timelines, post-mortem documents and follow-up actions with incident.io in one click, directly in Slack.)
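If you want to put numbers on this, one option is to track how many incidents got a debrief and how many follow-up actions were completed. The sketch below assumes a hypothetical `Incident` record with made-up fields; it isn't the incident.io data model or API, just an illustration of the idea.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical record; field names are assumptions for illustration.
    name: str
    debriefed: bool
    follow_ups_total: int = 0
    follow_ups_done: int = 0


def learning_summary(incidents: list[Incident]) -> dict[str, float]:
    """Return debrief coverage and follow-up completion as fractions."""
    total_actions = sum(i.follow_ups_total for i in incidents)
    done_actions = sum(i.follow_ups_done for i in incidents)
    return {
        "debrief_rate": (
            sum(i.debriefed for i in incidents) / len(incidents)
            if incidents else 0.0
        ),
        "follow_up_completion": (
            done_actions / total_actions if total_actions else 0.0
        ),
    }


if __name__ == "__main__":
    # Made-up example data.
    quarter = [
        Incident("Checkout latency", debriefed=True, follow_ups_total=4, follow_ups_done=3),
        Incident("Login errors", debriefed=False, follow_ups_total=2, follow_ups_done=0),
    ]
    print(learning_summary(quarter))
```

A target on these numbers rewards declaring incidents and learning from them, rather than hiding them.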
Give them the metrics they want, with the context they need
If you can't convince people not to target the number of incidents, why not provide the metrics they want but with the context they need to understand the full picture?
Rather than "we had 5 major incidents", share the contributing factors and risks, the commonalities and differences, and what's being done to improve. It's relatively easy to take the heat out of a number by providing some qualitative context. As it happens, there's a great post from the Learning from Incidents blog about this here.
If you've got any pro tips of your own, we'd love to hear them! Send us an email at hello@incident.io, or find us on Twitter at @incident_io.