Know your system's failure modes
Fawad Khaliq
Posted on May 14, 2021
Not long ago, in 2009, a system behavior (specifically, mode confusion) was part of the chain of events that led to the loss of Air France Flight 447. The pilots reacted to a loss of altitude by pulling back on the stick. That would have been an appropriate reaction with the automation fully enabled, since the aircraft would then have settled into a climbing configuration. However, the airplane's systems had dropped into a mode of lesser automation ("alternate law" in Airbus terms) after the airspeed sensors were blocked by ice, which allowed the pilots to put the plane into a nose-high stall from which they did not recover.
We have come a long way in building reliable software and the techniques behind it. However, systems still fail all the time. What makes some systems more prone to failure than others?
Often, we attribute failure to complexity. That's a fair answer, but experience and the evolution of software say there's more to it. After running large (in some cases, literally the largest) complex systems for more than a decade, I keep seeing one pattern behind failures: failure modes, or modes in general. When not done right, modes can make a system intrinsically unstable. Every system has failure modes, but the most common and nasty ones are introduced by bimodal behaviors.
In the book "The Better Angels of Our Nature", Steven Pinker talks about how today we may be living in the most peaceful time in our species' history, despite what the news tells us. (Highly recommended if you haven't read it.)
I know it's a cheesy, weird parallel to draw with system failures, but today we (systems operators) may be living in the most peaceful (i.e., least on-call pain) time in our species' (systems') history. That's because years of academic research have gone into this very topic.
A mode is a distinct setting within a machine interface in which the same user input produces results different from those it would produce in other settings. For example, vi has one mode for inserting text and a separate mode for entering commands (sorry, Emacs users, but I'm sure you get the point). These are fairly benign modes that you deal with every day and are mere nuisances for beginners.
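To see "same input, different result" in a few lines, here is a toy Python sketch of a modal editor. It is only loosely inspired by vi: the class name `ToyModalEditor`, the key names, and the behavior of `x` are assumptions made for illustration, not a description of how vi is implemented.

```python
class ToyModalEditor:
    """Toy modal interface (hypothetical; loosely inspired by vi)."""

    def __init__(self) -> None:
        self.mode = "command"  # the editor starts in command mode
        self.buffer = ""

    def press(self, key: str) -> None:
        if self.mode == "command":
            if key == "i":
                self.mode = "insert"            # "i" switches mode, text is untouched
            elif key == "x":
                self.buffer = self.buffer[:-1]  # "x" deletes the last character
        else:  # insert mode
            if key == "ESC":
                self.mode = "command"
            else:
                self.buffer += key              # every other key is inserted as text


editor = ToyModalEditor()
for key in ["i", "i", "x", "ESC", "x"]:
    editor.press(key)

# The first "i" changed the mode, the second "i" became text.
# The first "x" became text, the second "x" deleted it again.
print(editor.buffer)  # prints: i
```

The keys `i` and `x` each appear twice in the input, and each does something different depending on the mode the editor happens to be in.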
However, there are modes that can cause actual production downtime. You may recognize some:
- If a Kubernetes pod normally calls a cluster-local service but can fall back to an external service under a certain condition, that's a bimodal behavior.
- If you call your database every 5 minutes (happy path) but, in case of failure, retry every 100 milliseconds, that's a bimodal behavior of the system.
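To make the database example concrete, here is a small Python sketch of what that bimodal schedule looks like and how much load each mode generates per hour. The function names and the one-hour window are assumptions for illustration; only the 5-minute and 100-millisecond intervals come from the example.

```python
HAPPY_INTERVAL_S = 5 * 60  # 5 minutes between calls on the happy path
RETRY_INTERVAL_S = 0.1     # 100 milliseconds between calls after a failure


def bimodal_next_delay(last_call_failed: bool) -> float:
    """Anti-pattern: the schedule depends on whether the last call failed."""
    return RETRY_INTERVAL_S if last_call_failed else HAPPY_INTERVAL_S


def calls_in_window(window_s: float, delay_s: float) -> int:
    """Number of calls one client makes in a window at a fixed delay."""
    return round(window_s / delay_s)


ONE_HOUR_S = 3600
healthy_calls = calls_in_window(ONE_HOUR_S, bimodal_next_delay(False))
failing_calls = calls_in_window(ONE_HOUR_S, bimodal_next_delay(True))

print(healthy_calls)  # prints: 12
print(failing_calls)  # prints: 36000
```

The client is quiet while the database is healthy and loudest exactly when the database is already struggling.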
Take these failure modes seriously. Bimodal/fallback behaviors are harder to test. Because the "fallback path" or "secondary mode" runs only rarely, it becomes less tested over time. Your primary mode will become resilient, but the day the fallback behavior kicks in (with its latent issues), your system's availability will be at risk and you will have nasty outages.
Here are some alternatives to avoid bimodal behaviors in the examples I shared above:
- If a Kubernetes pod calls a cluster-local service and the service is not available, instead of falling back to an external service, fail over to a replica of your cluster-local service or improve the reliability of that service.
- If you call your database every 5 minutes on the happy path, keep the same frequency when it fails. At 100 ms, your database might receive a thundering herd of 3000x the calls, potentially triggering another set of cascading failures (a topic I will cover another day).
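The second alternative can be sketched as a polling loop with a single schedule. This is a minimal Python illustration, not a production client: `poll`, `fetch`, and the injectable `sleep` parameter are hypothetical names chosen so the loop can be exercised without actually waiting.

```python
import time
from typing import Callable, List

POLL_INTERVAL_S = 5 * 60  # one interval, used on success and on failure


def poll(
    fetch: Callable[[], object],
    rounds: int,
    sleep: Callable[[float], None] = time.sleep,
    interval_s: float = POLL_INTERVAL_S,
) -> List[bool]:
    """Call fetch() a fixed number of times at a constant interval."""
    outcomes = []
    for _ in range(rounds):
        try:
            fetch()
            outcomes.append(True)
        except Exception:
            # Record the failure, but do NOT switch to a faster retry schedule.
            outcomes.append(False)
        sleep(interval_s)  # same delay regardless of the outcome
    return outcomes


# Usage with a fake sleep that records the delays instead of waiting.
recorded_delays: List[float] = []


def always_failing_fetch() -> None:
    raise RuntimeError("database unavailable")  # simulated outage


results = poll(always_failing_fetch, rounds=3, sleep=recorded_delays.append)
print(results)          # prints: [False, False, False]
print(recorded_delays)  # prints: [300, 300, 300]
```

Even during a simulated outage the client's call rate does not change, so the database sees the same load whether it is healthy or not.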
Avoid bimodal behaviors when building systems. Know your failure modes. Fail cleanly and predictably. It's a simple concept that will bring more "peace" to running systems.