Preventing Outages: Limitations of Even the Best Observability and Monitoring Tools
jameslaneovermind
Posted on May 25, 2023
It was a Friday afternoon and we had planned to roll out a big change that we’d been working on and testing all week. We knew this was a bad idea, but we were confident! The change was related to the way the backend UNIX fleet authenticated user logins, so it should have been fairly innocuous, and we had done all the testing we possibly could. Still, there was some risk.
So we pressed the button and rolled out the change. The first results came back green, we could log into the servers, and all we needed to do was wait. As we waited for the rest of the results, the phone rang.
“Hey, nobody in the department can save PDFs anymore.”
The whole department was at a standstill because they couldn’t save PDFs. We hadn’t touched any laptops, though; how could we possibly have broken the ability to save PDFs? We started frantically looking into it, and it turned out they weren’t clicking Print -> Save as PDF as you’d expect. They had an actual printer called “PDF Printer” that they printed to instead, which we’d managed to break somehow.
We then tried the easiest things first:
- Ask if anyone knows what it is: nobody does
- Check if it exists in the CMDB: it doesn’t
- Check the wiki: no mention of it
In the end, it turned out that about 10 years ago somebody had put a physical server in a data center, and the job of that server was to pretend to be a printer. When somebody printed to it, it saved the job as a PDF, then ran a script that picked up that PDF and moved it to a mount point. It didn’t make sense to me at the time, and it still doesn’t, but that’s what we had.
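For the curious, the glue on that server probably looked something like the sketch below. This is purely a guess at the mechanism; the paths and polling interval are hypothetical, not anything we recovered from the real box.

```python
# A guess at the glue script on that server: watch the directory the virtual
# printer writes finished jobs to, and move them onto the network mount the
# department reads from. Paths and timing are hypothetical.
import shutil
import time
from pathlib import Path

SPOOL_DIR = Path("/var/spool/pdf-printer")   # where the fake printer drops PDFs
MOUNT_DIR = Path("/mnt/department-share")    # where users expect to find them

def move_finished_pdfs() -> None:
    for pdf in SPOOL_DIR.glob("*.pdf"):
        shutil.move(str(pdf), str(MOUNT_DIR / pdf.name))

if __name__ == "__main__":
    while True:
        move_finished_pdfs()
        time.sleep(5)
```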
In the end, we managed to get the “printer” working again, but not before everyone in the affected department had already gone home for the weekend without being able to finish their work for the week.
What does this story tell us about the limitations of observability and monitoring tools?
No matter how reliable your systems are or how thoroughly you monitor them, outages can and will occur. Monitoring tools are only as effective as the data points they can access. They can provide valuable insights into system performance, but they may not capture everything needed when making a change or finding a root-cause fix. A lack of data can make it difficult to pinpoint the cause of an outage, especially when the issue is complex and involves multiple systems that sit outside our mental model. These unknown unknowns can be particularly challenging to diagnose and resolve, leading to lengthy downtimes.
The typical (wrong) response: Risk Management Theatre
When an outage occurs, a common response is to add more risk management processes in an attempt to stop it from happening again. However, this increased focus on process results in a substantial increase in lead time. Puppet’s State of DevOps report found that low-performing companies that engaged heavily in risk management theatre had 440x longer lead times than high-performing organisations.
Companies with these long lead times make 46x fewer changes, meaning that each change needs to be much larger in order to keep up. Less practice and larger changes mean that they are five times more likely to experience failures, and when failures do occur, the consequences are much more severe.
The combination of larger changes, decreased frequency, and limited experience in handling such situations leads to a mean-time-to-recovery almost 100x longer than that of high-performing organisations. And remember that it was large outages that caused this in the first place, so the process feeds back on itself, making the company slower and slower.
Answer = Inputs
Observability tools that measure outputs such as metrics, logs & traces require a good mental model and a deep understanding of the application in order to interpret them. But as we’ve already seen, outages are often caused by unexpected issues outside of our own mental model. When this happens, the system’s behaviour contradicts our understanding of how it should work. This leads to confusion and requires individuals to rebuild their mental model of the system on the fly, as described in the brilliant STELLA report (Woods DD. STELLA: Report from the SNAFUcatchers Workshop on Coping With Complexity. Columbus, OH: The Ohio State University, 2017).
To address this challenge, we should shift our focus toward measuring inputs. This enables engineers to create new mental models as needed, whether during the planning stage of a change or in response to an outage. Current tools do not adequately support this type of work. When constructing a mental model, we typically rely on "primal" low-level interactions with the system, often accomplished through the command line, which demands a great deal of expertise and time. To resolve this issue, we must find a way to expedite the process of building mental models by measuring input or configuration changes instead.
If we are to solve this, we must make building mental models much faster, meaning:
- Ensuring that the configuration and current state of a system are readily accessible.
- Enabling users to easily discover the potential impact of their intended changes and what areas might be affected.
- Providing users with the means to validate that their modifications have not caused any issues downstream.
By measuring config changes (inputs) instead, we can understand context on demand and have confidence that our changes won’t have any unintended negative impacts.
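To make that concrete, here’s a minimal sketch of what “measuring inputs” looks like in practice: compare two snapshots of a system’s configuration so the change itself is the signal, rather than something inferred from output metrics after the fact. The config keys below are invented for illustration.

```python
from typing import Any

def diff_config(before: dict[str, Any], after: dict[str, Any]) -> dict[str, tuple]:
    """Return {key: (old, new)} for every key that was added, removed, or changed."""
    changes = {}
    for key in before.keys() | after.keys():
        old, new = before.get(key), after.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes

# Invented snapshots, loosely modelled on the auth change from the story
before = {"auth_backend": "local", "pam_module": "pam_unix", "print_server": "pdf-printer-01"}
after = {"auth_backend": "ldap", "pam_module": "pam_ldap", "print_server": "pdf-printer-01"}

print(diff_config(before, after))
# -> the two auth keys show up as changed; the print server does not
```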
Impact analysis
At Overmind, we’ve been building a solution that addresses these challenges by making the system's configuration and state easily accessible, empowering users to confidently make changes without the fear of things going wrong.
Open a pull request
Start by opening a Terraform pull request and Overmind will discover the dependencies of the things you’re going to change. No lengthy scanning processes or agents involved in the setup.
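For a sense of the raw material a pull-request-driven workflow starts from (this isn’t Overmind’s internals, just standard Terraform CLI output): running `terraform plan -out=plan.out` followed by `terraform show -json plan.out > plan.json` produces a machine-readable description of what a change will touch, which a small sketch like this can read.

```python
import json

with open("plan.json") as f:
    plan = json.load(f)

# Each entry in resource_changes describes one resource and the actions
# (create, update, delete, ...) the plan will take on it.
for rc in plan.get("resource_changes", []):
    actions = rc["change"]["actions"]
    if actions not in (["no-op"], ["read"]):
        print(f'{rc["address"]}: {"/".join(actions)}')
```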
Calculate blast radius
Based on what you're changing, Overmind will calculate the blast radius of the affected items. Use the graph to explore relationships and dependencies between these items; a rough sketch of the idea follows the list below.
The blast radius contains:
- What infrastructure will be affected.
- What applications rely on that infrastructure.
- What health checks those applications have.
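As an illustration of what “blast radius” means in graph terms (not Overmind’s implementation, and the dependency graph below is invented), you can start from the items being changed and walk the graph to collect everything that could be affected:

```python
from collections import deque

# Hypothetical dependency graph: edges point from an item to the things
# that depend on it (infrastructure -> application -> health check).
DEPENDENTS = {
    "aws_security_group.backend": ["aws_instance.api"],
    "aws_instance.api": ["app.checkout-service"],
    "app.checkout-service": ["healthcheck.checkout-http"],
}

def blast_radius(changed: list[str]) -> set[str]:
    """Breadth-first walk from the changed items to every downstream dependent."""
    seen, queue = set(changed), deque(changed)
    while queue:
        item = queue.popleft()
        for dependent in DEPENDENTS.get(item, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen

# Changing one security group pulls in the instance, the app, and its health check
print(blast_radius(["aws_security_group.backend"]))
```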
View diffs & validate health
Automatically get alerted if your change breaks something, including a diff of exactly what changed, how it's related, and how to change it back. Spend less time on validation and letting changes “bake”, meaning a faster time to production. A sketch of this validation loop follows the list below.
- Quickly identify which changes caused a problem.
- Compare the difference between configurations to uncover the root cause.
- Minimise downtime caused by application breaking changes.
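A minimal sketch of that validation loop, under the same invented names as above (the health check URL is made up): capture the health checks inside the change’s blast radius before applying it, re-run them afterwards, and flag anything that went from healthy to unhealthy.

```python
import urllib.request

# Health checks discovered inside the blast radius (hypothetical endpoint)
HEALTH_CHECKS = {
    "healthcheck.checkout-http": "https://checkout.internal.example.com/healthz",
}

def run_checks(checks: dict[str, str]) -> dict[str, bool]:
    results = {}
    for name, url in checks.items():
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                results[name] = resp.status == 200
        except OSError:
            results[name] = False
    return results

before = run_checks(HEALTH_CHECKS)
# ... apply the change here ...
after = run_checks(HEALTH_CHECKS)

regressions = [name for name in HEALTH_CHECKS if before[name] and not after[name]]
if regressions:
    print("This change broke:", ", ".join(regressions))
```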
Design Partner Program
After a successful early access program in which we discovered over 600,000 AWS resources and mapped 1.7 million dependencies, we are now looking for innovators to join our design partner program to help test impact analysis.
Start by sharing your impact analysis goals with the Overmind team, and see if our program is the perfect fit for you.
If you're interested in getting access before anyone else and influencing the direction of what we're building, register here.
Or find out more about Overmind and join our general waiting list here.