Master is the new Prod, Devs are the new Ops
Bryan Lee
Posted on April 2, 2020
Written by Borja Burgos
Operational concerns of shipping software and keeping production up and running have been largely automated. Ultimately, even when things go wrong, modern-day monitoring and observability tools allow engineers to triage and fix nasty bugs faster than ever before. In part due to this, engineering teams and their budgets are shifting left. With the fear of downtime in the rearview mirror and operational challenges largely at bay, businesses are investing heavily in development to increase the quality and speed at which business value is delivered. The biggest bottleneck? Your failing master branch.
The Ops Wayback Machine
The year is 2004. It’s time for deployment. You’re confident about the software you’ve written. You clear your mind, get one last sip of coffee, and get ready to deploy. Before you proceed, you open up terminals, many terminals, and tail every log file on every server that could possibly be affected. You have business metrics up and running on your second monitor. Next to it, there are infrastructure and application-level metrics. You hit the return key and proceed to deploy the latest release to a single server. Now you watch.
You scan rapidly across every terminal on your 19" CRT monitors to identify patterns and look for discrepancies: are there any errors in the logs? Has the conversion rate changed? What's the load on the server? Was there a change in disk or network I/O? You wait a few minutes; if everything looks good, you proceed with a second server, else you roll back to the previous release as fast as humanly possible, hoping that'll fix whatever it is that broke.
While the excerpt above may sound ridiculous in this day and age, that’s how many deployments were done in the good old days of the internet after the Dot-com bubble. Then came automation, hand-in-hand with the proliferation of “DevOps” and “microservices”. These practices were born from the ever-increasing speed and complexity of the applications being developed, a consequence of businesses’ competitive desires to deliver more value to customers, faster. Businesses could no longer afford to ship new features every 6 months; a slow release cycle represented an existential threat.
But businesses weren’t always concerned with shipping new features at all costs. If anything, that wasn’t even a priority back in the early 2000s. The biggest fear of any company running software on the internet has always been downtime (ok, maybe the second biggest fear, with a security breach being at the top). And for this reason, among others, ops teams have always had sizable budgets for tech companies to sell into. After all, anything that minimizes downtime is likely cheaper than downtime itself.
Fast-forward to today and you'll see a whole new world. A world in which high-performing engineering organizations run as close to a fully automated operation as possible. While every team does things a little bit differently, the sequence typically looks like this:
The moment a developer's code is merged into master, a continuous integration (CI) job is triggered to build and test the application.
Upon success, a continuous delivery process deploys the application to production. Oftentimes this automated deployment is done in a minimal fashion, just a few nodes at a time, in what's known as a canary deployment or release.
In the meantime, the system, equipped with the knowledge of thousands of previous deployments, automatically performs multi-dimensional pattern matching to ensure no regressions have been introduced.
At a certain degree of confidence, the system proceeds to automatically update the remaining nodes.
And if there are any issues during deployment, rollbacks are automated, of course.
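As a rough illustration of that canary flow, here is a minimal Python sketch. The `deploy_to`, `rollback`, and `error_rate` helpers are hypothetical stand-ins for a real deployment system and metrics backend, and the single error-rate check stands in for the multi-dimensional pattern matching described above.

```python
import random
import time

# Hypothetical stand-ins for a real deployment system and metrics backend.
def deploy_to(nodes, version):
    print(f"deploying {version} to {len(nodes)} node(s)")

def rollback(nodes, version):
    print(f"rolling back {len(nodes)} node(s) to {version}")

def error_rate(nodes):
    return random.uniform(0.0, 0.02)  # placeholder metric, not a real measurement

def canary_release(all_nodes, new_version, previous_version,
                   canary_size=2, baseline=0.01, soak_seconds=5):
    """Deploy to a small canary group, check a health metric, then roll forward or back."""
    canary, rest = all_nodes[:canary_size], all_nodes[canary_size:]

    deploy_to(canary, new_version)
    time.sleep(soak_seconds)            # let the canary serve real traffic

    if error_rate(canary) > baseline:   # real systems compare many metrics against history
        rollback(canary, previous_version)
        return False

    deploy_to(rest, new_version)        # confidence reached: update the remaining nodes
    return True

# Example run with made-up node names and versions.
canary_release([f"node-{i}" for i in range(10)], "v2.1.0", "v2.0.9", soak_seconds=1)
```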
Automation doesn’t necessarily stop with the deployment and release of an application. More and more operational tasks are being automated. It’s now possible, and not uncommon, to automate redundant tasks, like rotating security keys or fixing vulnerabilities the moment a patch is made available. And it’s been many years since the introduction of automatic scaling to handle spikes in CPU or memory utilization, to improve response time and user-experience, or even to take advantage of discounted preemptible/spot instances from IaaS providers.
But, as anyone who has ever had to manage systems will be quick to tell you: no amount of testing or automation will ever guarantee the desired availability, reliability, and durability of web applications. Fortunately, for those times when shit ultimately hits the fan (and it undoubtedly will), there is monitoring to help us find the root cause and quickly fix it. Modern-day monitoring tooling — nowadays marketed as "observability" — offers fast multi-dimensional (metrics, logs, traces) analysis with which to quickly debug those pesky unknown-unknowns that affect production systems. Thanks to observability, in just a handful of minutes, high-performing engineering organizations can turn "degraded performance", and other problematic production regressions, back into business as usual.
The amount and complexity of the data that an engineer can rapidly process using a modern observability tool is truly astounding. In just a few keystrokes you can identify the specific combination of device id and version number that results in that troublesome backend exception, or which API endpoint is slowest for requests originating from Kentucky and why, or identify the root cause behind that seemingly random spike in memory consumption that occurs on the third Sunday of every month.
While this level of automation and visibility isn't achieved overnight, we're unequivocally headed this way. And it is the right path forward, as demonstrated by the adoption of these tools and processes by the world's most innovative tech companies. And so, with operational concerns at ease, what's next? Business value! And what's the biggest bottleneck we're currently facing? Failing to keep the master branch green (i.e. healthy/passing CI). Let me explain.
Keeping master green
In terms of development, testing, and continuous integration, the closest software engineering concept to "keeping production up" is "keeping master green", which essentially means master is always deployable.
This is something that's likely to strike a chord with most software developers out there. It makes sense, after all; if teams are going to cut releases from master, then master must be ready to run in production. Unlike years ago, when releases were cut few and far between, the adoption of automation (CI) and DevOps practices has resulted in development teams shipping new software at a much faster rate. So fast, in fact, that high-performing engineering organizations are able to take this practice to its extreme by automatically releasing any and every commit that gets merged to master — resulting in hundreds, sometimes even thousands, of production deployments on a daily basis. It's quite impressive.
But you might be left wondering: if you're not continuously deploying master, and instead you're doing so once a week, or maybe once a month, why bother keeping master green? The short answer is: to increase developer productivity, to drive business value, and to decrease infrastructure costs. Even if you're not shipping every commit to master, tracking down and rolling back a faulty change is a tedious and error-prone task that more often than not requires human intervention. But it doesn't stop there: a red (broken) master branch introduces the following problems:
A red master leads to delayed feature rollouts, which themselves lead to a delay or decrease in business value and potential monetary loss. Under the assumption that CI is functioning as expected, breaking master means that a faulty or buggy code commit needs to be detected, rolled back, and debugged. And for many companies, the cost of delaying the release of a new feature, or [security] patch, has a direct correlation to a decrease in revenue.
A broken (red) master branch has a cascading negative effect that also hurts developer productivity. In most engineering organizations, new features or bug fixes are likely to start as a branch of master. With developers branching off a broken master branch, they might experience local build and test failures or end up working on code that is later removed or modified when a commit is rolled back.
A broken/failing build is also money wasted. Yes, automated builds and tests are performed precisely to catch errors, many of which are impractical to catch any other way. But keep in mind that for every failed build, there’s at least another build (often more) that needs to run to ensure the rollback works. With engineering teams merging thousands of commits every day, build infrastructure costs can no longer be disregarded — at some organizations, CI infrastructure costs already exceed those of production infrastructure.
Convinced of the perils of a red master branch, you may ask yourself, what can I do to keep things green? There are many different strategies to reduce the number of times that master is broken, and when it breaks, how often it stays broken. From Google’s presubmit infrastructure and Chromium’s Sheriff, to Uber’s “evergreen” SubmitQueue, there’s no doubt that the world’s highest performing software organizations understand the benefits of keeping master green.
For those that aren't dealing with the scale of Google, Facebook, and others their size, the most widely established and easily automated approach is simply to build and test branches before merging to master. Easy, huh? Not really. While this often works for relatively simple, monolithic codebases, the approach falls short when it comes to testing the microservice applications of today. Given the inherent distributed complexity of microservice applications, and the rate at which developers change the codebase, it is often impractical (i.e. too many builds that would take too long) to run every build, let alone the full integration and end-to-end test suites, on every commit for every branch for every service in the system. Due to this limitation, branch builds are often limited in the scope of their testing. Unfortunately, after the changes are merged and the complete test suite is executed, it is common for this approach to result in test failures. And we're back to square one: master is red. So what can you do?
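One common mitigation is to build and test only what a change actually touches. The sketch below illustrates the idea in Python, assuming a hypothetical, hard-coded mapping from source directories to services; real systems derive this mapping from build graphs or runtime coverage data rather than a static table.

```python
# Simplified change-based test selection with a hypothetical directory-to-service map.
SERVICE_OWNERS = {
    "billing/": "billing-service",
    "checkout/": "checkout-service",
    "shared/": None,  # shared code: fall back to testing everything
}

def services_to_test(changed_files, all_services):
    selected = set()
    for path in changed_files:
        for prefix, service in SERVICE_OWNERS.items():
            if path.startswith(prefix):
                if service is None:
                    return set(all_services)  # shared change: run the full suite
                selected.add(service)
    return selected

# Example: a change limited to billing code only triggers the billing build.
print(services_to_test(["billing/invoice.py"],
                       ["billing-service", "checkout-service"]))
```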
There’s been an outage in master
At Undefined Labs, we spend a lot of our time meeting and interviewing software engineering teams of all sizes and maturity levels. These teams are developing systems and applications with varying degrees of complexity, from the simplest mobile applications to some of the largest distributed systems servicing millions of customers worldwide. We’ve seen an interesting new pattern emerge across several software engineering organizations. This pattern can be summarized as:
“Treat a failure (i.e. automated CI build failure) in your master branch as you would treat a production outage.”
This is not only exciting but indicative of how engineering organizations continue shifting left, further and earlier into the development lifecycle.
Most software teams already aim to “keep master green”, so what’s the difference? Traditionally, keeping master green has been on a best-effort basis, with master often failing over multiple builds across different commits, and days going by before returning to green.
As engineering teams mature, it becomes even more urgent to fix master in a timely manner. Now, if master is red, it's a development outage, and it needs to be addressed with the utmost urgency. It is this sense of urgency and responsibility, which development teams have taken upon themselves to get things back to normal as quickly as possible, that is truly transformative. Given the business repercussions, associated costs, and productivity losses of a failing master branch, we expect broad adoption of this pattern across engineering teams of all sizes in the near future.
Let’s take a look at some practical things you can do to keep master green:
If you can’t measure it, you can’t improve it
This is true for many things, and it’s particularly true here. Before setting out on a journey to improve anything, one needs to be able to answer these questions:
How long does it take us to fix master when it breaks?
How often do we break master?
These questions may sound familiar. Change master to "production" and these questions have well-known acronyms in the operational world. For anyone who has hung around an SRE (site reliability engineer) long enough, the terms MTTR and MTBF come to mind.
MTTR, or mean time to repair/resolution, is a measure of maintainability. In the context of this article, MTTR answers the question: how long does it take to fix master? MTTR starts a running clock the moment a build fails and only stops when it's fixed. As failures continue to occur, the MTTR is averaged over a period of time.
MTBF, or mean time between failures, is a measure of reliability. In the context of this article, MTBF answers the question: how often do we break master? MTBF starts a running clock the moment a build goes from failing to passing and stops the next time a build fails. For example, for a team that breaks master once every week, the project's MTBF will be approximately 7 days. As the project continues to experience failures, the MTBF is averaged over a period of time.
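As a concrete illustration, here is a minimal Python sketch that derives both metrics from an ordered build history. The hard-coded `builds` list is a made-up example; a real implementation would pull timestamps and pass/fail status from a CI provider's API.

```python
from datetime import datetime

# Each build: (finished_at, passed), ordered oldest to newest.
builds = [
    (datetime(2020, 3, 1, 9, 0), True),
    (datetime(2020, 3, 2, 10, 0), False),  # master breaks
    (datetime(2020, 3, 2, 14, 0), True),   # fixed 4 hours later
    (datetime(2020, 3, 9, 11, 0), False),  # breaks again roughly a week later
    (datetime(2020, 3, 9, 12, 0), True),
]

def mean_hours(spans):
    return sum(spans) / len(spans) if spans else 0.0

repair_spans, failure_gaps = [], []   # red-to-green durations, green-to-red durations
last_break = last_fix = None
for finished_at, passed in builds:
    if not passed and last_break is None:
        last_break = finished_at
        if last_fix is not None:
            failure_gaps.append((finished_at - last_fix).total_seconds() / 3600)
    elif passed and last_break is not None:
        repair_spans.append((finished_at - last_break).total_seconds() / 3600)
        last_break, last_fix = None, finished_at

print(f"MTTR: {mean_hours(repair_spans):.1f} hours")  # how long master stays red
print(f"MTBF: {mean_hours(failure_gaps):.1f} hours")  # how long master stays green
```

With the sample data above, the sketch reports an MTTR of a few hours and an MTBF of roughly a week, in line with the example in the previous paragraph.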
Unfortunately, until now, there has been no easy or automated way to monitor these metrics. Few CI providers/tools provide insights into related information, and even fewer provide an API with which to more accurately calculate these metrics over time.
In Scope, we're adding both MTTR and MTBF to our users' service dashboards. That is, for every service for which Scope is configured, teams automatically see these values and trends tracked over time.
Now that you know what “normal” is, it’s time to figure out ways to improve MTTR and MTBF, in the never-ending quest to keep master evergreen.
MTTR — if you can’t debug it, you can’t fix it
MTTR is a measure of maintainability: how quickly a failure can be fixed. To fix a software problem, software engineers need to understand what the system was doing at the time of the error. To do this in production, SREs and developers rely on monitoring and observability tools to provide them with the information they need in the form of metrics, logs, traces, and more to understand the problem at hand. Once the problem is well understood, implementing a fix is usually the easy part.
However, the contrast between production and development is quite stark. In CI, developers lack the detailed and information-rich dashboards of production. Instead, they get a "build failed" notification, followed by a dump of the build's logs.
The problem is evident. Whereas in operations we have a seemingly endless amount of data, visibility, and high-cardinality dimensions with which to probe the system and quickly understand complex production issues, in CI developers are left in the dark, with nothing more than a log dump.
To deal with this problem, the organizations we've encountered had, before moving to Scope, either built a custom solution out of frustration with the lack of options on the market, or tried to jury-rig their production monitoring tools to work for CI, without much success. There are key differences between production, a long-lived system handling transactions, and CI, a set of short-lived and fast-changing environments running unit and integration tests, that make production tooling impractical to use in CI.
Teams interested in reducing the mean time to resolution in development to increase productivity, reduce CI costs, and increase business value ought to look no further. Scope provides low-level insights into builds, with visibility into each and every test, across commits and PRs. With distributed traces, logs, exceptions, performance tracking, and much, much more, teams using Scope are able to cut down their MTTR by more than 90%.
MTBF — a reliably flaky build
MTBF is a measure of reliability: how often master breaks. After we met with countless teams and reviewed their CI builds, the verdict was clear: the leading cause of build failures in today's software development teams is flakiness. If something is flaky, by definition, it cannot be reliable. As such, the best thing any engineering organization can do today to increase a project's MTBF is to improve how flakiness is managed in its test suite and codebase.
A lot has been written about flakiness. Most recently, Bryan Lee, from Undefined Labs, has written a great primer on flakiness; you should read it. In that post, Bryan lists a clear set of patterns to successfully handle flakiness:
Identification of flaky tests
Critical workflow ignores flaky tests
Timely flaky test alerts routed to the right team or individual
Flaky tests are fixed fast
A public report of the flaky tests
Dashboard to track progress
Advanced: stability/reliability engine
Advanced: quarantine workflow
Due in part to the limited visibility into CI that most development teams have today, the way developers deal with flakiness is quite rudimentary. When a build fails, and the developer suspects a flaky test is the reason behind it, they simply hit that retry button. Again and again, until the build passes, at which point they can proceed with their job. Not only is this wasteful and costly from an infrastructure perspective, it's also highly inefficient and unproductive, as developers may be blocked while a build is taking place.
Other, less rudimentary but still largely ineffective, solutions require developers to manually investigate builds in order to identify flakes; however, these flakes are rarely tracked properly or dealt with, given the overhead experienced by teams without testing visibility. Without the means to quarantine or exclude known flaky tests from builds in master, teams with flaky tests are quick to give up on any ambition to keep master green. This, of course, carries grave business consequences, as the team's ability to innovate and ship new features is hindered by their broken builds.
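As a rough sketch of what a quarantine workflow can look like, a team might keep a checked-in list of known flaky tests and mark them so they no longer break master builds. The example below assumes pytest; the `KNOWN_FLAKY` list is a hypothetical stand-in for wherever flaky-test history is actually recorded.

```python
# conftest.py -- a minimal sketch of quarantining known flaky tests with pytest.
import pytest

# Hypothetical list of quarantined tests; in practice this would come from a
# tracked file or a service that records per-test flaky history.
KNOWN_FLAKY = {
    "tests/test_checkout.py::test_payment_timeout",
    "tests/test_search.py::test_ranking_under_load",
}

def pytest_collection_modifyitems(config, items):
    for item in items:
        if item.nodeid in KNOWN_FLAKY:
            # xfail keeps the test running and reporting, but a failure no
            # longer turns the master build red.
            item.add_marker(pytest.mark.xfail(reason="quarantined: known flaky",
                                              strict=False))
```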
While this may be the case for most, there are those high-performing engineering organizations that have built internal tooling to address these challenges. Google, Microsoft, Netflix, Dropbox, among others, have built custom solutions to deal with flakiness and minimize the frequency at which master is red. The problem? These solutions are custom-built and not readily available for the rest of us.
To address this glaring problem, we're adding flaky test management features right into Scope, starting with flaky test detection, automatic test retries, and a dashboard to track all flaky tests. In this dashboard, teams can track every flaky test, its flakiness rate, the date and commit at which the test first exhibited flakiness, and its most recent flaky execution.
Another one of the biggest struggles developers face when attempting to improve very flaky codebases and test suites is: where do I even begin? While all flaky tests may seem like good candidates for fixing, prioritization is key when dealing with flaky test suites. If you're actively experiencing flakiness across hundreds of tests, trying to fix them all at once is futile. Instead, development teams should prioritize the tests that experience the highest rate of flakiness and are also the slowest to execute. Everything else being equal, these are the tests that have the most negative impact on your application and development processes.
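As a back-of-the-envelope illustration of that prioritization, assuming you already track each test's flakiness rate and average duration (the data below is made up):

```python
# Rank flaky tests by the rough cost they impose: flakier and slower first.
flaky_tests = [
    {"name": "test_payment_timeout", "flaky_rate": 0.12, "avg_duration_s": 45.0},
    {"name": "test_ranking_under_load", "flaky_rate": 0.30, "avg_duration_s": 8.0},
    {"name": "test_profile_render", "flaky_rate": 0.02, "avg_duration_s": 120.0},
]

def fix_priority(test):
    # Everything else being equal, high flakiness and long runtime hurt the most.
    return test["flaky_rate"] * test["avg_duration_s"]

for test in sorted(flaky_tests, key=fix_priority, reverse=True):
    print(f'{test["name"]}: priority score {fix_priority(test):.2f}')
```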
Our journey to the left
At Undefined Labs we're big believers in shifting left, and we are always looking for innovative and transformative ways to improve application development. Treating master as production may still raise a few eyebrows in this day and age, but as shown in this article, there is clear business value in doing so: faster feature rollouts, increased developer productivity, and reduced costs. Similarly, once-reserved-for-production indicators like MTTR and MTBF feel right at home as part of the development process, and they provide development teams with the responsibility and accountability they need to run their operations more efficiently.
When building Scope, we often ask ourselves: What happens when you apply the modern and sophisticated tooling that we have in production to problems in development? What happens when you close the feedback loop and use everything we know about our applications running in production to make more informed decisions during the development process? The possibilities are endless, and quite exciting! If this sounds interesting, make sure to follow along and stay tuned for more coming from us very soon!
In the meantime, happy testing!
The way we build applications has drastically changed with the rise of DevOps, microservices, and Cloud Native — but the standard practice of developer testing has remained static: run every test for every commit in CI and get no visibility whatsoever when things go wrong.
We’re building Scope to give teams a modern testing platform that provides a solution to the biggest pains in testing:
Debugging unit and integration tests
Identifying and managing flaky tests
Reducing time spent testing by 50–90% (leading to a dramatic decrease in CI costs)
Detecting regressions before they reach production
Keeping master green