How to Maintain Quality When Transitioning from Monolith to Microservices

Piece by piece, legacy monolith applications are being broken down and replaced by microservices. Organizations large and small are making the transition, but that doesn’t mean the transition is easy. The difficulty of transitioning well has held up many organizations, but there are now options that make it easier. In this post, we’ll look at some of the critical metrics to monitor when making the transition, along with helpful tools to get you from monolith to microservices.

The why and the how

Organizations make the switch to microservices for several reasons. When an application is broken into small pieces, those pieces are easier to test and quicker to deploy. With this modularization also comes more clearly scoped responsibilities for developers and teams.

However, even the most motivated and competent company needs to ask the important “how” questions to ensure a successful transition:

How do we maintain quality with a massive code rewrite?
How do we make sense of all the moving parts?
How do we observe our environments?
How do we monitor the impact?

The answers to these questions come down to two primary areas: observability and monitoring. While many developers conflate the two terms, there are some nuanced differences between them.

Observability comes first in the chain. A system or application must be observable before it can be monitored. In a practical sense, that could mean installing OS-level services or agents or, in the case of an application, exposing a /metrics endpoint. Once that critical information is exposed, then it can be monitored. Monitoring tells you what is (or will soon be) broken and how it reached that state.
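As a rough illustration of what exposing a /metrics endpoint can look like, here is a minimal sketch that uses only the Python standard library and renders two counters in the Prometheus text exposition format. The metric names, the METRICS dictionary, and the port are assumptions made up for this example; a real service would more likely use an official client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters; a real service would update these
# as it handles traffic.
METRICS = {
    "app_requests_total": 0,
    "app_errors_total": 0,
}


def render_metrics(metrics):
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name in sorted(metrics):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {metrics[name]}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served; everything else is a 404.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block and serve /metrics (the port number is an arbitrary choice)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint; once it is running, a monitoring system can poll http://localhost:8000/metrics on a schedule.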

As you make the transition from monolith to microservices, what should you observe and monitor, and what tools will you use to do it?

What to watch when you’re making the transition

Your transition from monolith to microservices should be transparent to your users. In order to accomplish that goal, your monitoring system should be able to answer certain key questions.

Are we meeting our customers’ needs by providing sufficient uptime and availability?
Are my applications responding quickly enough?
How quickly can we be aware of an issue to troubleshoot it?
How are the developers managing the change?

Let’s look in more detail at each of these and how we can answer them:

Are we meeting our customers’ needs by providing sufficient uptime and availability?

In most cases, you will already have the answer to this question for your current monolith application. You’ll know the amount of uptime for your customer-facing applications, and you’ll know how much downtime is caused by deployments or unplanned outages.

In the context of microservices, tracking uptime is similar but requires more data points, because some microservices sit on the “critical path” of a user request. For example, if you extract your login logic as a separate microservice, the availability of the frontend microservice may go up. However, login service downtime will still have a significant impact on your users.

In other words, the answer to this question is more complex with microservices, but proper tooling and the ability to trace a request from start to finish will help you get there.
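To make the critical-path point concrete, here is a small worked sketch. It assumes that a request succeeds only if every service on its path is up and that the services fail independently, in which case the path availability is the product of the individual availabilities. The service names and percentages are invented for illustration.

```python
from math import prod


def path_availability(availabilities):
    """Availability of a request path in which every service must be up.

    Assumes failures are independent, so the individual values multiply.
    """
    return prod(availabilities)


# Hypothetical availabilities for the services a login request touches.
critical_path = {"frontend": 0.9995, "login": 0.999, "orders": 0.999}

overall = path_availability(critical_path.values())
downtime_minutes = (1 - overall) * 30 * 24 * 60  # expected over 30 days

print(f"path availability: {overall:.4%}")
print(f"expected downtime per 30 days: {downtime_minutes:.0f} minutes")
```

Under these assumptions the path is less available than its weakest service, which is why per-service uptime numbers alone can look healthier than what users experience.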

Are my applications responding quickly enough?

Within a monolith application, the moving parts are closer together—all the spaghetti is in the same bowl. A transition to distributed microservices will impact the responsiveness of your applications, since a request no longer travels through a monolith, but instead may spawn several requests to different microservices.

In order to answer this question, you need to monitor both your application and your infrastructure, and you need a way to visualize what that monitoring finds. Measuring latency from request to result, and tracing each request through multiple microservices and systems, will provide you with the insights and answers you need.
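One way to picture a request-to-result metric is sketched below. It assumes each request records how long it spent in each service (one hop after another), sums those into an end-to-end latency, and summarizes many requests with nearest-rank percentiles. The service names and timings are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def end_to_end_ms(hops):
    """Total latency of one request whose hops run one after another."""
    return sum(hops.values())


# Hypothetical per-service timings (milliseconds) for four requests.
requests = [
    {"gateway": 5, "login": 20, "orders": 40},
    {"gateway": 6, "login": 25, "orders": 45},
    {"gateway": 5, "login": 22, "orders": 300},
    {"gateway": 4, "login": 18, "orders": 38},
]

totals = [end_to_end_ms(r) for r in requests]
p50 = percentile(totals, 50)
p95 = percentile(totals, 95)

# For the slowest request, find which service contributed the most time.
slowest = max(requests, key=end_to_end_ms)
slowest_hop = max(slowest, key=slowest.get)

print(f"p50={p50}ms p95={p95}ms slowest hop={slowest_hop}")
```

The median looks fine here while the tail does not, and the per-hop breakdown points at the service responsible, which is the kind of answer a single monolith-wide response time cannot give.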

How quickly can we be aware of an issue to troubleshoot it?

A breaking issue in a monolith application can bring the entire system to a grinding halt. With a system built on decoupled and modular microservices, however, an issue in one microservice may be latent and escape attention.

Ultimately, the ability to identify issues quickly comes down to the intersection of observability and monitoring. The right parts of a microservice need to be observable so they can be monitored. The alert needs to have pertinent information to speed up troubleshooting and resolution time. For example, a “High CPU” alert with no other information is hardly useful. How much more useful would it be to have an alert that says “High CPU maintained on [system] for [time period]” with a snapshot of processes utilizing a lot of CPU over the past several minutes? This kind of alerting would decrease resolution time substantially.
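The difference between a bare “High CPU” alert and a useful one can be sketched in a few lines. The example below assumes one CPU sample per minute, each carrying a snapshot of the busiest processes; the Sample type, the threshold, and the minimum duration are all illustrative choices rather than any particular tool’s behavior.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    minute: int           # minutes since the start of observation
    cpu_percent: float    # overall CPU utilization at that minute
    top_processes: list   # (process name, cpu percent) pairs


def sustained_cpu_alert(system, samples, threshold=90.0, min_minutes=5):
    """Return alert text if the latest samples stay at or above the
    threshold for at least min_minutes, otherwise None."""
    streak = []
    for sample in reversed(samples):
        if sample.cpu_percent < threshold:
            break
        streak.append(sample)
    if len(streak) < min_minutes:
        return None

    # Keep the peak CPU seen for each process during the streak.
    peak = {}
    for sample in streak:
        for name, cpu in sample.top_processes:
            peak[name] = max(peak.get(name, 0.0), cpu)
    top = sorted(peak.items(), key=lambda kv: kv[1], reverse=True)[:3]
    procs = ", ".join(f"{name} ({cpu:.0f}%)" for name, cpu in top)

    return (
        f"High CPU maintained on {system} for {len(streak)} minutes "
        f"(>= {threshold:.0f}%). Top processes: {procs}"
    )
```

Because the function only fires on a sustained streak, a brief spike does not page anyone, and when it does fire the message already names the likely culprits.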

How are the developers managing the change?

Developer sentiment may be less measurable than uptime or response time, but it is hugely important. While attrition is a risk at any stage in the lifecycle of a business, it can be a business killer when you’re in the middle of a major transition.

The simplest way to answer this question is through informal conversation with the teams involved or a more formal survey of your developers. Even though your development teams—feeling the monolith pain—are motivated to do this transition, it’s still important to stay in tune with how they’re feeling.

Using tools to help

Tooling is important. No doubt about it. Tooling helps you define and measure the service-level indicators (SLIs) that show whether you are meeting your service-level objectives (SLOs). With good tooling, you can get up and running quickly and with fewer headaches.

Transitioning to microservices should have a net positive impact on your SLIs/SLOs, but the only way to know for certain is through a holistic view of your environment with good observability—and even better monitoring.
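As a small worked example of how an SLI relates to an SLO, the sketch below computes an availability SLI from request counts and then the share of the error budget that remains. The request counts and the 99.9% objective are made-up numbers, and the function names are illustrative.

```python
def availability_sli(good_requests, total_requests):
    """Fraction of requests that succeeded (1.0 if there was no traffic)."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(sli, slo):
    """Share of the error budget left for an SLO below 100%.

    1.0 means untouched, 0.0 means fully spent, negative means the
    objective has been missed.
    """
    allowed_failure = 1.0 - slo
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


# Hypothetical numbers: 100,000 requests, 50 failures, a 99.9% objective.
sli = availability_sli(99_950, 100_000)
remaining = error_budget_remaining(sli, slo=0.999)

print(f"SLI={sli:.4%}, error budget remaining={remaining:.0%}")
```

Tracking the same calculation before and during the migration gives you a direct way to check whether the move to microservices is helping or hurting.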

Roll your own or open source?

When deciding which tools to use, the reflex of many organizations is to build their own. After all, who knows your observability and monitoring needs better than your own developers or SRE team? The honest truth is that “rolling your own” tooling—especially to be effective and accurate—is terribly challenging and notoriously error prone. Most organizations find that it’s not worth it, and they regret finding out the hard way.

The next best option is to go the open-source route. A Prometheus + Jaeger + Grafana stack will give you a good portion of what you’d need during the transition.

In this setup, you would use Prometheus clients installed on your system or included as libraries in your application code. The clients capture metrics and expose them for a Prometheus server to scrape and store in a time-series database.

Jaeger performs distributed tracing, capturing timing data and metadata for requests as they wind their way through a system of microservices.
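Distributed tracing depends on every service passing a shared trace identifier along with each outgoing call. The sketch below illustrates the idea using the W3C Trace Context traceparent header layout (version, trace ID, parent span ID, flags). Treat the choice of header as an assumption for illustration: your tracing library may use a different format, and in practice the library handles propagation for you.

```python
import secrets


def new_traceparent():
    """Start a new trace: fresh 16-byte trace ID and 8-byte span ID."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"  # flags 01 = sampled


def child_traceparent(incoming):
    """Header for a downstream call: same trace ID, new span ID."""
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


# The frontend starts a trace, then calls a (hypothetical) login service.
frontend_header = new_traceparent()
login_header = child_traceparent(frontend_header)

print(frontend_header)
print(login_header)
```

Because both headers carry the same trace ID, a tracing backend can stitch the spans reported by each service into one end-to-end view of the request.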

Meanwhile, Grafana uses Prometheus and Jaeger as data sources to provide visualizations and dashboards.

This open-source setup gives you the opportunity to modify and configure the tools to your needs. It also likely covers some of your general use cases out of the box. The downside here, of course, is that with every tool, you need to keep up with releases, security patches, and configuration drift, not to mention teaching everyone on the team how to use and maintain each tool. In addition, open-source solutions often bring scaling challenges down the road. As more microservices are built, the costs for both managing the software and storing the telemetry data begin to rise sharply.

Going with the tried and true

For a task this important, having a well-established vendor with monitoring and observability prowess may be the preferred way to go. Some options include:

Splunk: A data platform that is “data source agnostic,” able to ingest metrics, logs, and traces, supporting hybrid-cloud and multi-cloud architectures.
AppDynamics (Cisco): A full stack observability platform that provides visibility into every component of a distributed application, supporting integrations for automated issue mitigation.
Dynatrace: An “all-in-one” platform that handles infrastructure monitoring, application and microservices monitoring and security, and cloud automation.
AppOptics (SolarWinds): An application performance and infrastructure monitoring tool for hybrid and cloud-native environments.
Lightstep: A monitoring and observability platform with a special focus on change management, showing how code or infrastructure changes affect application performance.

To explore how this plays out in the monolith-to-microservices transition, let’s look at an actual implementation using one of the above tools, Lightstep.

Lightstep designs its products with a specific focus on the monolith-to-microservices transition. It combines observability and monitoring into a single view that provides a holistic picture of the monolith-to-microservices journey. Many of its features are directly applicable to microservices. We’ll cover several key features and use the sandbox to see how they help answer some of our questions around maintaining quality.

Change Intelligence

The Change Intelligence feature helps you connect a problem in the application with the specific code change that introduced it. You can review a metric and identify related traces (and even deployments) associated with that metric.

Let’s say you see a CPU utilization spike and want to dive into the details. Here’s what you would do:

Step 1: Identify the anomaly and highlight it.

This gives you the “What caused this change?” prompt. Click the prompt when it appears.

Step 2: Review the changes in the Warehouse service.

We can see on the last line that there was a change in the version of the service captured in this group of traces that coincides with the spike in CPU utilization.

The functionality is easy to understand. With Lightstep, you no longer need to manually correlate cause and effect, or hope that you’ve written your alerting or logging rules correctly. You’ll be able to see traces for all of your microservices and correlate error rates and resource spikes directly to changes in the environment.
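To show the kind of correlation being automated here, the sketch below does a naive version by hand: it finds the first point where a metric crosses a threshold and lists the deployments that landed shortly before it. This is not Lightstep’s algorithm or API; the data shapes, service names, versions, and the time window are all hypothetical.

```python
def find_spike_start(series, threshold):
    """First timestamp whose value is at or above threshold, or None."""
    for ts, value in series:
        if value >= threshold:
            return ts
    return None


def changes_before(spike_ts, deployments, window):
    """Deployments within `window` time units before the spike, newest first."""
    hits = [d for d in deployments if 0 <= spike_ts - d["ts"] <= window]
    return sorted(hits, key=lambda d: d["ts"], reverse=True)


# Hypothetical CPU utilization samples as (minute, percent) pairs.
cpu_series = [(0, 30), (1, 32), (2, 31), (3, 88), (4, 91)]

# Hypothetical deployment log.
deployments = [
    {"service": "frontend", "version": "v7", "ts": -50},
    {"service": "warehouse", "version": "v2", "ts": 2},
]

spike = find_spike_start(cpu_series, threshold=80)
suspects = changes_before(spike, deployments, window=5)

for d in suspects:
    print(f"{d['service']} {d['version']} deployed at minute {d['ts']}")
```

Even this toy version needs a clean deployment log and sensible thresholds to be useful, which hints at why doing the correlation manually across dozens of services becomes impractical.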

Infrastructure and application monitoring

Using the dashboards makes it easier to see all of the moving parts. The Service Diagram enables a holistic view of your microservices. It shows the direction of traffic flow, whether or not errors are occurring, and general metrics of the flow between services.
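Conceptually, a service diagram is an aggregation of individual calls into edges between services. The sketch below illustrates that idea with hypothetical call records of the form (caller, callee, succeeded); it is a simplified model for understanding the view, not a description of how any particular product builds it.

```python
from collections import defaultdict


def build_service_edges(calls):
    """Aggregate call records into per-edge request and error counts."""
    edges = defaultdict(lambda: {"requests": 0, "errors": 0})
    for caller, callee, ok in calls:
        edge = edges[(caller, callee)]
        edge["requests"] += 1
        if not ok:
            edge["errors"] += 1
    return dict(edges)


# Hypothetical call records collected from traces.
calls = [
    ("frontend", "login", True),
    ("frontend", "login", False),
    ("frontend", "orders", True),
    ("orders", "warehouse", True),
]

edges = build_service_edges(calls)
for (caller, callee), stats in sorted(edges.items()):
    print(f"{caller} -> {callee}: {stats['requests']} requests, {stats['errors']} errors")
```

Each key gives the direction of traffic and each value gives the volume and error count, which are the same three things the diagram puts on screen.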

Distributed Tracing

Another feature worth noting is distributed tracing, which helps you wrangle all your microservices into one view. A trace shows you the duration of each phase of a request, which endpoints it hit, and numerous other bits of information that help you make sense of what went wrong and how to fix it.

Wrap up

If your organization’s transition from monolith to microservices hasn’t happened yet, it will likely happen soon. The challenge for every organization is to ensure the transition is seamless while also maintaining (or improving) the quality of your application from start to finish. Observability and monitoring are key to maintaining quality, and that’s hard to come by without a comprehensive tool that provides insights and a holistic view of the changes you make along the way. Fortunately, tools like Lightstep’s provide that bird’s eye view for developers, allowing for greater continuity and smoother transitions.
