How to prevent database problems from hitting production?


Adam Furmanek

Posted on November 2, 2023


Today’s cloud-native tech can empower engineering teams to iterate, at scale, faster than ever. As developers, we would like to focus on building great apps and spend less time on infrastructure and operations. Organizations want to ship new versions at high velocity. Many things might break with every release, but when the database breaks, the complexity of the problem and the time it takes to fix it are far greater than for issues in other parts of the application.

Let’s discuss common problems that might break our database, and what we can do to catch and prevent them as early as possible (“shift them left” as much as possible).

Remember that in most cases a rollback is simply not an option in the database world. Suppose the production environment ran for three hours before we discovered a big problem, and the data changed in the meantime. How can we roll back without losing all the changes that happened in those hours? We can technically restore from a backup, but that would bring the system back to its state from before the deployment. What about the changes made after we deployed the invalid code?

First, a Recap of the Most Common Database Issues

  1. Inefficient design leading to data loss
  2. Unstable changes in schema via migrations
  3. Slow queries lacking indexes or proper configuration
  4. Bugs leading to decreased data quality

Protect Your Database Today

Let’s see how we typically protect ourselves from the issues on the database level.

Solid design

We can start by evaluating the impact of the new code on the production database. Potential problems typically fall into one of the following categories.

Data loss

The top priority and number 1 concern is losing data. We can consider the following questions:

  1. Will the new version accidentally delete a table? 
  2. Will the new version delete a column?
  3. New requests (REST/GQL) may generate UPDATE and DELETE commands. Will these commands change only the data they should? In other words, how can we be sure the WHERE conditions will work as the product team planned?

We typically modify the database schema with Infrastructure as Code (IaC) tools like Terraform or CDK, so it’s likely that we’ll catch issues 1 and 2 in staging. However, issue 3 is much harder to spot because it happens at application runtime.
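
One way to shift concern 3 left is to run the generated UPDATE or DELETE inside a transaction and verify the number of affected rows before committing. Below is a minimal sketch of that guard; the `users` table, the query, and the threshold are hypothetical.

```python
import sqlite3

def archive_inactive_users(conn: sqlite3.Connection, expected_max: int = 100) -> int:
    """Run the UPDATE inside the implicit transaction and abort if it touches too many rows."""
    cur = conn.execute(
        "UPDATE users SET status = 'archived' WHERE last_login < '2020-01-01'"
    )
    affected = cur.rowcount
    if affected > expected_max:
        # The WHERE clause matched far more rows than the product team planned for.
        conn.rollback()
        raise RuntimeError(
            f"UPDATE would modify {affected} rows, expected at most {expected_max}"
        )
    conn.commit()
    return affected
```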

Schema changes

There are two important effects of schema changes (aka migrations).

First, they may cause permanent errors. The application server might not be able to successfully finish operations. There may be multiple reasons, like these:

  • A renamed table, column, view, or function
  • A different number of columns (usually after new columns were added) breaks INSERT or UPDATE statements that expect a specific number of columns in a specific order.
  • A changed column data type prevents INSERT or UPDATE operations from completing

To fix those problems you’ll have to quickly release another version that changes the application or the database.

Second, schema changes can make the database temporarily unavailable. The problem resolves itself, but it might take hours for the database to finish the operations, and no rollback is available. Operations prone to this issue include:

  • Creating a new index, or changing an existing one, on a large table. 
  • Changing the order of the columns or the data type of a column in a table. These operations may internally create a new table with a new structure and then copy the data to it. While the database engine manages all of that under the hood, the operation might take a long time when large tables are involved.

These issues may go unnoticed in staging, as it typically has much less data and doesn’t receive constant load.
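
One common mitigation for the long index build is to create the index without blocking writes. On PostgreSQL that means `CREATE INDEX CONCURRENTLY`, which must run outside a transaction block; here is a minimal sketch, assuming a psycopg2 connection and a hypothetical `orders` table:

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # connection details are illustrative
conn.autocommit = True  # CONCURRENTLY cannot run inside a transaction block

with conn.cursor() as cur:
    # Builds the index without taking a lock that blocks INSERT/UPDATE/DELETE,
    # at the cost of a slower build and extra cleanup if it fails halfway.
    cur.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_customer_id "
        "ON orders (customer_id)"
    )
```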

Inefficient queries

The new code may use different SQL queries. This may pose the following problems:

  • The ORM auto-generated SQL queries might not be efficient at all.
  • The SQL command reads a lot of data, usually because of a missing index. Sometimes the index exists but the Query Optimizer still decides not to use it, for various reasons. 
  • The SQL command returns a large data set: many rows, many columns or a few columns with long strings (JSON, images)
  • The SQL command joins many tables, but not all of them are necessary. This is a common problem with ORM frameworks when the developers rarely check the impact of the models on the underlying database.
  • The SQL command sorts a large number of rows
  • The execution plan is inefficient (high cost), uses many loops for data retrieval, or uses an inefficient JOIN type. Sometimes that happens even if all the statistics are up to date.

These queries may be fast enough in staging but may turn out to be too slow the moment they hit the production database. We may catch the problem with load tests, but that is still an approximation, as we often don’t have a production-sized dataset in staging.
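
One cheap way to shift these checks left is to inspect the execution plan in a test and fail when the database plans a full table scan. Here is a minimal sketch using SQLite’s `EXPLAIN QUERY PLAN` (the same idea works with `EXPLAIN` on PostgreSQL or MySQL); the `users` table and query are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
# conn.execute("CREATE INDEX idx_users_email ON users (email)")  # adding this index makes the check pass

rows = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?", ("a@example.com",)
).fetchall()
plan = " ".join(row[-1] for row in rows)  # the last column holds the plan description

# A full table scan shows up as "SCAN ..."; an index lookup as "SEARCH ... USING INDEX ...".
assert "SCAN" not in plan, f"Query falls back to a full table scan: {plan}"
```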

Data quality

We may inadvertently decrease data quality when the application server changes the data using the wrong logic. A typical example is using floating-point types (like double) to represent prices, or rounding errors when multiplying. Other issues involve inconsistent serialization and parsing (the client sends a date as month/day whereas the server interprets it as day/month), encoding of foreign characters, missing values (NULL vs ‘null’), or numeric format differences (3,14 vs 3.14).
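
The floating-point problem is easy to reproduce. Here is a short illustration of why monetary values belong in a fixed-point type such as Python’s `decimal.Decimal` (backed by a `NUMERIC`/`DECIMAL` column in the database):

```python
from decimal import Decimal

# Binary floating point cannot represent most decimal fractions exactly.
print(0.1 + 0.2)                                           # 0.30000000000000004
print(0.1 + 0.2 == 0.3)                                    # False

# Fixed-point arithmetic keeps the price the product team expects.
print(Decimal("0.1") + Decimal("0.2"))                     # 0.3
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))   # True
```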

These issues may go unnoticed for a long time, especially when we’re working in an Extract-Transform-Load (ETL) domain. Seemingly everything works well, but the issue compounds over time.

Contention

Some issues manifest themselves only under heavy load. If we don’t stress-test the database, we may miss lock-related problems. This is even harder in staging environments because we may not be able to replicate the true distribution of production traffic. Differences in locales, API usage, or request sizes may cause the issue to go unnoticed.

Test & Deploy

High-quality QA is needed to cover the long list of potential problems listed above, with a focus on preventing data loss and protecting data quality:

  • Write the QA tests and expected behaviors
  • Monitor the duration of each schema-changing operation
  • Deploy the new version
  • Analyze the logs to check what errors, if any, had occurred during the deployment

A common root cause of QA tests that should have failed but didn’t is using a small test DB that doesn’t look anything like the production DB. Queries run quickly, and data quality problems are hard to detect. Generating a DB similar to the production DB is not easy due to privacy and data protection regulations.

Observability

Classic monitoring of the production DB includes:

  • Main performance counters: CPU, memory, IO, tables usage, number of connections
  • Error logs
  • Changes in the duration of the REST / GQL commands. The application server must generate traces to map the REST / GQL commands to the underlying SQL commands.
  • Collecting the Slow Query Log
  • Collecting the Execution Plan (usually sampling around 5%) to get deeper insights into why a query runs slowly 
  • Correlating the deployment time of new versions with anomalies in DB behavior.
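
As a small illustration of the slow-query items above, on PostgreSQL the `pg_stat_statements` extension (when enabled) can be polled right after a deployment to surface the most expensive statements. This sketch assumes psycopg2 and PostgreSQL 13+; the extension must be installed, and column names differ in older versions:

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # connection details are illustrative

with conn.cursor() as cur:
    # Requires the pg_stat_statements extension to be installed and enabled.
    cur.execute("""
        SELECT query, calls, mean_exec_time, rows
        FROM pg_stat_statements
        ORDER BY mean_exec_time DESC
        LIMIT 10
    """)
    for query, calls, mean_ms, rows in cur.fetchall():
        print(f"{mean_ms:8.2f} ms avg | {calls:6} calls | {rows:8} rows | {query[:80]}")
```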

What makes a good solution for Database Guardrails?

To address all the potential problems listed above, busy developers need a good solution. Let’s start with some guidelines for what makes a good solution and then learn how to implement one. Some of the guidelines might look trivial, but their implementation is far from it.

Shift Left

Shift Left is a practice intended to find and prevent problems early in the software delivery process. The idea is to improve the quality by moving tasks in the lifecycle to the left as early as possible. Since database-related problems tend to have a long Mean Time to Resolution (MTTR) in the production environment, finding the problem during development becomes very important. 

Many database-related problems are not detected in the test environment, as the test database doesn’t represent the size, cardinality (data distribution), and data complexity of the production environment well.

Smart Monitoring

Since we want to detect problems as early as possible, another key element of the solution is to monitor the production environment smartly. We can focus more on recent changes to detect problems as soon as possible. Not every problem results in the CPU jumping to 100% or thousands of errors. Therefore, on top of basic, general monitoring, the system should focus on the changed objects and carefully search for anomalies in performance and data quality.

Be proactive and provide context

Be proactive - when something goes wrong, the owner of the problem should know about it immediately. In reality, it is hard to expect developers to open the monitoring tools with every release (which can happen 5+ times a week) and search for problems. The relevant data should be proactively sent to them, into the tools and workflows they already use.

Provide context - it is not enough to send data proactively to the person responsible for solving the problem. Instead of showing raw data broken into siloed data sets, the system can help identify correlations to save time. Part of the application becomes sluggish? Let’s see what code changes were made recently and which part of the database was affected. The CPU is high? Let’s see if something in the configuration or schema was changed recently.

Yet another step is automated investigation. It is more than just providing the context of the problem: it is a predefined workflow of collecting raw data, usually from multiple sources, and analyzing it. The automated investigation process should suggest potential root causes and bring the data to support or refute each suggestion.

Consider execution plans. Since many developers are not familiar with DB internals, analyzing the execution plan of a query might be a time-consuming task. Therefore, the automated investigation flow can retrieve the execution plan, analyze how efficient it is, and highlight problems such as table scans.

Finally, suggest a Remediation Plan. The highest level of a proactive observability solution is to suggest to the users how to solve the problem. Provide scripts to run and instructions on how to monitor the effectiveness of the suggested solution. 

Implementing a Proactive Database Observability Solution

Now that we have a good understanding of what a proactive database observability solution means, we can look at the tools and technologies that can be used to build one. Unfortunately, there is no one magic product to solve all the problems but many tools that “talk” with each other.

Run Tests on a Database Similar to the Production database

  • Use a database that is large enough and has a wide data distribution so query times resemble the production environment. If the tables are not big enough, consider using open-source libraries, such as Faker, to quickly generate millions of rows (see the sketch after this list).
  • Monitoring the execution time itself is not enough; you should also analyze the execution plan to understand how efficient the query is and how it will perform in production, running many times a day, concurrently. Tools such as explain.dalibo, depesz, or Metis can help.
  • Integrate data quality tools, such as Great Expectations, to automate testing of what data is returned in the result set and what data is modified in the tables, for each REST / GQL command.
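
As referenced in the first item, here is a minimal sketch of growing a test table with Faker so query times start to resemble production; the `customers` table and the row count are hypothetical:

```python
import sqlite3
from faker import Faker  # pip install Faker

fake = Faker()
conn = sqlite3.connect("test.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT, city TEXT)"
)

# One 100k-row batch; loop over more batches to reach production-like volumes.
batch = [(fake.name(), fake.email(), fake.city()) for _ in range(100_000)]
conn.executemany("INSERT INTO customers (name, email, city) VALUES (?, ?, ?)", batch)
conn.commit()
```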

Run Tests on a Copy of the Production database

Another approach is to generate a copy of the production database and run the quality assurance tests on it. Testing on a copy of the production database gives a clear understanding of how the queries will perform and how much time it would take to change the schema.

The main cloud providers can generate a copy of the database. Some vendors do not require a manual backup and restore, which might take hours, but support advanced technologies to create the copy quickly. A much more complicated challenge is privacy: to comply with regulations such as HIPAA, CCPA, or GDPR, the testers should not have access to the production data.

One solution is applying masking to the data. The main cloud vendors and/or the database engine providers support masking sensitive columns. We can also implement the masking using a proxy, such as Masquerade. But masking can be a complicated task, and the legal team might need to be involved to ensure we don’t violate any regulations.
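
In the simplest case, masking can be a handful of UPDATE statements run on the copied database before testers get access. Here is a sketch with a hypothetical `customers` table; real masking usually has to preserve formats and uniqueness, and should be reviewed with the legal team:

```python
import sqlite3

conn = sqlite3.connect("staging_copy.db")  # the restored copy, never production itself

# Replace direct identifiers with deterministic, format-preserving placeholders.
conn.execute("UPDATE customers SET email = 'user_' || id || '@example.com'")
conn.execute("UPDATE customers SET name  = 'Customer ' || id")
conn.execute("UPDATE customers SET phone = NULL")
conn.commit()
```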

We can also test the performance using only metadata. The test doesn’t need to return the actual data of the query; it only needs to report how much time it took, the cost (IO and CPU), and how many rows were scanned and returned, or how long it took to create an index or change the schema of a table.
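
A metadata-only check can time the statement and count the rows it returns without ever inspecting the values. A small sketch, again with hypothetical table and query names:

```python
import sqlite3
import time

conn = sqlite3.connect("staging_copy.db")

start = time.perf_counter()
cur = conn.execute("SELECT id FROM orders WHERE created_at >= '2023-01-01'")
row_count = sum(1 for _ in cur)              # count rows without keeping the data
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"{row_count} rows in {elapsed_ms:.1f} ms")  # assert against thresholds in a real test
```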

Provide Context

We mentioned that providing context is important. Databases don’t live in a vacuum; the SQL commands are generated by code handling REST or GQL commands. A good observability solution uses traces, generated by the web app and the DB, to show which REST / GQL command generated which SQL commands.

Generating those traces is easy using OpenTelemetry, an open-source observability framework. It offers vendor-neutral APIs, software development kits (SDKs), and other tools for collecting telemetry data from cloud-native applications and their supporting infrastructure to understand their performance and health. OpenTelemetry provides a universal specification of metrics, logs, and traces, supported by many vendors.

The community provides and maintains SDKs for all the main ORMs and middleware frameworks. The instrumentation is simple and well documented; some SDKs even provide auto-instrumentation with 2-3 lines of code. Application Performance Monitoring (APM) vendors also provide the code to send the traces to their backends: DataDog, Jaeger, Logz.io, SigNoz, and Aspecto, to name a few.
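
As an example of how little code the instrumentation needs, here is a sketch using the OpenTelemetry Python SDK with its SQLAlchemy instrumentation (packages from the opentelemetry-python and opentelemetry-python-contrib projects); the exporter and connection string are illustrative:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from sqlalchemy import create_engine, text

# Set up a tracer that prints spans to stdout; swap the exporter for Jaeger, Datadog, etc.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# One line of instrumentation: every SQL statement issued by this engine becomes a span.
engine = create_engine("sqlite:///app.db")
SQLAlchemyInstrumentor().instrument(engine=engine)

with engine.connect() as conn:
    conn.execute(text("SELECT 1"))  # emitted as a span, linked to the surrounding request trace
```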

Example of a Trace. Every part of the trace is called a span.  

(The image was taken from the blog post: Jaeger Tracing: A Friendly Guide for Beginners.)

Traces like the ones above show the true history of a user flow, which gives us a better understanding of what’s running across the system.

Automated Investigation

Database observability only means the raw data is at hand. The next step is writing scripts for automated investigation of the data and the root cause. With such predefined workflows, the user can save a lot of time and reduce the cognitive load by getting immediate answers. Even an automated answer like “This is not the root cause, and here is the data to support that claim” helps a lot.

Let’s see an example of an automated investigation performed by Metis:


We can see insights immediately showing what may be wrong with the query. This simplifies our reasoning about the problem and allows us to arrive at solutions much faster.

Conclusion

Database problems in production can be hard to solve, so developers should take the necessary steps to prevent them and to discover them as soon as possible. Luckily, some solutions can help busy developers stay on top of the development lifecycle and release new versions with confidence.

If you are interested in learning how database observability can help your engineering team stay on top of their databases, ship better code, and reduce MTTR, check out Metis for more information.
