Running Database Changes With Zero Downtime


At Lob we receive millions of HTTP requests from our customers daily. Our Print and Mail API facilitates the print and delivery system that is our core product. These requests lead to the creation of new records that represent letters, postcards, self mailers, accounts, and many other mailable and non-mailable resources. Over the last eight years, many of our database tables have accumulated a massive volume of data. Herein lies the problem.

A naive approach to executing column changes

We recently needed to convert several columns from json to jsonb. I had never worked with a database that contained more than tens of thousands of entries, so I naively thought that the process would be easy.

My original approach was to do something like this for each table:
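
The exact migration isn't reproduced in this extract; a minimal sketch of that naive approach, written as a Knex migration against a hypothetical letters table with a metadata column, might look like this:

```js
// migrations/naive_json_to_jsonb.js (sketch; table/column names are hypothetical)
// ALTER COLUMN ... TYPE rewrites the table and holds an ACCESS EXCLUSIVE lock
// for the entire rewrite, which is exactly the problem described below.
exports.up = function (knex) {
  return knex.raw(
    'ALTER TABLE letters ALTER COLUMN metadata TYPE jsonb USING metadata::jsonb'
  );
};

exports.down = function (knex) {
  return knex.raw(
    'ALTER TABLE letters ALTER COLUMN metadata TYPE json USING metadata::json'
  );
};
```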

I quickly ran into problems. At Lob, we pride ourselves on API availability, but this approach would lock tables and jeopardize that availability: ALTER TABLE statements in Postgres hold an ACCESS EXCLUSIVE lock, the most restrictive table-level lock mode, while the operation runs. While an ACCESS EXCLUSIVE lock is held, no concurrent query statements can be executed on the table, and changing a column's type rewrites every row. Bear in mind that our tables have a massive volume of data. Locking these tables for a column type change would essentially deny customers the use of our Print and Mail API for hours, possibly days.

Approaching column changes with zero downtime

My approach for tackling this zero downtime constraint was inspired by PostgreSQL at Scale: Database Schema Changes Without Downtime.

Below is my approach in consideration of the zero downtime constraint:

  1. Run a database migration script that creates a new column of type jsonb for each json column in affected tables.

  2. Make changes to our Print and Mail API to dual write to both the json and jsonb columns.

  3. Backfill the new jsonb column with the values from the pre-existing json column.

  4. After backfilling the jsonb columns, run a database migration script to drop the json column and rename the jsonb column to the name of the json column. Note: It is important to run both the ALTER TABLE … DROP COLUMN queries and the ALTER TABLE … RENAME COLUMN queries in a single transaction. Other queries from our Print and Mail API could hit the table in the window between dropping a table's json column and renaming the jsonb column. Those queries could try to write to the old column name, and because the jsonb column would not yet be renamed to it, they would fail.

  5. Change the API to write only to the new column.

You might be wondering, “Why write to both columns in Step 2?” Because zero downtime is the most important consideration in this project. New records can be added to the tables we are changing while we are changing them, so we need the new jsonb column to capture that data as it arrives.

I used Node.js along with Knex, a SQL query builder for relational databases. Migration scripts are executed using the Knex command-line interface. Below is the code as it corresponds to each step of the process:

Step 1

Create a new column of type jsonb for each table with a json column
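
The original migration code isn't shown in this extract; a minimal sketch using Knex's schema builder, with the same hypothetical letters table and metadata column, might look like this:

```js
// migrations/add_jsonb_columns.js (sketch; table/column names are hypothetical)
// Adds a nullable jsonb twin for each existing json column. Repeat the
// alterTable call for every affected table/column pair.
exports.up = function (knex) {
  return knex.schema.alterTable('letters', (table) => {
    table.jsonb('metadata_jsonb').nullable();
  });
};

exports.down = function (knex) {
  return knex.schema.alterTable('letters', (table) => {
    table.dropColumn('metadata_jsonb');
  });
};
```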

Step 2

Enable dual writing to both the json and jsonb columns.

Example in pseudocode (wherever the controller method lives for creating a new record in a table with a json-typed column):
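
The original pseudocode isn't included in this extract; a rough sketch of the idea, assuming a hypothetical createLetter controller method and the column names used above, could look like this:

```js
// Hypothetical controller method for a table with a json column (metadata)
// and its new jsonb twin (metadata_jsonb).
async function createLetter(knex, attributes) {
  const [letter] = await knex('letters')
    .insert({
      ...attributes,
      metadata: attributes.metadata,       // keep writing the existing json column
      metadata_jsonb: attributes.metadata, // dual write to the new jsonb column
    })
    .returning('*');

  return letter;
}
```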

Step 3

The backfill script, sketched below. The script is executed via the command line like so:

node backfill-jsonb-columns-script.js <loop_count> <batch_size> <timeout_delay in ms>

  • loop_count: How many times to execute the update query
  • batch_size: How many records to update in one UPDATE statement
  • timeout_delay: How long to wait between each batch update
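
The script itself isn't reproduced in this extract; a minimal sketch of a batched backfill with Knex, using the same hypothetical letters/metadata names, might look like this:

```js
// backfill-jsonb-columns-script.js (sketch; table/column names are hypothetical)
// Usage: node backfill-jsonb-columns-script.js <loop_count> <batch_size> <timeout_delay in ms>
const knex = require('knex')({
  client: 'pg',
  connection: process.env.DATABASE_URL, // hypothetical connection setup
});

const [loopCount, batchSize, timeoutDelay] = process.argv.slice(2).map(Number);

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function backfill() {
  for (let i = 0; i < loopCount; i++) {
    // Copy json values into the jsonb column for one batch of rows
    // that have not been backfilled yet.
    const result = await knex.raw(
      `UPDATE letters
         SET metadata_jsonb = metadata::jsonb
       WHERE id IN (
         SELECT id FROM letters
         WHERE metadata_jsonb IS NULL AND metadata IS NOT NULL
         LIMIT ?
       )`,
      [batchSize]
    );

    // rowCount is the pg driver's count of updated rows.
    if (result.rowCount === 0) break; // nothing left to backfill

    // Pause so concurrent writes to the table are not starved.
    await sleep(timeoutDelay);
  }

  await knex.destroy();
}

backfill().catch((err) => {
  console.error(err);
  process.exit(1);
});
```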

Step 4

Drop the json column and rename the jsonb column to the name of the dropped column.
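
A minimal sketch of this migration with Knex, doing the drop and the rename inside one transaction (hypothetical names again):

```js
// migrations/swap_json_for_jsonb.js (sketch; table/column names are hypothetical)
// The drop and the rename happen in a single transaction so no query can
// observe the table in between the two changes.
exports.up = function (knex) {
  return knex.transaction(async (trx) => {
    await trx.schema.alterTable('letters', (table) => {
      table.dropColumn('metadata');
    });
    await trx.schema.alterTable('letters', (table) => {
      table.renameColumn('metadata_jsonb', 'metadata');
    });
  });
};

exports.down = function () {
  // Not safely reversible: the original json data is gone once `up` has run.
  throw new Error('Irreversible migration');
};
```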

Step 5

Revert changes made in Step 2.
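
Continuing the hypothetical controller sketch from Step 2, the dual write goes away and only the renamed column is written:

```js
// After the Step 4 rename, `metadata` is the jsonb column; the dual write is removed.
async function createLetter(knex, attributes) {
  const [letter] = await knex('letters')
    .insert({ ...attributes, metadata: attributes.metadata })
    .returning('*');

  return letter;
}
```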

Considerations

In order to make the column type change from json to jsonb, we need to consider two things for the query statements we execute.

  • Would the statement lock the table from other concurrent query statements?
  • If the table is locked, is it possible to perform the operation quickly?

In order to answer these questions, we need to understand what SQL queries are being executed at each step of my approach.

Step 1

Create a new column of type jsonb for each table with a json column.

SQL Statements:

ALTER TABLE… ADD COLUMN…

ALTER TABLE… ADD COLUMN…

...

Would the statement lock the table from other concurrent query statements?

Yes. This type of ALTER TABLE… statement acquires an ACCESS EXCLUSIVE table-level lock before executing.

If the table is locked, is it possible to perform the operation quickly?

Yes. ALTER TABLE… ADD COLUMN… executes very quickly here because the new column is added with no default: existing rows are not rewritten, they simply read the new column as NULL.

Solution: No need to remedy. All tables should be locked for a very short period of time.

Step 2

This step does not execute any SQL queries.

Step 3

The backfill script.

SQL Statements:

UPDATE… SET…

UPDATE… SET…

Would the statement lock the table from other concurrent query statements?

Partially. Reads to the table can happen concurrently, and UPDATE statements in Postgres only acquire a ROW EXCLUSIVE table-level lock, which does not block other reads or writes at the table level. The problem is row-level locking: every row touched by the UPDATE stays locked until the transaction commits, so a long-running UPDATE blocks concurrent writes to those rows for its entire duration.

If the table is locked, is it possible to perform the operation quickly?

Negative. A single UPDATE… SET… query per table would take hours to execute and would hold locks on an enormous number of rows for the length of one huge transaction. Reads would still go through, but writes to the affected rows would stall for hours, which would amount to a denial of our service.

Solution: The remedy can be seen in the code for the backfill script. UPDATE… SET… queries are executed in a loop and done in batches, and a timeout between batches gives other write operations a chance to run against the affected tables. The script can be run and stopped as many times as needed, which means we can spread the backfilling process over different spans of time rather than doing it all at once.

Step 4

Drop the json column and rename the jsonb column to the name of the dropped column.

SQL Statements:

ALTER TABLE... DROP COLUMN…

ALTER TABLE... DROP COLUMN…

ALTER TABLE... RENAME COLUMN…

ALTER TABLE... RENAME COLUMN…

Would the statement lock the table from other concurrent query statements?

Yes. Both ALTER TABLE… DROP COLUMN… and ALTER TABLE… RENAME COLUMN… acquire an ACCESS EXCLUSIVE lock, and because they run in a single transaction the table stays locked until that transaction commits.

If the table is locked, is it possible to perform the operation quickly?

Yes. Both statements only touch the system catalogs (dropping a column does not rewrite the table, and renaming is a metadata-only change), so they execute very quickly.

Solution: No need to remedy. All tables should be locked for a very short period of time.

Step 5

This step does not execute any SQL queries to our database.

Conclusion

Like most organizations with a massive volume of data, Lob needs to ensure that our services have zero downtime. To do so, Lob software engineers need to think carefully about the database changes they make: migrations and backfill scripts need to be planned in advance to ensure availability.

Editor’s Note: Jay Huang was a summer intern with Lob and during his time he worked on several important projects, including allowing users more control over the privacy of their data, improving our webhooks platform, and upgrading critical dependencies.
