Scenario 2: Mastering Backfilling for New dB Columns – Update Like a Pro!

chidioguejiofor

Chidiebere Ogujeiofor

Posted on September 25, 2024

Scenario 2: Mastering Backfilling for New dB Columns – Update Like a Pro!

Cover Photo Credit: Pixels

Introduction

One of the most interesting challenges I’ve encountered is when adding new fields to an existing database table and needing to back-fill values for those fields. Often, back-filling involves performing some sort of aggregation or calculation—sometimes involving other tables or columns—and then updating the target table with those calculated values.

Let’s explore a concrete example using some sample data.

Problem Example

Imagine we decide to add two new fields to our events table:

  • number_of_ticket_types: This will hold the count of different ticket types associated with each event.
  • number_of_tickets_available: This will represent the total number of tickets still available for the event, calculated as the sum of ticket.quantity minus ticket.quantity_sold for each event.

These fields can be useful if, for instance, you are retrieving a list of events and need to display ticket information without requiring a user to click into the details of each event.

We’ll start by adding a migration that adds these new columns to our events table:

Fields Type Constraints
title text
owner_id text FK -> Users
id VARCHAR(30) PK
number_of_ticket_types INT default:0
number_of_tickets_available INT default:0

However, when these new fields are added, their values will be set to 0 for all existing events, even though we already have tickets in the tickets table. Now, we need to backfill those fields with the correct data.

A First Attempt at Backfilling

One way to backfill these new fields is to retrieve data from the tickets table, perform calculations, and then update the events table accordingly.

Here’s an example of how to backfill the number_of_ticket_types field:

async function backFillTicketTypes() {
  const ticketTypeResult = await TicketsModel.findAll({
    attributes: [
      "event_id",
      [sequelize.fn("count", sequelize.col("id")), "ticket_type_count"],
    ],
    group: ["Tickets.event_id"],
  });

  /*
    Result would look like this:
    ticketTypeResult = [ 
        {event_id: 'eventId1', ticket_type_count: 10},
        {event_id: 'eventId2', ticket_type_count: 4},
        {event_id: 'eventId3', ticket_type_count: 3},
    ]
  */

  for (const eventTicketTypeObject of ticketTypeResult) {
    const { event_id, ticket_type_count } = eventTicketTypeObject;
    await EventsModel.update(
      {
        ticketTypeCount: ticket_type_count,
      },
      {
        where: {
          id: event_id,
        },
      }
    );
  }
}
Enter fullscreen mode Exit fullscreen mode

The second field, number_of_tickets_available, can be backfilled similarly by calculating the total available tickets:

async function backFillNumberOfTicketsAvailable() {
  const ticketsAvailableResult = await TicketsModel.findAll({
    attributes: [
      "event_id",
      [sequelize.fn("sum", sequelize.col("quantity")), "total_tickets"],
      [
        sequelize.fn("sum", sequelize.col("quantity_sold")),
        "total_tickets_sold",
      ],
    ],
    group: ["Tickets.event_id"],
  });

  /*
    Result would look like this:
    ticketsAvailableResult = [ 
        {event_id: 'eventId1', total_tickets: 10, total_tickets_sold: 2},
        {event_id: 'eventId2', total_tickets: 23, total_tickets_sold: 12},
        {event_id: 'eventId3', total_tickets: 14, total_tickets_sold: 8},
    ]
  */

  for (const numberOfTicketAvailable of ticketsAvailableResult) {
    const { event_id, total_tickets, total_tickets_sold } =
      numberOfTicketAvailable;
    await EventsModel.update(
      {
        number_of_ticket_types: total_tickets - total_tickets_sold,
      },
      {
        where: {
          id: event_id,
        },
      }
    );
  }

  // Run update query...
}
Enter fullscreen mode Exit fullscreen mode

What's Wrong with This Approach?

While this method works, it’s quite verbose and involves multiple database queries:

  1. A query to retrieve ticket_type_count for each event.
  2. A query to retrieve total_tickets and total_tickets_sold for each event.
  3. An UPDATE query for each event that has been created in the database. This is a really expensive task if many items have been created in the database previously

In the above example the total number of dB calls is dependent on the total number of events that exists in the application.

Which would greatly slow down our server

Note: If we use the techniques from the previous article(Senerio 1), we would be able to reduce the number of UPDATE query calls to 1 per function. Which would reduce the total queries to four

A Better Solution

A more efficient way to handle this problem is to use a single SQL query to perform both updates in one go. We can achieve this using a WITH clause (common table expression, or CTE) and a single UPDATE query.

To learn more about CTEs, check out [this article(https://dev.to/chidioguejiofor/whats-with-cte-53ao)]

Here’s how we can do it:


WITH update_cte AS (
  SELECT
    event_id,
    COUNT(id) AS ticket_type_count,
    SUM(quantity) - SUM(quantity_sold) AS number_of_tickets_available
  FROM
    tickets
  GROUP BY
    event_id
)
UPDATE
  events
SET
  number_of_ticket_types = update_cte.ticket_type_count,
  number_of_tickets_available = update_cte.number_of_tickets_available
FROM
  update_cte
WHERE
  update_cte.event_id = events.id;

Enter fullscreen mode Exit fullscreen mode

Explanation

  1. The WITH update_cte clause calculates both the ticket_type_count and number_of_tickets_available in one go using COUNT and SUM functions, grouped by event_id.
  2. In the UPDATE statement, we join the events table with the CTE (update_cte) based on event_id.
  3. The fields in events (number_of_ticket_types and number_of_tickets_available) are updated with the values from the CTE.

Why Is This Better?

This approach performs both updates in a single query, eliminating the need for multiple database queries and significantly improving performance. Instead of managing two separate functions and multiple steps, we only need to maintain one query that efficiently backfills the data.

Additionally, the same concept can be applied to future backfills if similar aggregate fields are added to the events table.

Conclusion

Backfilling data after adding new fields to an existing table is a common task in application development. While the initial approach of making multiple queries works, it can be optimized significantly using SQL’s powerful aggregation and CTE features. By consolidating updates into a single query, we reduce database load, improve performance, and make the code more maintainable.

💖 💪 🙅 🚩
chidioguejiofor
Chidiebere Ogujeiofor

Posted on September 25, 2024

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related