Beware The Missing Foreign Key Index: A Postgres Performance Gotcha

Josh Branchaud

Posted on May 27, 2021

I got absolutely thrashed by some PostgreSQL performance issues this past week. It was painful. And, it was incredibly instructive. I didn't just solve the problem, I now better understand Postgres. I'm better equipped to diagnose the next performance skirmish that comes my way. And I'm paying the lesson forward with this article.

Let me give you an idea of how bad it was. I had a query deleting 50k records from one table in ~100ms. That's fast enough for me. I then had a similar query deleting the same number of records from another table that would run for ~30 minutes. Roughly 10,000x slower! 😱 That's a nope.

After diagnosing and addressing the issue, I got that second delete query down to about 1 second. Much better! Toward the end of the post we'll see how to make it even faster than that.

But before I reconstruct a minimal example that reproduces the problem and get into the details, here is the...

tl;dr

Foreign keys are essential for enforcing the shape and integrity of our data. Indexes are there to keep queries fast. We can combine the two to start getting the most out of our database.

I don't tend to think of foreign key constraints as impacting performance. As we are about to see, they do.

Without that index, we're bound to run into some really sneaky performance gotchas.

Let's dig in.

A Minimal Example

This minimal example is also a real-world example. Most software systems have users and those users need to be given a variety of roles. These roles help the system determine the access and permissions of each user.

If you want to skip over the example setup, you can jump right to the details.

The Data Model

(Diagram: the users, roles, and users_roles tables and the relationships between them.)

First is the users table. It doesn't feature strongly in the example, so I'll leave it at the id column.

Then we have the roles table which can contain global roles like admin and support as well as roles tied to specific resources like team_owner and team_collaborator.

Creating a relationship between the two is the users_roles join table. A record in this table tying my user record to the admin role would tell the system that I am an admin. Another record in this table tying my user record to a team_collaborator role for the Red team would tell you that in addition to being an admin, I am a collaborator of the Red team.
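
Once the schema below is in place, answering "what roles does this user have?" is a join across those three tables. Here's a hypothetical lookup, just to make the shape of the data concrete:

-- all roles held by user 1, global and resource-scoped
select r.name, r.resource_type, r.resource_id
from roles r
join users_roles ur on ur.role_id = r.id
where ur.user_id = 1;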

Building The Schema

Here is some SQL that will generate our schema for reproducing the performance issue. The specifics of how this works are out of scope for this post. If that interests you, let me know and I'll write a follow-up.

The full queries are embedded in these code blocks, but you can also check out the code in this repo.



create table users (
  id bigint generated by default as identity primary key
);
create table roles (
  id bigint generated by default as identity primary key,
  name varchar not null,
  resource_type varchar,
  resource_id bigint
);
create table users_roles (
  user_id bigint references users,
  role_id bigint references roles
);



I put this in a file called schema.sql. From a psql session I connect to an empty database where I can safely experiment. I then tell psql to execute the script, which will create these three tables and their sequences.



> \i schema.sql
CREATE TABLE
Time: 2.605 ms
CREATE TABLE
Time: 2.615 ms
CREATE TABLE
Time: 1.141 ms
Time: 0.099 ms
> \d
               List of relations
 Schema |     Name     |   Type   |   Owner
--------+--------------+----------+------------
 public | roles        | table    | jbranchaud
 public | roles_id_seq | sequence | jbranchaud
 public | users        | table    | jbranchaud
 public | users_id_seq | sequence | jbranchaud
 public | users_roles  | table    | jbranchaud
(5 rows)



Generating Some Data

Let's fill these tables with some data with help from the generate_series function. Again, I won't go into specifics here. Drop a note if you're interested in a follow up post on this part.



-- create 50,000 users
insert into users
select from generate_series(1,50000);

-- create 100,000 roles
insert into roles (
  name,
  resource_type,
  resource_id
)
select
  (array['admin', 'support', 'member'])[floor(random() * 3 + 1)],
  null,
  null
from generate_series(1,50000);

insert into roles (
  name,
  resource_type,
  resource_id
)
select
  'team_collaborator',
  'Team',
  floor(random() * 1000 + 1)
from generate_series(1,50000);

-- create 500,000 connections between users and roles
-- start with 225,000 random global roles
insert into users_roles (
  user_id,
  role_id
)
select
  floor(random() * 50000 + 1),
  floor(random() * 50000 + 1)
from generate_series(1,225000);

-- then 50,000 for the team collaborator role
insert into users_roles (
  user_id,
  role_id
)
select
  floor(random() * 50000 + 1),
  floor(50000 + g.id)
from generate_series(1,50000) as g(id);

-- then another 225,000 random global roles
insert into users_roles (
  user_id,
  role_id
)
select
  floor(random() * 50000 + 1),
  floor(random() * 50000 + 1)
from generate_series(1,225000);



This creates 50,000 users and 100,000 roles. It then creates 500,000 connections between them in the users_roles join table, with 50,000 of those connections specifically for the team_collaborator role.
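
If you're following along at home, a few quick counts will confirm the data landed as expected:

select count(*) from users;       -- 50000
select count(*) from roles;       -- 100000
select count(*) from users_roles; -- 500000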

With all the data in place, let's find some specific data and delete it. This is where we'll see a reasonable query and a drastically slower query.

Fast Query, Slow Query

In the actual production data set I was dealing with, I had a bunch of data that I needed to clear out of the roles table. It turned out to be about 50k records. Because of the foreign key relationship, I first had to clear dependent records out of the users_roles table.

The following might feel a bit contrived, but it is based on that real scenario.

Here are the 50k roles that we are targeting for deletion.



> select count(*)
  from roles
  join users_roles
    on roles.id = users_roles.role_id
  where roles.name = 'team_collaborator';

 count
-------
 50000
(1 row)



Why are we going to be deleting all the team_collaborator related data? Because that gives us about 50k rows. Like I said, a bit contrived. Nevertheless, we are going to learn some performance things from this.

The Quick Delete

Let's delete the dependent records from users_roles first.

Anytime I'm about to modify or delete data, I like to open a transaction. This makes it easy to safely roll back the changes if anything looks off. That is done with a begin.



> begin;

> delete from users_roles
  using roles
  where users_roles.role_id = roles.id
    and roles.name = 'team_collaborator';

DELETE 50000
Time: 116.313 ms



That looks good. That query appears to have handled the 50k records from users_roles that I wanted deleted. Because I'm in a transaction, I can always dig into the data with a few select queries to be sure. And it was fast enough, clocking in at ~100ms in this instance.
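
For instance, re-running the earlier count from inside the transaction is a cheap sanity check; it should now come back as zero.

> select count(*)
  from roles
  join users_roles
    on roles.id = users_roles.role_id
  where roles.name = 'team_collaborator';

 count
-------
     0
(1 row)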

I'm not ready to commit this transaction yet. Next I want to attempt to delete the roles records.

The Slow Delete

With all the foreign key dependencies taken care of, we are clear to get on with our main goal, deleting the roles records.

This query is even simpler than the previous. Typing it out, I was certain it would run just as quickly.



> delete from roles
  where roles.name = 'team_collaborator';

-- ... keep waiting ...



For real though, if you are following along at home, don't hold your breath. This is going to take more or less 30 minutes depending on the specs of your machine.

What gives?

Can We Explain This?

This query isn't all that different than the previous one. It is targeting the same number of rows. What could account for the slow down?

Let's start by taking a look at the explain.



> explain delete from roles
  where roles.name = 'team_collaborator';

                            QUERY PLAN
------------------------------------------------------------------
 Delete on roles  (cost=0.00..1937.00 rows=49990 width=6)
   ->  Seq Scan on roles  (cost=0.00..1937.00 rows=49990 width=6)
         Filter: ((name)::text = 'team_collaborator'::text)
(3 rows)



There isn't a lot going on in this one. The seq scan stands out to me. This means the query planner expects the query will have to look at every row sequentially. For a mere ~50k rows that doesn't seem like it should be an issue.

By contrast, here is the explain for the previous query.



> explain delete from users_roles
  using roles
  where users_roles.role_id = roles.id
    and roles.name = 'team_collaborator';

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Delete on users_roles  (cost=2561.88..11577.43 rows=249950 width=12)
   ->  Hash Join  (cost=2561.88..11577.43 rows=249950 width=12)
         Hash Cond: (users_roles.role_id = roles.id)
         ->  Seq Scan on users_roles  (cost=0.00..7703.00 rows=500000 width=14)
         ->  Hash  (cost=1937.00..1937.00 rows=49990 width=14)
               ->  Seq Scan on roles  (cost=0.00..1937.00 rows=49990 width=14)
                     Filter: ((name)::text = 'team_collaborator'::text)
(7 rows)



This one is more involved. I'm noticing that there is a seq scan of users_roles (500k rows) and nested under that is another seq scan of roles (~50k). In terms of predicted work, this one has a ton more to do, yet it finished in 100ms. Computers can be fast.

Despite that comparison, we know from running the queries that the simpler plan, the delete from roles, is going to take much longer than the more involved one. Some back-of-the-napkin math says it is looking at roughly 10,000x the runtime of the users_roles delete 😱.

So that's the second scream emoji in this post—we've got to figure out what is going on.

What's the deal with this "simple" delete query?

The Mystery Uncovered



> delete from roles
  where roles.name = 'team_collaborator';



The issue with this two-line, 50k row query doesn't have anything to do with the roles table itself. Instead, it has to do with the relationship that the roles table has to other tables.

The users_roles table depends on the roles table—specifically, on its id column. This dependency on roles.id is through the foreign key constraint on its own role_id column.
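
You can see that constraint with psql's \d command (output trimmed to the relevant part). Note the constraint name; we'll reference it again shortly.

> \d users_roles
Foreign-key constraints:
    "users_roles_role_id_fkey" FOREIGN KEY (role_id) REFERENCES roles(id)
    "users_roles_user_id_fkey" FOREIGN KEY (user_id) REFERENCES users(id)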

The foreign key is causing this

Kinda.

To demonstrate that the foreign key constraint is at the center of this performance issue, let's start by removing the foreign key constraint.



> begin;

> delete from users_roles
  using roles
  where users_roles.role_id = roles.id
    and roles.name = 'team_collaborator';

DELETE 50000
Time: 120.267 ms

> alter table users_roles
  drop constraint users_roles_role_id_fkey;

ALTER TABLE
Time: 0.561 ms

> delete from roles
  where roles.name = 'team_collaborator';

DELETE 50000
Time: 54.115 ms

> rollback;



With the foreign key constraint out of the picture, the delete is quite speedy. Notice I immediately rolled these changes back. This was to highlight the impact of the constraint. This isn't the solution.

Keep the Foreign Key Constraint

Foreign key constraints are essential to solid schema design, and they are one of my favorite features of Postgres. They help ensure that the data relationships between two tables are always intact (that's called referential integrity). Because of this assurance, I never have to worry that data is going to be inadvertently orphaned.
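
For instance, with the constraint in place, trying to point a users_roles row at a role that doesn't exist fails immediately (999999 is just an id that isn't in our roles table):

> insert into users_roles (user_id, role_id)
  values (1, 999999);

ERROR:  insert or update on table "users_roles" violates foreign key constraint "users_roles_role_id_fkey"
DETAIL:  Key (role_id)=(999999) is not present in table "roles".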

How is the foreign key constraint having such an impact on performance?

For every single row you tell Postgres to delete from the roles table, it is going to first check with all dependent tables to make sure it is okay to do that. If the foreign key value isn't being used by any records in the associated table(s), then the delete can proceed. If even just one record depends on that value, the foreign key constraint is going to flag that and abort the delete action.

So, even though we preemptively deleted all the related data from users_roles, Postgres still must check, at the moment of the delete, whether that is still true.

Without an index, that check means a sequential scan of the users_roles table, examining every single record. And it happens once per deleted row: for each of the 50,000 roles records, Postgres checks against 500,000 users_roles records.
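
In spirit, the check Postgres performs for each deleted roles row boils down to a lookup like the following. This is a simplified sketch of what the internal referential integrity trigger does (the real check also takes a FOR KEY SHARE lock), not something you need to run yourself:

-- run once per roles row being deleted; with no index on role_id,
-- each of these is a full scan of the 500,000-row users_roles table
select 1
from users_roles
where role_id = 50001 -- the id of the roles row being deleted
limit 1;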

And there is the root of the problem.

The Query Planner Needs More Info

Postgres is going to use the information at its disposal to verify the foreign key constraint. As things stand, it doesn't have enough info to do anything better than look at every single row. If Postgres had an index, though, it would have the info necessary to do a lot less work.

A database index is roughly like the organizational system that a library uses. Books are categorized and then they have a placement relative to other books in their same category. Using this system, you can quickly navigate to where a book should be.

Not having an index on users_roles.role_id is kinda like walking into a library without an organizational system: you have to start at one end and go book by book, perhaps through tens of thousands of books, until you find what you're looking for.

We can cut our delete query down to around 1 second by adding an index to users_roles.role_id. The index provides Postgres with the info it needs to quickly shortcut its way to any dependent rows.



create index role_id_index
  on users_roles (role_id);

CREATE INDEX
Time: 342.591 ms



This adds an index to the users_roles.role_id column. It is a B-Tree index. "The PostgreSQL query planner will consider using a B-tree index whenever an indexed column is involved in a comparison" (with operators like =, <, >, etc.).

With this index, we'll tend to get faster queries, especially with this delete. One tradeoff is that the index will take a little disk space. Since Postgres automatically updates the index on each write to the indexed column, that adds a teeny tiny bit of write time. Both of these tradeoffs are well worth it in the vast majority of cases.
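
If you want to confirm the planner will actually reach for the new index, a quick explain of a role_id lookup should now show an index scan instead of a seq scan (the exact plan shape and estimates will vary with your data and settings):

> explain select *
  from users_roles
  where role_id = 50001;

-- expect something along the lines of an Index Scan (or Bitmap Index Scan)
-- using role_id_index, with an Index Cond on (role_id = 50001)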

Let's see how the index performs.

Deleting Faster

With that index in place, let's give both of our delete queries another try.



> begin;

> delete from users_roles
  using roles
  where users_roles.role_id = roles.id
    and roles.name = 'team_collaborator';

DELETE 50000
Time: 111.732 ms

> delete from roles
  where roles.name = 'team_collaborator';

DELETE 50000
Time: 590.668 ms



In the blink of an eye it is done. The second delete query went from around 30 minutes to under 600ms. That's a massive difference. That's what indexes can do for us.

That brings me back to the tl;dr of this post.

Foreign keys are essential for enforcing the shape and integrity of our data. Indexes are there to keep queries fast. We can combine the two to start getting the most out of our database.

Without that index, you're bound to run into some really sneaky performance gotchas.

But can we go even faster?!

Deleting Even Faster

Since we are in a transaction and we are doing a ton of deletes, what if we tell Postgres to hold off on verifying the foreign key constraint until we're done with the deletes?



> begin;

> alter table users_roles
  alter constraint users_roles_role_id_fkey
    initially deferred;

ALTER TABLE
Time: 0.455 ms

> delete from users_roles
  using roles
  where users_roles.role_id = roles.id
    and roles.name = 'team_collaborator';

DELETE 50000
Time: 114.574 ms

> delete from roles
  where roles.name = 'team_collaborator';

DELETE 50000
Time: 68.433 ms



Whoa, that made it another 10x faster, coming in at ~68ms.

Instead of checking the foreign key constraint as it deletes each of the 50,000 rows, even with the index, Postgres can defer that work and check the constraint once at the end of the transaction. Zoom!
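
A related option: if the constraint is deferrable, you can defer the check for a single transaction with set constraints instead of changing the constraint's default timing for everyone. A sketch, assuming the constraint has been made DEFERRABLE:

begin;

-- defer the foreign key check to commit time, for this transaction only
set constraints users_roles_role_id_fkey deferred;

delete from users_roles
using roles
where users_roles.role_id = roles.id
  and roles.name = 'team_collaborator';

delete from roles
where roles.name = 'team_collaborator';

commit; -- the foreign key is verified here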

Wrapping Up

After running into this incredibly slow delete query, I went on quite a journey trying to understand what was causing the performance hit. There is a lot to be learned by being willing to really dig into a hard problem and try to understand, even if you can only understand it a little bit at a time. I came out the other end better for it. And I hope you are at the other end of this article better for having read it.

Foreign keys help ensure the integrity of your data. Indexes help keep your queries fast. Combine the two to really take advantage of the speed and power of PostgreSQL.

If you enjoy my writing, consider joining my newsletter or following me on twitter.


Notes

  1. An improvement to this schema would be to add an on delete cascade directive to the role_id foreign key constraint. This would mean we could go right to deleting the roles record and the corresponding users_roles records would be automatically removed. This would still suffer from the same performance issues.

  2. Postgres allows you to add indexes concurrently to avoid locking writes to the entire table. Though adding an index is a pretty quick action, it is a good rule of thumb to add them concurrently. This is especially true when dealing with live tables in a production environment. I chose not to add our index concurrently in this article to avoid tossing one more concept into the mix.
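
For the curious, here are rough sketches of what those two notes could look like in SQL. The cascading foreign key is an alternative way to define users_roles (an index on role_id is still what keeps it fast), and the concurrent build is the same index from earlier, created without locking out writes:

-- note 1: a cascading foreign key, so deleting a roles row
-- automatically removes its users_roles rows
create table users_roles (
  user_id bigint references users,
  role_id bigint references roles on delete cascade
);

-- note 2: build the index without blocking writes to users_roles
-- (create index concurrently cannot run inside a transaction block)
create index concurrently role_id_index
  on users_roles (role_id);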

Acknowledgements

A big thanks to several people who read through and provided feedback on earlier drafts of this post. They are awesome. Check out their stuff.


Cover photo by Ryan Johnston on Unsplash
