Postgres partitioning performance: Hash vs. List

Sadeq Dousti

Posted on May 1, 2024

In our design, we came up with a PostgreSQL table that is expected to grow large and just stores the IDs of incoming (Kafka) events, for the purpose of de-duplication. The IDs are of type UUID.

create table t(id uuid primary key);
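As a side note, with this primary key the de-duplication itself can be as simple as an insert that skips already-seen IDs; a minimal sketch (the UUID literal is just a made-up example):

-- Record an incoming event ID; silently skip it if we have seen it before
insert into t (id)
values ('0e37df36-f698-11e6-8dd4-cb9ced3df976')
on conflict (id) do nothing;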

After some consideration, we decided to partition this table into 16 partitions.

Before we continue, let's make it clear that this article is about a very specific case and workload. However, it should provide you with enough insight to adapt the approach to your own needs.

Hash partitioning

The initial idea was to use hash partitioning.

drop table if exists t;

create table t(id uuid primary key)
    partition by hash(id);

create table t_00
    partition of t 
    for values with (modulus 16, remainder 0);

create table t_01
    partition of t 
    for values with (modulus 16, remainder 1);

create table t_02
    partition of t 
    for values with (modulus 16, remainder 2);

create table t_03
    partition of t 
    for values with (modulus 16, remainder 3);

create table t_04
    partition of t 
    for values with (modulus 16, remainder 4);

create table t_05
    partition of t 
    for values with (modulus 16, remainder 5);

create table t_06
    partition of t 
    for values with (modulus 16, remainder 6);

create table t_07
    partition of t 
    for values with (modulus 16, remainder 7);

create table t_08
    partition of t 
    for values with (modulus 16, remainder 8);

create table t_09
    partition of t 
    for values with (modulus 16, remainder 9);

create table t_10
    partition of t 
    for values with (modulus 16, remainder 10);

create table t_11
    partition of t 
    for values with (modulus 16, remainder 11);

create table t_12
    partition of t 
    for values with (modulus 16, remainder 12);

create table t_13
    partition of t 
    for values with (modulus 16, remainder 13);

create table t_14
    partition of t 
    for values with (modulus 16, remainder 14);

create table t_15
    partition of t 
    for values with (modulus 16, remainder 15);
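As a side note, these 16 near-identical statements can also be generated instead of written by hand; a minimal PL/pgSQL sketch equivalent to the DDL above:

-- Generate the 16 hash partitions t_00 .. t_15 in a loop
do $$
begin
    for i in 0..15 loop
        execute format(
            'create table t_%s partition of t
             for values with (modulus 16, remainder %s)',
            lpad(i::text, 2, '0'), i);
    end loop;
end;
$$;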

A competing idea was to ditch hashing and use the first character (a hex digit) of the id as the partitioning key. I'll discuss that idea in the next section, but let's first benchmark the hash partitioning approach:

time \
pgbench -c10 -t 900 -j30 -n -f - << EOF
insert into t
select gen_random_uuid()
from generate_series(1, 1000);
EOF

I chose to run 10 client connections, each sending 900 transactions, with up to 30 worker threads. Each transaction inserts 1,000 UUIDs into the table. (Note that pgbench caps the number of threads at the number of clients, which is why the output below reports 10 threads.)

Why these numbers? Just for fun. They would have to match the real traffic to be indicative of anything. But let's just see how this turns out on my personal laptop (an old 2017 model!):

pgbench (16.2 (Ubuntu 16.2-1ubuntu4))
transaction type: -
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 10
maximum number of tries: 1
number of transactions per client: 900
number of transactions actually processed: 9000/9000
number of failed transactions: 0 (0.000%)
latency average = 13.776 ms
initial connection time = 6.491 ms
tps = 725.901931 (without initial connection time)

real    0m12.438s
user    0m0.162s
sys     0m0.396s

It took 12.4 seconds to insert 9,000,000 rows (10 clients × 900 transactions × 1,000 rows). The average TPS (transactions per second) is 725.9.

Using psql metacommands, we can see the table/index sizes:

  • Using \dt+ to see table sizes (some columns are removed for brevity):
┌──────┬───────────────────┬─────────┐
│ Name │       Type        │  Size   │
├──────┼───────────────────┼─────────┤
│ t    │ partitioned table │ 0 bytes │
│ t_00 │ table             │ 24 MB   │
│ t_01 │ table             │ 24 MB   │
│ t_02 │ table             │ 24 MB   │
│ t_03 │ table             │ 24 MB   │
│ t_04 │ table             │ 24 MB   │
│ t_05 │ table             │ 24 MB   │
│ t_06 │ table             │ 24 MB   │
│ t_07 │ table             │ 24 MB   │
│ t_08 │ table             │ 24 MB   │
│ t_09 │ table             │ 24 MB   │
│ t_10 │ table             │ 24 MB   │
│ t_11 │ table             │ 24 MB   │
│ t_12 │ table             │ 24 MB   │
│ t_13 │ table             │ 24 MB   │
│ t_14 │ table             │ 24 MB   │
│ t_15 │ table             │ 24 MB   │
└──────┴───────────────────┴─────────┘
  • Using \di+ to see index sizes (some columns are removed for brevity):
┌───────────┬───────────────────┬─────────┐
│   Name    │       Type        │  Size   │
├───────────┼───────────────────┼─────────┤
│ t_pkey    │ partitioned index │ 0 bytes │
│ t_00_pkey │ index             │ 21 MB   │
│ t_01_pkey │ index             │ 21 MB   │
│ t_02_pkey │ index             │ 22 MB   │
│ t_03_pkey │ index             │ 20 MB   │
│ t_04_pkey │ index             │ 21 MB   │
│ t_05_pkey │ index             │ 21 MB   │
│ t_06_pkey │ index             │ 21 MB   │
│ t_07_pkey │ index             │ 20 MB   │
│ t_08_pkey │ index             │ 20 MB   │
│ t_09_pkey │ index             │ 21 MB   │
│ t_10_pkey │ index             │ 21 MB   │
│ t_11_pkey │ index             │ 21 MB   │
│ t_12_pkey │ index             │ 21 MB   │
│ t_13_pkey │ index             │ 21 MB   │
│ t_14_pkey │ index             │ 21 MB   │
│ t_15_pkey │ index             │ 21 MB   │
└───────────┴───────────────────┴─────────┘

Notice that the indexes are almost as large as the tables themselves. Also, while the data is evenly distributed among the partitions (24 MB each), the index sizes range from 20 to 22 MB. The total index size is 334 MB.
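Rather than summing the \di+ output by hand, the totals can also be computed from the catalogs; a minimal sketch, assuming the t_* naming used above:

-- Total on-disk size of all indexes on the t_* partitions
select pg_size_pretty(sum(pg_relation_size(i.indexrelid))) as total_index_size
from pg_index i
join pg_class c on c.oid = i.indrelid
where c.relname like 't\_%';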

List partitioning

If we want to use the first character of the id as the partitioning key, the primary key cannot be defined on the parent table:

create table t(id uuid primary key)
partition by list (lower(left(id::text, 1)));

results in an error:

ERROR:  unsupported PRIMARY KEY constraint with partition key definition
DETAIL:  PRIMARY KEY constraints cannot be used when partition keys include expressions.

This is because a unique constraint on a partitioned table must include all partition key columns, and expressions cannot be part of a primary key. So, we decided to add the primary key to each individual partition instead. This still enforces uniqueness across all data, since rows with the same id always land in the same partition (see the sanity check after the definitions below):

drop table if exists t;

create table t(id uuid not null)
partition by list (lower(left(id::text, 1)));

create table t_00
partition of t
(id primary key)
for values in ('0');

create table t_01
partition of t
(id primary key) 
for values in ('1');

create table t_02
partition of t
(id primary key) 
for values in ('2');

create table t_03
partition of t
(id primary key) 
for values in ('3');

create table t_04
partition of t
(id primary key) 
for values in ('4');

create table t_05
partition of t
(id primary key) 
for values in ('5');

create table t_06
partition of t
(id primary key) 
for values in ('6');

create table t_07
partition of t
(id primary key) 
for values in ('7');

create table t_08
partition of t
(id primary key) 
for values in ('8');

create table t_09
partition of t
(id primary key) 
for values in ('9');

create table t_10
partition of t
(id primary key) 
for values in ('a');

create table t_11
partition of t
(id primary key) 
for values in ('b');

create table t_12
partition of t
(id primary key) 
for values in ('c');

create table t_13
partition of t
(id primary key) 
for values in ('d');

create table t_14
partition of t
(id primary key) 
for values in ('e');

create table t_15
partition of t
(id primary key) 
for values in ('f');
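Since a duplicate id always routes to the same partition, the per-partition primary keys are enough for global uniqueness. A quick sanity check, using a made-up UUID that starts with 'a' and thus lands in t_10:

-- Both inserts route to partition t_10, so the second one fails
-- with: duplicate key value violates unique constraint "t_10_pkey"
insert into t values ('aaaaaaaa-0000-0000-0000-000000000000');
insert into t values ('aaaaaaaa-0000-0000-0000-000000000000');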

Now let's benchmark again:

time \
pgbench -c10 -t 900 -j30 -n -f - << EOF
insert into t
select gen_random_uuid()
from generate_series(1, 1000);
EOF

result:

pgbench (16.2 (Ubuntu 16.2-1ubuntu4))
transaction type: -
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 10
maximum number of tries: 1
number of transactions per client: 900
number of transactions actually processed: 9000/9000
number of failed transactions: 0 (0.000%)
latency average = 15.123 ms
initial connection time = 8.810 ms
tps = 661.264382 (without initial connection time)

real    0m13.654s
user    0m0.150s
sys     0m0.409s

This is slower than the hash-partitioned setup:

  • Duration: 13.654s instead of 12.438s
  • TPS: 661.264382 instead of 725.901931

So, we lose the primary key on the parent table, and it's even slower! Hash partitioning is a clear winner here.

Using \dt+ and \di+ yields almost identical sizes as before, so I won't repeat them here.

Using hash indexes instead of b-tree indexes

Another suggested approach is to enforce uniqueness using hash indexes instead of b-tree indexes. The benefit is that they are often smaller and faster than b-tree indexes when equality is the only operation that matters.

Postgres primary keys do not yet support hash indexes, but we can approximate one with a hack: an exclusion constraint over the equality operator, backed by a hash index:

drop table if exists t;

create table t(id uuid not null)
partition by list (left(id::text, 1));

create table t_00
partition of t
(exclude using hash(id with =))
for values in ('0');

create table t_01
partition of t
(exclude using hash(id with =)) 
for values in ('1');

create table t_02
partition of t
(exclude using hash(id with =)) 
for values in ('2');

create table t_03
partition of t
(exclude using hash(id with =)) 
for values in ('3');

create table t_04
partition of t
(exclude using hash(id with =)) 
for values in ('4');

create table t_05
partition of t
(exclude using hash(id with =)) 
for values in ('5');

create table t_06
partition of t
(exclude using hash(id with =)) 
for values in ('6');

create table t_07
partition of t
(exclude using hash(id with =)) 
for values in ('7');

create table t_08
partition of t
(exclude using hash(id with =)) 
for values in ('8');

create table t_09
partition of t
(exclude using hash(id with =)) 
for values in ('9');

create table t_10
partition of t
(exclude using hash(id with =)) 
for values in ('a');

create table t_11
partition of t
(exclude using hash(id with =)) 
for values in ('b');

create table t_12
partition of t
(exclude using hash(id with =)) 
for values in ('c');

create table t_13
partition of t
(exclude using hash(id with =)) 
for values in ('d');

create table t_14
partition of t
(exclude using hash(id with =)) 
for values in ('e');

create table t_15
partition of t
(exclude using hash(id with =)) 
for values in ('f');
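Before benchmarking, we can double-check that the exclusion constraints are really backed by hash indexes, e.g. by querying the pg_indexes view; the definitions should read "USING hash":

-- Inspect the index backing the exclusion constraint on one partition
select indexname, indexdef
from pg_indexes
where tablename = 't_00';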

Let's benchmark this as well:

time \
pgbench -c10 -t 900 -j30 -n -f - << EOF
insert into t
select gen_random_uuid()
from generate_series(1, 1000);
EOF

result:

pgbench (16.2 (Ubuntu 16.2-1ubuntu4))
transaction type: -
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 10
maximum number of tries: 1
number of transactions per client: 900
number of transactions actually processed: 9000/9000
number of failed transactions: 0 (0.000%)
latency average = 16.686 ms
initial connection time = 7.089 ms
tps = 599.314265 (without initial connection time)

real    0m15.067s
user    0m0.127s
sys     0m0.468s

Well, I didn't expect that. It's even slower now. Looking at the table sizes (\dt+), they are the same as before (24 MB each).

However, the index sizes (\di+) are a tiny bit smaller:

┌──────────────┬───────┐
│     Name     │ Size  │
├──────────────┼───────┤
│ t_00_id_excl │ 20 MB │
│ t_01_id_excl │ 20 MB │
│ t_02_id_excl │ 20 MB │
│ t_03_id_excl │ 20 MB │
│ t_04_id_excl │ 20 MB │
│ t_05_id_excl │ 20 MB │
│ t_06_id_excl │ 20 MB │
│ t_07_id_excl │ 20 MB │
│ t_08_id_excl │ 20 MB │
│ t_09_id_excl │ 20 MB │
│ t_10_id_excl │ 20 MB │
│ t_11_id_excl │ 20 MB │
│ t_12_id_excl │ 20 MB │
│ t_13_id_excl │ 20 MB │
│ t_14_id_excl │ 20 MB │
│ t_15_id_excl │ 20 MB │
└──────────────┴───────┘

So, in total, the index size was reduced from 334 MB to 320 MB.

Summary

  • Hash partitioning outperforms list partitioning in the above example (note: this is not always the case)
  • Hash partitioning has the added benefit that all tables have primary keys (again, specific to the above example). This is important when using logical replication. For instance, AWS RDS blue/green deployments require it:

    Make sure that all tables in the DB instance have a primary key. PostgreSQL logical replication doesn't allow UPDATE or DELETE operations on tables that don't have a primary key.

  • Using hash indexes instead of b-tree indexes was not a performance boost, but it reduced the total index size by less than 5%.

Edits

Edit (2024-05-03)

A colleague of mine, who asked to remain anonymous, explained why list partitioning was slower:

In your example, to compute the partition key in the list-based approach, you use a cast (UUID to text), and then two functions are applied (LEFT and LOWER). The functions should be pretty quick, but the cast is slow. That's why the combined effect is slower than the hash function, which is implemented in C and is quite fast.
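To see this effect in isolation, here is a rough sketch (not a rigorous benchmark): time the list-partitioning expression against Postgres' built-in uuid_hash function over the same number of generated UUIDs. (Hash partitioning actually uses an extended variant of this function with a seed, but uuid_hash illustrates the point.)

\timing on

-- Cost of the list-partitioning expression: a uuid-to-text cast plus two functions
select count(distinct lower(left(id::text, 1)))
from (select gen_random_uuid() as id from generate_series(1, 1000000)) s;

-- Cost of hashing the same values with the built-in UUID hash function
select count(distinct uuid_hash(id) % 16)
from (select gen_random_uuid() as id from generate_series(1, 1000000)) s;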

Another colleague, Tim, gave a nice summary:

So, if I got it right, in essence it says “Don’t try to be fancy, just do it in a boring way and PostgreSQL will deal with it in an optimized way.”

Which reminds me of a story from when we implemented a variant of the strlen() function and observed it was slower than the GLIBC implementation by a factor of 300! I should write a post on that too 🙂

Further reading

For more info on hash indexes, take a look at this Twitter thread and the links therein.
