PostgreSQL Semi Join, Unique-ify RHS, Materialized CTEs, and Rows Estimation
Franck Pachot
Posted on October 6, 2024
As a developer, I prefer using a WITH clause and declaring Common Table Expressions (CTE) instead of nesting subqueries for two reasons:
- It helps structure the intermediate steps of a complex query, making it more readable and easier to troubleshoot.
- It avoids scanning the same table multiple times, without having to create a temporary table.
Since PostgreSQL 12, the query planner has been able to inline the CTE (Common Table Expression) query block into the main query block, allowing for more optimization opportunities. This behavior can be controlled with the MATERIALIZED or NOT MATERIALIZED clauses, which are non-standard SQL additions used to achieve what other SQL databases do with optimizer hints. Materialization is sometimes chosen by default to avoid executing the query block multiple times.
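For reference, the clause goes right after AS in the CTE definition. Here is a minimal sketch using the pgbench tables from the example below (the predicate is just a placeholder):
with eligible as not materialized (  -- or: as materialized
  select aid from pgbench_accounts
  where abalance between 0 and 100
)
select count(*) from eligible;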
Materialized Common Table Expressions (CTEs) in PostgreSQL do not have column optimizer statistics. As a result, the query planner uses default values to estimate the number of distinct rows. Furthermore, a semi-join may be transformed into a regular join by applying a DISTINCT operator to the inner table. This combination can lead to inaccuracies in estimating cardinalities because it is impossible to estimate distinct values without column statistics.
In PostgreSQL 17, a new feature exposes the column statistics of the CTE so that the main block gets better estimations. However, writing the SQL query with subqueries rather than a CTE may allow better optimization. Here is a full example reproducible with PgBench on any PostgreSQL-compatible database.
I ran an example on YugabyteDB 2.23 (which is PG11 compatible), started in a Docker container:
docker run --rm -it yugabytedb/yugabyte:2.23.0.0-b710 bash -c '
yugabyted start --enable_pg_parity_tech_preview
yum install -y postgresql-contrib # install psql and pgbench
PGHOST=$(hostname) PGPORT=5433 PGUSER=yugabyte psql
'
I create one million PgBench accounts and run a PgBench simple-update workload on a small set.
-- create one million accounts
\! PGOPTIONS="-c client_min_messages=error " pgbench -is 10
-- run one thousand transactions to leave many accounts inactive
\! pgbench -t 999 -nN
analyze;
I have numerous rows in the "pgbench_accounts" table and only a few in the "pgbench_history" table. I execute a query that analyzes accounts with a non-negative low balance (rows from pgbench_accounts where abalance is between 0 and 100) and returns both the count of such accounts and the range of their activity times (min(mtime) and max(mtime) from pgbench_history).
Here is my query. I'm using a Common Table Expression to find the eligible account identifiers, then reading it to get the count and querying the history.
with eligible as (
-- eligible accounts: positive low balance
select aid from pgbench_accounts
where abalance between 0 and 100
)
select
(select count(*) from eligible), min(mtime), max(mtime)
from pgbench_history
where aid in ( select aid from eligible )
;
This takes one second:
count | min | max
--------+----------------------------+----------------------------
999011 | 2024-10-05 14:52:12.086121 | 2024-10-05 14:53:23.088943
(1 row)
Time: 1394.241 ms (00:01.394)
Here is the explain analyze output:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1396692.21..1396692.22 rows=1 width=24) (actual time=1607.365..1607.365 rows=1 loops=1)
CTE eligible
-> Seq Scan on pgbench_accounts (cost=180.00..1349601.05 rows=999138 width=4) (actual time=1.482..209.542 rows=999011 loops=1)
Storage Filter: ((abalance >= 0) AND (abalance <= 100))
InitPlan 2 (returns $1)
-> Aggregate (cost=22480.61..22480.62 rows=1 width=8) (actual time=190.502..190.502 rows=1 loops=1)
-> CTE Scan on eligible eligible_1 (cost=0.00..19982.76 rows=999138 width=0) (actual time=0.014..91.441 rows=999011 loops=1)
-> Hash Join (cost=22665.11..24608.04 rows=500 width=8) (actual time=1336.955..1416.830 rows=10 loops=1)
Hash Cond: (pgbench_history.aid = eligible.aid)
-> Seq Scan on pgbench_history (cost=180.00..2114.75 rows=999 width=12) (actual time=0.688..1.381 rows=999 loops=1)
-> Hash (cost=22482.61..22482.61 rows=200 width=4) (actual time=1320.091..1320.091 rows=999011 loops=1)
Buckets: 131072 (originally 1024) Batches: 16 (originally 1) Memory Usage: 3224kB
-> HashAggregate (cost=22480.61..22482.61 rows=200 width=4) (actual time=914.336..1128.420 rows=999011 loops=1)
Group Key: eligible.aid
-> CTE Scan on eligible (cost=0.00..19982.76 rows=999138 width=4) (actual time=1.486..516.342 rows=999011 loops=1)
Planning Time: 0.169 ms
Execution Time: 1609.045 ms
Peak Memory Usage: 136403 kB
Out of 1607.365 ms, 209.542 ms was spent scanning the eligible accounts, and 1128.420 ms was spent in the HashAggregate on them.
Semi Join transformed to Join with Group Key
The HashAggregate on a Group Key results from two query planner transformations.
The first transformation is straightforward: it converts IN (SELECT ...) into a Semi-Join, which is expressed in SQL with IN(), =ANY(), or EXISTS. A Semi-Join is similar to a Join, but it stops the inner loop as soon as one row matches, because it only needs to know that a matching row exists, not find all of them.
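For illustration, these three forms express the same filter and are all candidates for a Semi-Join (a sketch on the same pgbench tables):
-- IN (subquery)
select min(mtime), max(mtime) from pgbench_history
 where aid in (select aid from pgbench_accounts where abalance between 0 and 100);
-- = ANY (subquery)
select min(mtime), max(mtime) from pgbench_history
 where aid = any (select aid from pgbench_accounts where abalance between 0 and 100);
-- EXISTS (correlated subquery)
select min(mtime), max(mtime) from pgbench_history h
 where exists (select 1 from pgbench_accounts a
                where a.aid = h.aid and a.abalance between 0 and 100);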
The second transformation replaces the Semi-Join with a regular Join by removing duplicate rows from the inner table with the equivalent of a DISTINCT. This transformation is referred to as unique-ify the RHS (Right-Hand Side) in the PostgreSQL code. It is visible in the execution plan:
-> HashAggregate (cost=22480.61..22482.61 rows=200 width=4) (actual time=914.336..1128.420 rows=999011 loops=1)
Group Key: eligible.aid
-> CTE Scan on eligible (cost=0.00..19982.76 rows=999138 width=4) (actual time=1.486..516.342 rows=999011 loops=1)
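Conceptually, the unique-ified plan behaves like this hand-written rewrite, where the inner table is de-duplicated so that a regular join cannot produce duplicate matches (a sketch, not what the planner literally executes):
select min(mtime), max(mtime)
from pgbench_history
join (
  select distinct aid from pgbench_accounts
  where abalance between 0 and 100
) eligible using (aid);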
The efficiency of this transformation depends on the number of unique keys in the inner table. However, because that table comes from a materialized CTE, the query planner has no column statistics for it and falls back to the default estimate of 200 distinct values (rows=200 in the plan), a large underestimation here.
This underestimation also led to the wrong join order: the hash table should be built from the smaller of the two join inputs, but in this case the larger one was used because it was estimated to have fewer rows.
Not Materialized CTE (PostgreSQL 12)
PostgreSQL 12 added the possibility to force the inlining of a CTE into a subquery by extending the WITH syntax with not materialized:
postgres=# select version();
version
---------------------------------------------------------------------------------------------------------------------------
PostgreSQL 16.4 (Debian 16.4-1.pgdg120+2) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
(1 row)
postgres=# explain (analyze)
with eligible as NOT MATERIALIZED (
-- eligible accounts: positive low balance
select aid from pgbench_accounts
where abalance between 0 and 100
)
select
(select count(*) from eligible), min(mtime), max(mtime)
from pgbench_history
where aid in ( select aid from eligible )
;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=32433.82..32433.83 rows=1 width=24) (actual time=83.037..84.468 rows=1 loops=1)
InitPlan 1 (returns $1)
-> Finalize Aggregate (cost=24700.78..24700.79 rows=1 width=8) (actual time=73.612..75.041 rows=1 loops=1)
-> Gather (cost=24700.56..24700.77 rows=2 width=8) (actual time=73.488..75.035 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=23700.56..23700.57 rows=1 width=8) (actual time=69.476..69.477 rows=1 loops=3)
-> Parallel Seq Scan on pgbench_accounts pgbench_accounts_1 (cost=0.00..22660.00 rows=416226 width=0) (actual time=0.046..51.041 rows=333004 loops=3)
Filter: ((abalance >= 0) AND (abalance <= 100))
Rows Removed by Filter: 329
-> Nested Loop (cost=0.42..7728.04 rows=998 width=8) (actual time=0.365..9.414 rows=11 loops=1)
-> Seq Scan on pgbench_history (cost=0.00..16.99 rows=999 width=12) (actual time=0.008..0.101 rows=999 loops=1)
-> Index Scan using pgbench_accounts_pkey on pgbench_accounts (cost=0.42..7.72 rows=1 width=4) (actual time=0.009..0.009 rows=0 loops=999)
Index Cond: (aid = pgbench_history.aid)
Filter: ((abalance >= 0) AND (abalance <= 100))
Rows Removed by Filter: 1
Planning Time: 0.204 ms
Execution Time: 84.505 ms
(18 rows)
This allows for better estimation of cardinalities based on column statistics. Additionally, by inlining the common table expression (CTE), PostgreSQL 16 can push down the join condition and use a Nested Loop Join.
Without materialization, the CTE scans the table multiple times, but each scan can be optimized individually: a Parallel Seq Scan for the count, and an Index Scan with a Nested Loop for the IN() subquery.
Hash Semi Join (PostgreSQL 17)
I tested the same query in the recently released PostgreSQL 17, keeping the CTE materialized (the default when it is referenced more than once), and it shows a Hash Semi Join:
postgres=# select version();
version
---------------------------------------------------------------------------------------------------------------------------
PostgreSQL 17.0 (Debian 17.0-1.pgdg120+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
(1 row)
postgres=# explain analyze
with eligible as materialized (
-- eligible accounts: positive low balance
select aid from pgbench_accounts
where abalance between 0 and 100
)
select
(select count(*) from eligible), min(mtime), max(mtime)
from pgbench_history
where aid in ( select aid from eligible )
;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=94230.95..94230.96 rows=1 width=24) (actual time=643.668..643.672 rows=1 loops=1)
CTE eligible
-> Seq Scan on pgbench_accounts (cost=0.00..31410.00 rows=999404 width=4) (actual time=0.056..133.160 rows=999009 loops=1)
Filter: ((abalance >= 0) AND (abalance <= 100))
Rows Removed by Filter: 991
InitPlan 2
-> Aggregate (cost=22486.59..22486.60 rows=1 width=8) (actual time=137.553..137.553 rows=1 loops=1)
-> CTE Scan on eligible eligible_1 (cost=0.00..19988.08 rows=999404 width=0) (actual time=0.008..88.327 rows=999009 loops=1)
-> Hash Semi Join (cost=36384.63..40329.36 rows=999 width=8) (actual time=454.135..506.103 rows=8 loops=1)
Hash Cond: (pgbench_history.aid = eligible.aid)
-> Seq Scan on pgbench_history (cost=0.00..16.99 rows=999 width=12) (actual time=0.015..0.131 rows=999 loops=1)
-> Hash (cost=19988.08..19988.08 rows=999404 width=4) (actual time=432.305..432.306 rows=999009 loops=1)
Buckets: 262144 Batches: 8 Memory Usage: 6441kB
-> CTE Scan on eligible (cost=0.00..19988.08 rows=999404 width=4) (actual time=0.058..323.075 rows=999009 loops=1)
Planning Time: 0.137 ms
Execution Time: 644.042 ms
(16 rows)
PostgreSQL 17 includes an enhancement that improves the estimation of the number of distinct rows by obtaining it from the Common Table Expression (CTE) sub-plan. As a result, the planner avoids unique-ifying the inner table and keeps a Semi-Join. Even with accurate estimations, however, the large table still has to be the inner table (the right-hand side of the semi-join) and is used to build the hash table.
Rewrite SQL with Inlined Subqueries
Back to YugabyteDB, here is the query re-written with inlined subqueries instead of CTE:
yugabyte=# explain (analyze)
select (
-- eligible accounts: positive low balance
select count(*) from pgbench_accounts
where abalance between 0 and 100
) as eligible_count, min(mtime), max(mtime)
from pgbench_history
where aid in (
-- eligible accounts: positive low balance
select aid from pgbench_accounts
where abalance between 0 and 100
);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=1682118.96..1682118.97 rows=1 width=24) (actual time=224.784..224.784 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Finalize Aggregate (cost=1657098.90..1657098.91 rows=1 width=8) (actual time=216.013..216.013 rows=1 loops=1)
-> Seq Scan on pgbench_accounts pgbench_accounts_1 (cost=180.00..1654601.05 rows=999138 width=0) (actual time=216.000..216.003 rows=3 loops=1)
Storage Filter: ((abalance >= 0) AND (abalance <= 100))
Partial Aggregate: true
-> YB Batched Nested Loop Join (cost=360.00..25015.07 rows=998 width=8) (actual time=8.739..8.760 rows=10 loops=1)
Join Filter: (pgbench_history.aid = pgbench_accounts.aid)
-> Seq Scan on pgbench_history (cost=180.00..2114.75 rows=999 width=12) (actual time=0.820..1.596 rows=999 loops=1)
-> Index Scan using pgbench_accounts_pkey on pgbench_accounts (cost=180.00..10437.79 rows=998 width=4) (actual time=6.707..6.712 rows=10 loops=1)
Index Cond: (aid = ANY (ARRAY[pgbench_history.aid, $2, $3, ..., $1024]))
Storage Filter: ((abalance >= 0) AND (abalance <= 100))
Planning Time: 0.552 ms
Execution Time: 225.035 ms
Peak Memory Usage: 755 kB
The rewritten SQL is similar to PostgreSQL's non-materialized common table expression (CTE), enabling a Nested Loop Join driven by the smaller table.
YugabyteDB further optimizes this with a Batched Nested Loop Join, which pushes the join filter down as an array, thus avoiding excessive loops. This works because YugabyteDB improved the Index Scan with arrays (more about this, and another PostgreSQL 17 enhancement, in a previous post).
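To see what one batch looks like, each execution of the inner Index Scan is roughly equivalent to a single array lookup such as the following (a sketch; the actual array holds up to the batch size of aid values taken from the outer pgbench_history rows):
select aid from pgbench_accounts
 where aid = any (array[1, 42, 4242])  -- placeholder values; filled from pgbench_history.aid
   and abalance between 0 and 100;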
Conclusion
It's essential to use EXPLAIN ANALYZE to understand the query performance. This lets you pinpoint the longest operation and compare estimated rows with actual rows to understand the query planner's decisions. Sometimes, you may need to rewrite the query to enable more access paths and join methods.