Postgres Join Methods in YugabyteDB
Franck Pachot
Posted on September 11, 2023
YugabyteDB leverages PostgreSQL to handle SQL queries, including Joins. Nevertheless, it's worth noting that PostgreSQL's join methods are primarily designed for efficient access to table rows stored locally and in the single-node shared buffers. YugabyteDB doesn't reinvent the wheel with new join methods. Instead, it optimizes performance by pushing down predicates to the storage layer and minimizing the amount of data transferred between database nodes.
I create three tables (regions, countries, and cities) from the gvenzl/sample-data repo:
\! curl -s https://raw.githubusercontent.com/gvenzl/sample-data/main/countries-cities-currencies/install.sql | awk '/D a t a l o a d/{print "do $$ begin"}/ COMMIT /{print "end; $$ ;"}{print}' >> countries.sql
\i countries.sql
analyze regions, countries, cities;
I'm running this on YugabyteDB 2.19, where the Cost Based Optimizer and Batched Nested Loops are not yet enabled by default. I'll run the following query that lists the European cities with more than one million inhabitants:
select cities.population
,regions.name as region_name
, countries.name as county_name
, cities.name as city_name
from regions
join countries using (region_id)
join cities using (country_id)
where regions.name='Europe'
and cities.population > 1e6
order by population desc
;
I'll add hints to force the join methods so that I can explain all of them. Note that without forcing anything, but with the Cost Based Optimizer enabled, the YugabyteDB query planner chooses Nested Loops:
yugabyte=# explain (costs off, analyze, dist)
/*+
set( yb_enable_optimizer_statistics on )
*/
select cities.population ,regions.name as region_name , countries.name as county_name , cities.name as city_name
from regions join countries using (region_id) join cities using (country_id)
where regions.name='Europe' and cities.population > 1e6
order by population desc ;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Sort (actual time=45.449..45.451 rows=24 loops=1)
Sort Key: cities.population DESC
Sort Method: quicksort Memory: 26kB
-> Nested Loop (actual time=5.225..45.403 rows=24 loops=1)
-> Nested Loop (actual time=3.048..3.369 rows=46 loops=1)
-> Seq Scan on regions (actual time=1.336..1.338 rows=1 loops=1)
Remote Filter: ((name)::text = 'Europe'::text)
Storage Table Read Requests: 1
Storage Table Execution Time: 0.000 ms
-> Index Scan using countries_regions_fk001 on countries (actual time=1.704..1.984 rows=46 loops=1)
Index Cond: ((region_id)::text = (regions.region_id)::text)
Storage Index Read Requests: 1
Storage Index Execution Time: 0.000 ms
Storage Table Read Requests: 1
Storage Table Execution Time: 4.000 ms
-> Index Scan using cities_countries_fk001 on cities (actual time=0.882..0.883 rows=1 loops=46)
Index Cond: ((country_id)::text = (countries.country_id)::text)
Remote Filter: (population > '1000000'::numeric)
Storage Index Read Requests: 1
Storage Index Execution Time: 0.261 ms
Storage Table Read Requests: 1
Storage Table Execution Time: 0.609 ms
Planning Time: 0.299 ms
Execution Time: 45.560 ms
Storage Read Requests: 95
Storage Write Requests: 0
Storage Execution Time: 44.000 ms
Peak Memory Usage: 253 kB
(28 rows)
I'll revisit this later because the Nested Loop join is unique: it can push down the join predicate, which is a significant advantage when combining join conditions with filter conditions. However, it also comes with challenges, since each iteration over the outer rows may add latency.
Hash Join
yugabyte=# explain (costs off, analyze, dist)
/*+
leading ( ( (regions countries) cities ) )
hashjoin( regions countries )
hashjoin( regions countries cities )
set( yb_enable_optimizer_statistics on )
*/
select cities.population ,regions.name as region_name , countries.name as county_name , cities.name as city_name
from regions join countries using (region_id) join cities using (country_id)
where regions.name='Europe' and cities.population > 1e6
order by population desc ;
QUERY PLAN
-------------------------------------------------------------------------------------------
Sort (actual time=3.995..3.997 rows=24 loops=1)
Sort Key: cities.population DESC
Sort Method: quicksort Memory: 26kB
-> Hash Join (actual time=3.959..3.977 rows=24 loops=1)
Hash Cond: ((countries.country_id)::text = (cities.country_id)::text)
-> Hash Join (actual time=2.729..2.737 rows=46 loops=1)
Hash Cond: ((regions.region_id)::text = (countries.region_id)::text)
-> Seq Scan on regions (actual time=0.811..0.811 rows=1 loops=1)
Remote Filter: ((name)::text = 'Europe'::text)
Storage Table Read Requests: 1
Storage Table Execution Time: 0.000 ms
-> Hash (actual time=1.900..1.900 rows=196 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 18kB
-> Seq Scan on countries (actual time=0.971..1.851 rows=196 loops=1)
Storage Table Read Requests: 3
Storage Table Execution Time: 4.000 ms
-> Hash (actual time=1.221..1.221 rows=117 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 14kB
-> Seq Scan on cities (actual time=1.086..1.166 rows=117 loops=1)
Remote Filter: (population > '1000000'::numeric)
Storage Table Read Requests: 1
Storage Table Execution Time: 0.000 ms
Planning Time: 0.290 ms
Execution Time: 4.077 ms
Storage Read Requests: 5
Storage Write Requests: 0
Storage Execution Time: 4.000 ms
Peak Memory Usage: 241 kB
(28 rows)
The Hash Join reads all tables once. It can only apply single-table predicates, which appear as Remote Filter here because they've been pushed down to the storage layer. The join condition is applied afterward by the session process (the PostgreSQL backend) and is visible as Hash Cond. To apply the join condition, one table per join undergoes a hashing operation, enabling rapid lookups while the other table is read. In the execution plan and in the pg_hint_plan leading() hint, the hashed table typically appears as the second child of the Hash Join operation. This join method is particularly efficient when joining a large table with a lookup table, because most of the rows from the lookup table are likely needed and retrieving them all at once makes sense. It is most efficient when the lookup table is small enough to fit in memory without spilling to temporary files.
In this example, we can see that all countries were read (rows=196), as well as all cities matching the population condition from all regions (rows=117). These data sets were then transmitted over the network and only reduced later (rows=46 after the first join, rows=24 after the second) within the PostgreSQL backend.
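The Buckets, Batches, and Memory Usage figures above come from the hash build: if the hashed table did not fit in memory, it would be split into several batches spilled to temporary files. As a minimal, hedged sketch, the standard work_mem setting bounds that in-memory hash area and can be raised for the session when joining against larger lookup tables (the value below is arbitrary):
-- work_mem bounds the in-memory hash table; a larger value avoids
-- spilling to temporary-file batches when the hashed table grows.
set work_mem = '64MB';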
Merge Join
yugabyte=# explain (costs off, analyze, dist)
/*+
leading ( ( (regions countries) cities ) )
mergejoin( regions countries )
mergejoin( regions countries cities )
set( yb_enable_optimizer_statistics on )
*/
select cities.population ,regions.name as region_name , countries.name as county_name , cities.name as city_name
from regions join countries using (region_id) join cities using (country_id)
where regions.name='Europe' and cities.population > 1e6
order by population desc ;
QUERY PLAN
-------------------------------------------------------------------------------------------------
Sort (actual time=25.990..25.991 rows=24 loops=1)
Sort Key: cities.population DESC
Sort Method: quicksort Memory: 26kB
-> Merge Join (actual time=25.797..25.827 rows=24 loops=1)
Merge Cond: ((countries.country_id)::text = (cities.country_id)::text)
-> Sort (actual time=21.084..21.089 rows=46 loops=1)
Sort Key: countries.country_id
Sort Method: quicksort Memory: 28kB
-> Merge Join (actual time=21.054..21.064 rows=46 loops=1)
Merge Cond: ((regions.region_id)::text = (countries.region_id)::text)
-> Sort (actual time=18.525..18.526 rows=1 loops=1)
Sort Key: regions.region_id
Sort Method: quicksort Memory: 25kB
-> Seq Scan on regions (actual time=18.046..18.049 rows=1 loops=1)
Remote Filter: ((name)::text = 'Europe'::text)
Storage Table Read Requests: 1
Storage Table Execution Time: 8.000 ms
-> Sort (actual time=2.341..2.348 rows=148 loops=1)
Sort Key: countries.region_id
Sort Method: quicksort Memory: 36kB
-> Seq Scan on countries (actual time=1.143..2.192 rows=196 loops=1)
Storage Table Read Requests: 3
Storage Table Execution Time: 4.000 ms
-> Sort (actual time=4.706..4.711 rows=113 loops=1)
Sort Key: cities.country_id
Sort Method: quicksort Memory: 33kB
-> Seq Scan on cities (actual time=4.567..4.643 rows=117 loops=1)
Remote Filter: (population > '1000000'::numeric)
Storage Table Read Requests: 1
Storage Table Execution Time: 4.000 ms
Planning Time: 8.515 ms
Execution Time: 31.152 ms
Storage Read Requests: 5
Storage Write Requests: 0
Storage Execution Time: 16.000 ms
Peak Memory Usage: 282 kB
(36 rows)
A Merge Join is similar to a Hash Join in that it reads all tables with only their single-table predicates applied. With both methods, the same rows are retrieved from the distributed storage and sent to the PostgreSQL backend. However, the key distinction lies in the structure used to join them: instead of hashing one of the tables, both data sources are sorted, allowing them to be joined as they are read. This requires a memory work area for both tables, which makes it less efficient than the hash join in most scenarios. Nonetheless, there are specific situations where the Merge Join is employed.
When the join condition involves an inequality rather than an equality predicate, a hash structure cannot be employed, but a sort merge becomes a viable option.
If one or both tables are already sorted on the join key, potentially due to an ascending or descending index scan, there is no need to perform additional sorting. Joining from two pre-sorted sources is more efficient than constructing a hash table on one of them.
In cases where the query requires the result to be sorted on the join key because of an ORDER BY clause, it can be efficient to sort the rows being joined anyway: the sort is needed regardless, and a Merge Join benefits from it, as sketched below.
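As a minimal sketch of that last case, assume we want the result ordered by the join key (country_id) rather than by population. With the same mergejoin hints, the sort that feeds the Merge Join can also satisfy the ORDER BY, so the planner may avoid a separate final sort; the exact plan depends on the optimizer and statistics:
explain (costs off, analyze, dist)
/*+
 leading ( ( (regions countries) cities ) )
 mergejoin( regions countries )
 mergejoin( regions countries cities )
 set( yb_enable_optimizer_statistics on )
*/
select countries.country_id, countries.name as country_name, cities.name as city_name
from regions join countries using (region_id) join cities using (country_id)
where regions.name='Europe'
order by countries.country_id ;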
In YugabyteDB, neither Hash Join nor Merge Join is inherently distributed. However, they become efficient when they can take advantage of the distributed storage: scans can run in parallel across multiple tablets, and range sharding can preserve the row order.
When the join condition is highly selective, meaning that the join itself significantly reduces the number of rows involved, it becomes more efficient to push down this condition to the storage layer. This is where Nested Loops come into play, as they excel in such scenarios.
Nested Loops
yugabyte=# explain (costs off, analyze, dist)
/*+
leading ( ( (regions countries) cities ) )
nestloop( regions countries )
nestloop( regions countries cities )
set( yb_enable_optimizer_statistics on )
*/
select cities.population ,regions.name as region_name , countries.name as county_name , cities.name as city_name
from regions join countries using (region_id) join cities using (country_id)
where regions.name='Europe' and cities.population > 1e6
order by population desc ;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Sort (actual time=52.537..52.540 rows=24 loops=1)
Sort Key: cities.population DESC
Sort Method: quicksort Memory: 26kB
-> Nested Loop (actual time=9.252..52.481 rows=24 loops=1)
-> Nested Loop (actual time=6.435..6.771 rows=46 loops=1)
-> Seq Scan on regions (actual time=2.362..2.364 rows=1 loops=1)
Remote Filter: ((name)::text = 'Europe'::text)
Storage Table Read Requests: 1
Storage Table Execution Time: 0.000 ms
-> Index Scan using countries_regions_fk001 on countries (actual time=4.064..4.360 rows=46 loops=1)
Index Cond: ((region_id)::text = (regions.region_id)::text)
Storage Index Read Requests: 1
Storage Index Execution Time: 0.000 ms
Storage Table Read Requests: 1
Storage Table Execution Time: 0.000 ms
-> Index Scan using cities_countries_fk001 on cities (actual time=0.960..0.961 rows=1 loops=46)
Index Cond: ((country_id)::text = (countries.country_id)::text)
Remote Filter: (population > '1000000'::numeric)
Storage Index Read Requests: 1
Storage Index Execution Time: 0.522 ms
Storage Table Read Requests: 1
Storage Table Execution Time: 0.261 ms
Planning Time: 1.685 ms
Execution Time: 53.387 ms
Storage Read Requests: 95
Storage Write Requests: 0
Storage Execution Time: 36.000 ms
Peak Memory Usage: 253 kB
(28 rows)
When we consider the number of rows involved, Nested Loops tend to be the most efficient because they apply the join condition alongside the single-table predicates. In this specific case, scanning the countries returned only the relevant countries (rows=46) for the region of interest, in a single loop (loops=1), because there is only one region coming from the outer table (rows=1 from regions). This ideal scenario becomes less efficient as the number of rows from the outer table increases. In the second join, with cities, it must look up the rows for the 46 countries, and with a nested loop this number of iterations is determined by the outer table result (loops=46).
In the execution plan, it's important to note that all statistics are reported per loop, and to get the complete picture you need to multiply them accordingly. For instance, the total time spent reading the inner table is 0.961 ms * 46 loops, roughly 44 ms. Likewise, there is one read request to the index and one to the table for each loop, so 2 * 46 = 92 read requests for the inner table, which, added to the 3 requests of the outer scans, matches the 95 Storage Read Requests reported for the whole query. If these tables are distributed across different zones or regions, the network latency applies to each individual loop, further impacting performance.
Nested Loop is indeed an effective join method when there are very few rows from the outer table. However, as the number of rows from the outer table increases, optimization strategies come into play. One optimization is to add more columns to the inner index, aiming to benefit from an Index Only Scan and reduce data retrieval overhead.
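For example, a covering index on the inner table could serve those lookups without an additional per-loop read of the table rows. This is only a sketch, assuming the cities columns used by the query above (country_id, population, name); the index name and column ordering are made up, and whether an Index Only Scan is actually chosen depends on the planner:
-- Hypothetical covering index: join key and filter column as the range key,
-- with name included so the inner scan does not need to touch the table.
create index if not exists cities_country_pop_name
  on cities (country_id asc, population desc)
  include (name);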
Another optimization, specific to YugabyteDB, is to minimize the number of loops with a Batched Nested Loop. This optimization matters in YugabyteDB because each loop entails a remote call, a much more substantial cost than in monolithic PostgreSQL, where these loops mostly read from local shared buffers.
Batched Nested Loops
yugabyte=# explain (costs off, analyze, dist)
/*+
leading ( ( (regions countries) cities ) )
nestloop( regions countries )
nestloop( regions countries cities )
set( yb_bnl_batch_size 1024 )
set( yb_enable_optimizer_statistics on )
*/
select cities.population ,regions.name as region_name , countries.name as county_name , cities.name as city_name
from regions join countries using (region_id) join cities using (country_id)
where regions.name='Europe' and cities.population > 1e6
order by population desc ;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Sort (actual time=32.558..32.560 rows=24 loops=1)
Sort Key: cities.population DESC
Sort Method: quicksort Memory: 26kB
-> YB Batched Nested Loop Join (actual time=30.420..32.067 rows=24 loops=1)
Join Filter: ((countries.country_id)::text = (cities.country_id)::text)
-> YB Batched Nested Loop Join (actual time=24.807..24.943 rows=46 loops=1)
Join Filter: ((regions.region_id)::text = (countries.region_id)::text)
-> Seq Scan on regions (actual time=16.913..16.916 rows=1 loops=1)
Remote Filter: ((name)::text = 'Europe'::text)
Storage Table Read Requests: 1
Storage Table Execution Time: 8.000 ms
-> Index Scan using countries_regions_fk001 on countries (actual time=7.339..7.443 rows=46 loops=1)
Index Cond: ((region_id)::text = ANY (ARRAY[(regions.region_id)::text, ($1)::text, ($2)::text, ..., ($1023)::text]))
Storage Index Read Requests: 1
Storage Index Execution Time: 0.000 ms
Storage Table Read Requests: 1
Storage Table Execution Time: 0.000 ms
-> Index Scan using cities_countries_fk001 on cities (actual time=5.368..6.984 rows=24 loops=1)
Index Cond: ((country_id)::text = ANY (ARRAY[(countries.country_id)::text, ($1025)::text, ($1026)::text, ..., ($2047)::text]))
Remote Filter: (population > '1000000'::numeric)
Storage Index Read Requests: 1
Storage Index Execution Time: 0.000 ms
Storage Table Read Requests: 3
Storage Table Execution Time: 4.000 ms
Planning Time: 6.627 ms
Execution Time: 38.111 ms
Storage Read Requests: 7
Storage Write Requests: 0
Storage Execution Time: 12.000 ms
Peak Memory Usage: 932 kB
(30 rows)
The key distinction lies in how the values from the outer table are handled in a Batched Nested Loop: they are batched into an array, effectively transforming the join condition into an IN or =ANY condition. In our example, with the batch size set to its maximum of 1024, which is greater than the 46 outer rows, all inner rows have been read in a single loop (rows=24 loops=1). This effectively distributes the join per batch, reducing the number of read requests and therefore the number of remote calls, which can significantly enhance performance.
In the majority of cases, Batched Nested Loop proves to be the optimal solution as it effectively prevents the retrieval of excessive rows and mitigates the impact of remote call latency. This join method is especially well-suited when the join condition involves filtering on the inner table. For instance, consider a scenario where you want to query all orders for a small subset of customers in a specific location by joining the customers table with the orders table. Using a Hash Join or a Merge Join would retrieve all orders, including those from other customers as well. In contrast, Batched Nested Loop will obtain the list of customer keys and incorporate it into the scan condition on the orders table, precisely targeting the desired subset of data, resulting in more efficient and selective queries.
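To illustrate that customers and orders scenario, here is a hypothetical sketch: the customers and orders tables and their columns are made up for illustration, and the hints follow the same style as above. With batching, the list of selected customer keys is sent to the scan on orders as an =ANY condition instead of one remote call per customer:
explain (costs off, analyze, dist)
/*+
 leading ( (customers orders) )
 nestloop( customers orders )
 set( yb_bnl_batch_size 1024 )
*/
select orders.order_id, orders.amount
from customers join orders using (customer_id)
where customers.city = 'Geneva' ;
The yb_bnl_batch_size parameter can also be set for the whole session (set yb_bnl_batch_size = 1024;) rather than per query.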
Performance and scalability
Nested Loops, including batched Nested Loops, offer an advantage beyond raw performance: their performance is directly proportional to the number of outer rows, which typically corresponds to the driving table. This means that the response time scales in proportion to the result size, aligning with user expectations.
This scalability is particularly beneficial for scenarios like pagination, where you need to retrieve the top-n rows quickly. If the outer rows are sorted in the right order, Nested Loops can efficiently fetch the desired rows. The same applies to Merge Join, which requires that both tables are read from an index with the expected order.
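As a hedged sketch of that pagination case with the tables above: if an index provided cities ordered by descending population (the index below is hypothetical), Nested Loops driving from cities could stop after the first page instead of joining and sorting everything. Whether the planner actually picks this plan depends on costs and statistics:
-- Hypothetical index returning cities pre-sorted by descending population:
create index if not exists cities_population_desc on cities (population desc);
explain (costs off, analyze, dist)
/*+
 leading ( ( (cities countries) regions ) )
 nestloop( cities countries )
 nestloop( cities countries regions )
*/
select cities.name, cities.population
from cities join countries using (country_id) join regions using (region_id)
where regions.name = 'Europe'
order by cities.population desc
limit 10 ;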
In contrast, Hash Join necessitates reading and hashing the entire inner table before returning the first row, which can be less efficient for large datasets. To maintain scalability, it's often preferable for these tables to have fixed sizes, such as reference tables or, for growing tables, a single partition of them. This ensures predictable and consistent performance even as the data volume grows.
The query planner's role is to determine the appropriate join method, as well as the optimal join order and direction for a query. It does this by estimating join selectivity and the resulting cardinalities, helped by the Cost-Based Optimizer's statistics. For the cases where those estimations are not sufficient, or when you want to give directives for query execution, YugabyteDB is deployed with the pg_hint_plan extension.