🚀 Boosting TPC-H Q2 Query Performance by 1000x times: PawSQL Optimization Techniques

1️⃣ Introduction

This article dives into the process of optimizing the TPC-H benchmark Q2 query using PawSQL. We will explore the specific strategies involved in query rewriting, index optimization, and execution plan improvements, supported by a detailed quantitative analysis to assess the impact of these optimizations on performance.

You can find the details of this optimization @ https://www.pawsql.com/statement/1837384704930025474

2️⃣ Original Query Analysis

The original Q2 query involves six tables (supplier, nation, partsupp, part, lineitem), with complex multi-level nested subqueries and several join conditions. The goal of the query is to identify suppliers from a specific country (e.g., ALGERIA) who provide parts (with names starting with "green") at the lowest cost.

select supplier.s_name, supplier.s_address
from supplier, nation
where supplier.s_suppkey in (
    select partsupp.ps_suppkey
    from partsupp
    where partsupp.ps_partkey in (
    select part.p_partkey
    from part
    where part.p_name like 'green%')
    and partsupp.ps_availqty > (select 0.5 * sum(lineitem.l_quantity)
                        from lineitem
                        where lineitem.l_partkey = partsupp.ps_partkey
                           and lineitem.l_suppkey = partsupp.ps_suppkey
                           and lineitem.l_shipdate >= date '1997-01-01'
                           and lineitem.l_shipdate < date '1997-01-01' + interval '1' YEAR))
    and supplier.s_nationkey = nation.n_nationkey
    and nation.n_name = 'ALGERIA'
order by supplier.s_name

3️⃣ Query Rewriting Optimization

3.1 Subquery Transformation

PawSQL optimizes the SQL structure by transforming the IN subquery into a more efficient EXISTS subquery. The optimized query is as follows:

SELECT supplier.s_name, supplier.s_address
FROM supplier, nation
WHERE EXISTS (
  SELECT partsupp.ps_suppkey
  FROM partsupp, part
  WHERE part.p_name LIKE 'green%'
    AND part.p_partkey = partsupp.ps_partkey
    AND partsupp.ps_availqty > (
      SELECT 0.5 * SUM(lineitem.l_quantity)
      FROM lineitem
      WHERE lineitem.l_partkey = partsupp.ps_partkey
        AND lineitem.l_suppkey = partsupp.ps_suppkey
        AND lineitem.l_shipdate >= DATE '1997-01-01'
        AND lineitem.l_shipdate < DATE '1997-01-01' + INTERVAL '1' YEAR
    )
    AND partsupp.ps_suppkey = supplier.s_suppkey
)
  AND supplier.s_nationkey = nation.n_nationkey
  AND nation.n_name = 'ALGERIA'
ORDER BY supplier.s_name;

Highlights of SQL Difference

3.2 Key Optimization Points

Subquery Simplification: Eliminates one layer of nested subqueries, simplifying the overall structure.
Condition Merging: Combines conditions on part and partsupp tables, reducing intermediate result sets.
Maintains Subquery Logic: Retains the necessary subquery structure for further execution plan optimizations.

4️⃣ 🔍 Index Optimization Strategy

PawSQL recommends creating the following indexes for the Q2 query:

CREATE INDEX PAWSQL_IDX0357178651 ON tpch.nation(N_NAME, N_NATIONKEY, N_REGIONKEY);
CREATE INDEX PAWSQL_IDX0327029402 ON tpch.part(P_NAME);
CREATE INDEX PAWSQL_IDX0485218972 ON tpch.lineitem(L_PARTKEY, L_SUPPKEY, L_SHIPDATE);
CREATE INDEX PAWSQL_IDX0214365528 ON tpch.supplier(S_NATIONKEY, S_SUPPKEY, S_NAME, S_ADDRESS);

Index Optimization Analysis:

Nation Table Index: Supports fast lookups by nation.n_name for efficient country-specific filtering.
Part Table Index: Optimizes the part.p_name LIKE query for better filtering performance.
Lineitem Table Index: Covers join conditions and date range filtering, minimizing full table scans.
Supplier Table Index: Enhances performance for joins with the nation table and supports final result ordering.

5️⃣ Execution Plan Comparison

5.1 Pre-Optimization Execution Plan

Nested Loop Joins: Multiple nested loop joins result in poor performance.
Full Table Scan on Part Table: Processes 2000 rows, causing inefficiencies.
Full Table Scan on Lineitem Table: Scans 60,175 rows per iteration, leading to a high processing overhead.

5.2 Post-Optimization Execution Plan

Efficient Index Lookups and Range Scans: New indexes enable faster lookups and efficient range scans.
Part Table: With PAWSQL_IDX0327029402 index, the scan is reduced to 16 rows.
Lineitem Table: Uses PAWSQL_IDX0485218972 index, significantly reducing the number of rows processed.
Nation and Supplier Tables: Indexes provide precise matching, improving join performance.

Plan Comparison of TPC-H Q2

5.3 Key Improvements

Nation Table: Switched from a full table scan to an index lookup, reducing rows processed from 25 to 1.
Part Table: Converted from a full table scan to an index range scan, lowering rows processed from 2000 to 16.
Lineitem Table: Replaced full table scans with indexed lookups, reducing processed rows from 60,175 to approximately 1 per iteration.

6️⃣ Quantitative Performance Analysis

Execution Time: Reduced from approximately 1433.535 milliseconds to 0.894 milliseconds.
Performance Improvement: Achieved a massive performance boost of approximately 160,250.60%.
Primary Contributing Factors: Index optimizations resulted in significant improvements in data access patterns and a sharp reduction in the number of processed rows.

7️⃣ Additional Optimization Suggestions

Index Cleanup: Consider removing redundant indexes on the lineitem table to improve DML performance.
Constant Filter Optimization: For frequently used constant filters (e.g., nation.n_name = 'ALGERIA'), prioritize using indexed lookups for faster execution.

8️⃣ Conclusion 🏆

This case study demonstrates the effectiveness of PawSQL in optimizing complex queries. Key takeaways include:

Query Rewriting: Simplifies query structure and provides the optimizer with more opportunities for efficiency.
Index Design: Well-designed indexes significantly reduce data access volume and processing rows.
Execution Plan Optimization: Reducing full table scans and adjusting join orders are critical to performance gains.

By applying these optimization strategies, query performance saw an exponential improvement. This showcases not only the value of professional SQL optimization tools like PawSQL but also serves as a valuable reference for optimizing queries in large-scale data processing environments.

In practice, ongoing index and query optimization, alongside regular evaluation of data distribution and query patterns, are essential for maintaining and improving system performance and resource efficiency over time.

Blog