Introduction to GBase 8c Performance Tuning Based on the TPCH Model

congcong

Cong Li

Posted on August 28, 2024

Introduction to GBase 8c Performance Tuning Based on the TPCH Model

In today's data-driven era, optimizing database performance is key for enterprises to enhance their competitiveness. This article will explore how to perform performance tuning on the GBase 8c database based on the TPCH (Transaction Processing Performance Council Benchmark H) model, providing readers with a scientifically rigorous, insightful, and well-organized tuning method.

I. Introduction to the TPCH Model

What is TPCH?
TPC-H, established by the Transaction Processing Performance Council (TPC) in 1994, is a standard benchmark designed for decision support systems in retail business scenarios. It is a tool to measure the capabilities of OLAP/HTAP databases and is based on a typical snowflake schema, consisting of 8 tables and 22 queries, adhering to the SQL92 standard.

The size of the data volume directly impacts query performance. TPC-H uses SF (Scale Factor) to describe data volume, where 1 SF corresponds to 1GB. The SF values increase in the following order: 1, 10, 30, 100, 300, 1000, 3000, 10000. 1 SF refers to the total data volume of the 8 tables, excluding indexes.

Table 1. TPCH 22 SQL Interpretations

SQL Name Business Context Statement Characteristics Standard Requirements
Q1 Price Statistics Report Query This is a pricing summary report querying the lineItems. It involves statistical analysis of various types of goods over a specific period, such as billed, shipped, discounted, and taxed items, including information on business volume billing, shipping, discount, tax, and average price. A single table query with grouping, sorting, and aggregation operations. This query can result in reading 95% to 97% of the rows in the table.
Q2 Minimum Cost Supplier Query Finds the supplier with the minimum cost. It identifies which supplier within a given region can supply a specified part (of a particular type and size) at the lowest price, allowing for decision-making in order placement. A multi-table query with sorting, aggregation, and subquery operations. The query does not syntactically limit the number of tuples returned, but TPC-H standards require that only the top 100 rows be returned (usually implemented by the application).
Q3 Shipping Priority Query Retrieves the top 10 revenue-generating orders that have not yet been shipped. It provides the shipping priorities and potential revenues (sum of l_extendedprice * (1-l_discount)) of the highest revenue orders not shipped before a given date. A three-table query with grouping, sorting, and aggregation operations. The query does not syntactically limit the number of tuples returned, but TPC-H standards require that only the top 10 rows be returned (usually implemented by the application).
Q4 Order Priority Query Calculates the number of orders within a given three-month period, where each order has at least one line item received by the customer after its commit date. A single-table query with grouping, sorting, aggregation, and correlated subquery operations.
Q5 Revenue from a Specific Region Supplier Query Retrieves the revenue statistics (calculated as sum(l_extendedprice * (1 - l_discount))) generated by suppliers from a specific region. This can be used to decide whether to establish a local distribution center in the given area. A multi-table join query with grouping, sorting, aggregation, and subquery operations.
Q6 Revenue Change Prediction Query Obtains the incremental revenue generated by changing discounts over a year. It is a typical "what-if" analysis used to find ways to increase revenue. The query considers all shipped orders in a specified year with discounts between "DISCOUNT-0.01" and "DISCOUNT+0.01" and calculates the total revenue increase by eliminating discounts on orders with l_quantity less than a specified quantity. A single-table query with aggregation operations, using the BETWEEN-AND operator, which can be optimized in some databases.
Q7 Shipping Profit Query Retrieves the profit from sales between the supplier's country and the country where goods are sold. This query determines the volume of goods shipped between two countries to aid in renegotiating shipping contracts. A multi-table query with grouping, sorting, aggregation, and subquery operations. The parent query does not have other query objects, making the subquery format relatively simple.
Q8 National Market Share Query Retrieves changes in the market share of a given part type in a specific country's region over the past two years. A query with grouping, sorting, aggregation, and subquery operations. The parent query has no other query objects, making it relatively simple. However, the subquery itself is a multi-table join query.
Q9 Product Type Profit Estimation Query Retrieves the total profit for all parts ordered per country per year. A query with grouping, sorting, aggregation, and subquery operations. The parent query has no other query objects, making it relatively simple. However, the subquery itself is a multi-table join query. The subquery uses the LIKE operator, which some query optimizers may not support for optimization.
Q10 Problematic Shipping Query Retrieves customers and losses associated with problematic shipments within a country over a three-month period. A multi-table join query with grouping, sorting, and aggregation operations. The query does not syntactically limit the number of tuples returned, but TPC-H standards require that only the top 10 rows be returned (usually implemented by the application).
Q11 Inventory Value Query Retrieves the value of parts supplied by a particular country in the inventory. A multi-table join query with grouping, sorting, aggregation, and subquery operations. The subquery is in the HAVING condition of the grouping operation.
Q12 Shipping Mode and Order Priority Query Obtains shipping modes and order priorities. It helps decide whether selecting cheaper shipping modes will result in more deliveries after the contract date, negatively impacting urgent orders. A two-table join query with grouping, sorting, and aggregation operations.
Q13 Customer Order Quantity Query Retrieves the order quantity for customers, including those who have no past or present order records. A query with grouping, sorting, aggregation, subquery, and left outer join operations. The TPC-H standard defines equivalent variations of the Q13 statement. The transformed SQL converts the subquery portion into a view, and then the view is queried, as some databases do not support.
Q14 Promotion Effect Query Retrieves the percentage of revenue from promotional parts in a given month. It helps monitor the market response to promotions. A query with grouping, sorting, aggregation, subquery, and left outer join operations. The TPC-H standard defines an equivalent variant of the Q14 statement, which is structurally similar to the above query, with different expressions for target columns.
Q15 Top Supplier Query Retrieves information about the top revenue-contributing supplier (ranked first) within a specific period. This can be used to decide which top suppliers to reward, give more orders, special certification, encouragement, or incentives. A join query between regular tables and views with grouping, sorting, aggregation, and aggregate subquery operations. The TPC-H standard defines an equivalent variation of the Q15 statement. Unlike the above query statement, which first defines a view and then joins the table and view, the transformed SQL defines a WITH statement, then joins the WITH object and the table.
Q16 Query for Suppliers of Parts with Specific Attributes Obtains the number of suppliers who can supply parts under specified conditions. This can be used to determine whether there are enough suppliers when order volumes are high and tasks are urgent. A two-table join query with grouping, sorting, aggregation, deduplication, and NOT IN subquery operations.
Q17 Small Order Revenue Query Retrieves small batch orders lower than 20% of the average supply volume. It determines the average item count for parts of a specified brand and packaging type in all orders over seven years (past and pending). It calculates the annual revenue loss if small orders are no longer accepted. This can be used to estimate the average annual revenue loss without small orders, as bulk shipments will reduce administrative costs. A two-table join query with aggregation and aggregate subquery operations.
Q18 Large Order Customer Query Retrieves supplier information for orders larger than a specified supply quantity. This can be used to verify if there are enough suppliers when order volumes are high and tasks are urgent. A three-table join query with grouping, sorting, aggregation, and IN subquery operations. The query does not syntactically limit the number of tuples returned, but TPC-H standards require that only the top 100 rows be returned (usually implemented by the application).
Q19 Discount Revenue Query Retrieves total discount revenue for all orders for three different types of parts transported by air or by hand, considering specific brand, packaging, and size ranges. This is an example of data mining tools generating formatted code. A three-table join query with grouping, sorting, aggregation, and IN subquery operations.
Q20 Supplier Competitiveness Query Identifies suppliers who can offer more competitive prices for a particular part within a specified year. A supplier is considered competitive if they have excess supply, with over 50% of a part's shipment to a specified country being supplied by that supplier in a given year. A two-table join query with sorting, aggregation, IN subquery, and regular subquery operations.
Q21 Late Delivery Supplier Query Retrieves suppliers who fail to deliver on time. A four-table join query with grouping, sorting, aggregation, EXISTS subquery, and NOT EXISTS subquery operations. The query does not syntactically limit the number of tuples returned, but TPC-H standards require that only the top 100 rows be returned (usually implemented by the application).
Q22 Global Sales Opportunity Query Retrieves the geographic distribution of potential consumer purchases. It calculates the number of consumers in specified countries who are more positive than average but have not placed orders in seven years, reflecting general consumer attitudes, i.e., purchase intentions. A four-table join query with grouping, sorting, aggregation, EXISTS subquery, and NOT EXISTS subquery operations.

II. Testing Process

GBase 8c, a multi-modal database, can use the TPCH model to measure its performance. Below is the TPCH testing process for the GBase 8c database.

1. Data Generation

Generate 100G data:

Compile dbgen:

cp makefile.suite makefile
make clean
make
Enter fullscreen mode Exit fullscreen mode

Generate data:

./dbgen -s 100 &
Enter fullscreen mode Exit fullscreen mode

2. Table Creation

Assume the database has been installed, deployed, and configured according to the TPCH business model.

gsql -p 5432 -d tpch -f tpch.sql
Enter fullscreen mode Exit fullscreen mode

The tpch.sql script can be obtained from the official TPC website.

3. Load Data into the Database

For example, loading data into the GBase 8c distributed database:

sh load.sh /data1/tpch/dbgen/data100/supplier/ 5432 tpch nation 
sh load.sh /data1/tpch/dbgen/data100/supplier/ 5432 tpch supplier
sh load.sh /data1/tpch/dbgen/data100/part/ 5432 tpch part
sh load.sh /data1/tpch/dbgen/data100/customer/ 5432 tpch customer
sh load.sh /data1/tpch/dbgen/data100/partsupp/ 5432 tpch partsupp
sh load.sh /data1/tpch/dbgen/data100/orders/ 5432 tpch orders
sh load.sh /data1/tpch/dbgen/data100/lineitem/ 5432 tpch lineitem
Enter fullscreen mode Exit fullscreen mode

Sample load.sh script:

#!/bin/bash
# Parameter check
if [ "$#" -ne 3 ]; then
   echo "Usage: $0 <data_directory> <port> <database_name>"
   exit 1
fi
DATA_DIR=$1
PORT=$2
DB_NAME=$3
# Load data into the database
for table in nation supplier part customer partsupp orders lineitem; do
   TABLE_FILE="${DATA_DIR}/${table}.tbl"
   if [ -f "$TABLE_FILE" ]; then
       echo "Loading ${table}..."
       psql -h localhost -p $PORT -d $DB_NAME -c "\copy ${table} FROM '${TABLE_FILE}' WITH DELIMITER '|' NULL AS ''"
   else
       echo "File ${TABLE_FILE} not found."
   fi
done
echo "Data loading completed."
Enter fullscreen mode Exit fullscreen mode

4. Check Data Volume in Tables

Table Name Number of Records
customer 15,000,000
lineitem 600,037,902
nation 25
orders 150,000,000
part 20,000,000
partsupp 80,000,000
region 5
supplier 1,000,000

5. Execute Performance Tests

gsql -p 5432 -d tpch -f tpch.sql
Enter fullscreen mode Exit fullscreen mode

The total time obtained represents the TPCH performance metric. During execution, adjust parameters such as work_memory and cstore_buffers based on the execution plan and resource usage to achieve minimal response time.

III. Performance Tuning

Database performance directly affects application response time and user experience. Performance tuning based on the TPCH model for GBase 8c can significantly improve system processing capabilities and stability. The basic principles of tuning are as follows:

  1. Understand the System Architecture: Gain an in-depth understanding of GBase 8c components, including the storage engine and query optimizer.
  2. Data Modeling and Index Optimization: Design data models effectively and create indexes to accelerate queries.
  3. Hardware Resource Allocation: Allocate CPU, memory, and storage resources based on system load.
  4. SQL Optimization: Write efficient SQL statements to avoid unnecessary calculations and data access.
  5. Monitoring and Analysis: Continuously monitor system performance, analyze bottlenecks, and make targeted optimizations.

Tuning Steps:

  1. Baseline Testing: Conduct baseline performance tests on GBase 8c using the TPCH model.
  2. Problem Identification: Use monitoring tools to identify performance bottlenecks, such as CPU usage and I/O wait times.
  3. Parameter Adjustment: Adjust database configuration parameters, such as cache size and concurrency control, based on bottlenecks.
  4. Index Optimization: Analyze query patterns and create or adjust indexes to improve query efficiency.
  5. SQL Tuning: Optimize SQL statements to reduce resource consumption and execution time.
  6. Hardware Upgrades: Upgrade hardware as needed, such as adding more memory or using faster storage media.
  7. Repeat Testing: After adjustments, rerun TPCH tests to verify performance improvements.

Through performance tuning of the GBase database based on the TPCH model, we have not only improved system processing capabilities but also gained a deeper understanding of performance tuning principles and methods. As technology continues to advance, performance tuning will remain a critical aspect of database management.

💖 💪 🙅 🚩
congcong
Cong Li

Posted on August 28, 2024

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related