Introduction to GBase 8c Performance Tuning Based on the TPCH Model
Cong Li
Posted on August 28, 2024
In today's data-driven era, optimizing database performance is key for enterprises to enhance their competitiveness. This article will explore how to perform performance tuning on the GBase 8c database based on the TPCH (Transaction Processing Performance Council Benchmark H) model, providing readers with a scientifically rigorous, insightful, and well-organized tuning method.
I. Introduction to the TPCH Model
What is TPCH?
TPC-H, established by the Transaction Processing Performance Council (TPC) in 1994, is a standard benchmark designed for decision support systems in retail business scenarios. It is a tool to measure the capabilities of OLAP/HTAP databases and is based on a typical snowflake schema, consisting of 8 tables and 22 queries, adhering to the SQL92 standard.
The size of the data volume directly impacts query performance. TPC-H uses SF (Scale Factor) to describe data volume, where 1 SF corresponds to 1GB. The SF values increase in the following order: 1, 10, 30, 100, 300, 1000, 3000, 10000. 1 SF refers to the total data volume of the 8 tables, excluding indexes.
Table 1. TPCH 22 SQL Interpretations
SQL | Name | Business Context | Statement Characteristics | Standard Requirements |
---|---|---|---|---|
Q1 | Price Statistics Report Query | This is a pricing summary report querying the lineItems. It involves statistical analysis of various types of goods over a specific period, such as billed, shipped, discounted, and taxed items, including information on business volume billing, shipping, discount, tax, and average price. | A single table query with grouping, sorting, and aggregation operations. This query can result in reading 95% to 97% of the rows in the table. | |
Q2 | Minimum Cost Supplier Query | Finds the supplier with the minimum cost. It identifies which supplier within a given region can supply a specified part (of a particular type and size) at the lowest price, allowing for decision-making in order placement. | A multi-table query with sorting, aggregation, and subquery operations. | The query does not syntactically limit the number of tuples returned, but TPC-H standards require that only the top 100 rows be returned (usually implemented by the application). |
Q3 | Shipping Priority Query | Retrieves the top 10 revenue-generating orders that have not yet been shipped. It provides the shipping priorities and potential revenues (sum of l_extendedprice * (1-l_discount)) of the highest revenue orders not shipped before a given date. | A three-table query with grouping, sorting, and aggregation operations. | The query does not syntactically limit the number of tuples returned, but TPC-H standards require that only the top 10 rows be returned (usually implemented by the application). |
Q4 | Order Priority Query | Calculates the number of orders within a given three-month period, where each order has at least one line item received by the customer after its commit date. | A single-table query with grouping, sorting, aggregation, and correlated subquery operations. | |
Q5 | Revenue from a Specific Region Supplier Query | Retrieves the revenue statistics (calculated as sum(l_extendedprice * (1 - l_discount))) generated by suppliers from a specific region. This can be used to decide whether to establish a local distribution center in the given area. | A multi-table join query with grouping, sorting, aggregation, and subquery operations. | |
Q6 | Revenue Change Prediction Query | Obtains the incremental revenue generated by changing discounts over a year. It is a typical "what-if" analysis used to find ways to increase revenue. The query considers all shipped orders in a specified year with discounts between "DISCOUNT-0.01" and "DISCOUNT+0.01" and calculates the total revenue increase by eliminating discounts on orders with l_quantity less than a specified quantity. | A single-table query with aggregation operations, using the BETWEEN-AND operator, which can be optimized in some databases. | |
Q7 | Shipping Profit Query | Retrieves the profit from sales between the supplier's country and the country where goods are sold. This query determines the volume of goods shipped between two countries to aid in renegotiating shipping contracts. | A multi-table query with grouping, sorting, aggregation, and subquery operations. The parent query does not have other query objects, making the subquery format relatively simple. | |
Q8 | National Market Share Query | Retrieves changes in the market share of a given part type in a specific country's region over the past two years. | A query with grouping, sorting, aggregation, and subquery operations. The parent query has no other query objects, making it relatively simple. However, the subquery itself is a multi-table join query. | |
Q9 | Product Type Profit Estimation Query | Retrieves the total profit for all parts ordered per country per year. | A query with grouping, sorting, aggregation, and subquery operations. The parent query has no other query objects, making it relatively simple. However, the subquery itself is a multi-table join query. The subquery uses the LIKE operator, which some query optimizers may not support for optimization. | |
Q10 | Problematic Shipping Query | Retrieves customers and losses associated with problematic shipments within a country over a three-month period. | A multi-table join query with grouping, sorting, and aggregation operations. | The query does not syntactically limit the number of tuples returned, but TPC-H standards require that only the top 10 rows be returned (usually implemented by the application). |
Q11 | Inventory Value Query | Retrieves the value of parts supplied by a particular country in the inventory. | A multi-table join query with grouping, sorting, aggregation, and subquery operations. The subquery is in the HAVING condition of the grouping operation. | |
Q12 | Shipping Mode and Order Priority Query | Obtains shipping modes and order priorities. It helps decide whether selecting cheaper shipping modes will result in more deliveries after the contract date, negatively impacting urgent orders. | A two-table join query with grouping, sorting, and aggregation operations. | |
Q13 | Customer Order Quantity Query | Retrieves the order quantity for customers, including those who have no past or present order records. | A query with grouping, sorting, aggregation, subquery, and left outer join operations. | The TPC-H standard defines equivalent variations of the Q13 statement. The transformed SQL converts the subquery portion into a view, and then the view is queried, as some databases do not support. |
Q14 | Promotion Effect Query | Retrieves the percentage of revenue from promotional parts in a given month. It helps monitor the market response to promotions. | A query with grouping, sorting, aggregation, subquery, and left outer join operations. | The TPC-H standard defines an equivalent variant of the Q14 statement, which is structurally similar to the above query, with different expressions for target columns. |
Q15 | Top Supplier Query | Retrieves information about the top revenue-contributing supplier (ranked first) within a specific period. This can be used to decide which top suppliers to reward, give more orders, special certification, encouragement, or incentives. | A join query between regular tables and views with grouping, sorting, aggregation, and aggregate subquery operations. | The TPC-H standard defines an equivalent variation of the Q15 statement. Unlike the above query statement, which first defines a view and then joins the table and view, the transformed SQL defines a WITH statement, then joins the WITH object and the table. |
Q16 | Query for Suppliers of Parts with Specific Attributes | Obtains the number of suppliers who can supply parts under specified conditions. This can be used to determine whether there are enough suppliers when order volumes are high and tasks are urgent. | A two-table join query with grouping, sorting, aggregation, deduplication, and NOT IN subquery operations. | |
Q17 | Small Order Revenue Query | Retrieves small batch orders lower than 20% of the average supply volume. It determines the average item count for parts of a specified brand and packaging type in all orders over seven years (past and pending). It calculates the annual revenue loss if small orders are no longer accepted. This can be used to estimate the average annual revenue loss without small orders, as bulk shipments will reduce administrative costs. | A two-table join query with aggregation and aggregate subquery operations. | |
Q18 | Large Order Customer Query | Retrieves supplier information for orders larger than a specified supply quantity. This can be used to verify if there are enough suppliers when order volumes are high and tasks are urgent. | A three-table join query with grouping, sorting, aggregation, and IN subquery operations. | The query does not syntactically limit the number of tuples returned, but TPC-H standards require that only the top 100 rows be returned (usually implemented by the application). |
Q19 | Discount Revenue Query | Retrieves total discount revenue for all orders for three different types of parts transported by air or by hand, considering specific brand, packaging, and size ranges. This is an example of data mining tools generating formatted code. | A three-table join query with grouping, sorting, aggregation, and IN subquery operations. | |
Q20 | Supplier Competitiveness Query | Identifies suppliers who can offer more competitive prices for a particular part within a specified year. A supplier is considered competitive if they have excess supply, with over 50% of a part's shipment to a specified country being supplied by that supplier in a given year. | A two-table join query with sorting, aggregation, IN subquery, and regular subquery operations. | |
Q21 | Late Delivery Supplier Query | Retrieves suppliers who fail to deliver on time. | A four-table join query with grouping, sorting, aggregation, EXISTS subquery, and NOT EXISTS subquery operations. | The query does not syntactically limit the number of tuples returned, but TPC-H standards require that only the top 100 rows be returned (usually implemented by the application). |
Q22 | Global Sales Opportunity Query | Retrieves the geographic distribution of potential consumer purchases. It calculates the number of consumers in specified countries who are more positive than average but have not placed orders in seven years, reflecting general consumer attitudes, i.e., purchase intentions. | A four-table join query with grouping, sorting, aggregation, EXISTS subquery, and NOT EXISTS subquery operations. |
II. Testing Process
GBase 8c, a multi-modal database, can use the TPCH model to measure its performance. Below is the TPCH testing process for the GBase 8c database.
1. Data Generation
Generate 100G data:
Compile dbgen:
cp makefile.suite makefile
make clean
make
Generate data:
./dbgen -s 100 &
2. Table Creation
Assume the database has been installed, deployed, and configured according to the TPCH business model.
gsql -p 5432 -d tpch -f tpch.sql
The tpch.sql
script can be obtained from the official TPC website.
3. Load Data into the Database
For example, loading data into the GBase 8c distributed database:
sh load.sh /data1/tpch/dbgen/data100/supplier/ 5432 tpch nation
sh load.sh /data1/tpch/dbgen/data100/supplier/ 5432 tpch supplier
sh load.sh /data1/tpch/dbgen/data100/part/ 5432 tpch part
sh load.sh /data1/tpch/dbgen/data100/customer/ 5432 tpch customer
sh load.sh /data1/tpch/dbgen/data100/partsupp/ 5432 tpch partsupp
sh load.sh /data1/tpch/dbgen/data100/orders/ 5432 tpch orders
sh load.sh /data1/tpch/dbgen/data100/lineitem/ 5432 tpch lineitem
Sample load.sh
script:
#!/bin/bash
# Parameter check
if [ "$#" -ne 3 ]; then
echo "Usage: $0 <data_directory> <port> <database_name>"
exit 1
fi
DATA_DIR=$1
PORT=$2
DB_NAME=$3
# Load data into the database
for table in nation supplier part customer partsupp orders lineitem; do
TABLE_FILE="${DATA_DIR}/${table}.tbl"
if [ -f "$TABLE_FILE" ]; then
echo "Loading ${table}..."
psql -h localhost -p $PORT -d $DB_NAME -c "\copy ${table} FROM '${TABLE_FILE}' WITH DELIMITER '|' NULL AS ''"
else
echo "File ${TABLE_FILE} not found."
fi
done
echo "Data loading completed."
4. Check Data Volume in Tables
Table Name | Number of Records |
---|---|
customer | 15,000,000 |
lineitem | 600,037,902 |
nation | 25 |
orders | 150,000,000 |
part | 20,000,000 |
partsupp | 80,000,000 |
region | 5 |
supplier | 1,000,000 |
5. Execute Performance Tests
gsql -p 5432 -d tpch -f tpch.sql
The total time
obtained represents the TPCH performance metric. During execution, adjust parameters such as work_memory
and cstore_buffers
based on the execution plan and resource usage to achieve minimal response time.
III. Performance Tuning
Database performance directly affects application response time and user experience. Performance tuning based on the TPCH model for GBase 8c can significantly improve system processing capabilities and stability. The basic principles of tuning are as follows:
- Understand the System Architecture: Gain an in-depth understanding of GBase 8c components, including the storage engine and query optimizer.
- Data Modeling and Index Optimization: Design data models effectively and create indexes to accelerate queries.
- Hardware Resource Allocation: Allocate CPU, memory, and storage resources based on system load.
- SQL Optimization: Write efficient SQL statements to avoid unnecessary calculations and data access.
- Monitoring and Analysis: Continuously monitor system performance, analyze bottlenecks, and make targeted optimizations.
Tuning Steps:
- Baseline Testing: Conduct baseline performance tests on GBase 8c using the TPCH model.
- Problem Identification: Use monitoring tools to identify performance bottlenecks, such as CPU usage and I/O wait times.
- Parameter Adjustment: Adjust database configuration parameters, such as cache size and concurrency control, based on bottlenecks.
- Index Optimization: Analyze query patterns and create or adjust indexes to improve query efficiency.
- SQL Tuning: Optimize SQL statements to reduce resource consumption and execution time.
- Hardware Upgrades: Upgrade hardware as needed, such as adding more memory or using faster storage media.
- Repeat Testing: After adjustments, rerun TPCH tests to verify performance improvements.
Through performance tuning of the GBase database based on the TPCH model, we have not only improved system processing capabilities but also gained a deeper understanding of performance tuning principles and methods. As technology continues to advance, performance tuning will remain a critical aspect of database management.
Posted on August 28, 2024
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.
Related
November 29, 2024