Nothing is perfect. In terms of databases, you can't expect the best performance for every task and query from your deployed database. However, the vital step as a software developer is to know their strengths and weaknesses and how to deal with them.
In this post, I will compare Clickhouse as a representative of the OLAP database and MySQL of the OLTP. This will help us to choose better solutions for our challenges according to our conditions and desires. Before jumping into the main context, let's discuss OLTP, OLAP, MySQL, and ClickHouse.
OLTP
OLTP stands for Online Transaction Processing and is used for day-to-day operations, such as processing orders and updating customer information. OLTP is best for short, fast transactions and is optimized for quick response times. It is essential to ensure data accuracy and consistency and provide an efficient way to access data.
OLAP
OLAP stands for Online Analytical Processing and is used for data mining and analysis. It enables organizations to analyze large amounts of data from multiple perspectives and identify trends and patterns. OLAP is best for complex queries and data mining and can provide impossible insights with traditional reporting tools.
MySQL
MySQL is a popular open-source database management system. It is used to store and manage data and is utilized by websites and applications to store and manage information. MySQL is a relational database management system that holds data in tables and allows users to query the data. It also provides features such as triggers, stored procedures, and views. MySQL is easy to use and has a wide range of features that can be used to create powerful and efficient applications.
ClickHouse
ClickHouse is an open-source column-oriented database management system developed by Yandex. It is designed to provide high performance for analytical queries.
ClickHouse uses a SQL-like query language for querying data and supports different data types, including integers, strings, dates, and floats. It offers various features such as clustering, distributed query processing, and fault tolerance. It also supports replication and data sharding. You can know more about this database by visiting the first part of this series:
I've followed the Clickbench repository methodology for the case study. It uses the hits dataset obtained from the actual traffic recording of one of the world's largest web analytics platforms. hits contain about 100M rows as a single flat table. This repository studies more than 20 databases regarding dataset load time, elapsed time for 43 OLAP queries, and occupied storage. You can access their visualized results here.
This benchmark represents typical workload in the following areas: clickstream and traffic analysis, web analytics, machine-generated data, structured logs, and events data. It covers the typical queries in ad-hoc analytics and real-time dashboards.
The dataset from this benchmark was obtained from the actual traffic recording of one of the world's largest web analytics platforms. It is anonymized while keeping all the essential distributions of the data. The set of queries was improvised to reflect the realistic workloads, while the queries are not directly from production.
Goals
The main goals of this benchmark are:
Reproducibility
You can quickly reproduce every test in as little as 20 minutes (although some systems may take several hours) in a semi-automated way. The test setup is documented and uses inexpensive cloud VMs. The test process is documented in the form of a shell script, covering…
To investigate ClickHouse and MySQL performance specifically, I separated 10M rows of the table and chose some of the predefined queries that can make our point more clear. Those queries are mainly in OLAP manner, so they only show ClickHouse strengths compared to MySQL (i.e., MySQL loses in all those queries). Hence, I added other queries showing the opposite (OLTP queries). Although I've limited the benchmark to these two databases, you can generalize the concept to other row-oriented and column-oriented DBMSs.
Disclaimer: This benchmark only clarifies the main difference between column-oriented and row-oriented databases regarding their performance and use cases. It should not be considered a reference for your use cases. Hence, you should perform your benchmarks with your queries to achieve the best decision.
System Specification
Databases are installed on Ubuntu 22.04 LTS on a system with the below specifications:
Data (hits.tsv) is loaded into the table, and its time is measured.
Queries are run, and each query's elapsed time is measured.
Queries
Query Number
Statement
Type
1
SELECT COUNT(*) FROM hits;
OLAP
2
SELECT SUM(AdvEngineID), COUNT(*), AVG(ResolutionWidth) FROM hits;
OLAP
3
SELECT URL, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-01' AND EventDate <= '2013-07-31' AND DontCountHits = 0 AND IsRefresh = 0 AND URL <> '' GROUP BY URL ORDER BY PageViews DESC LIMIT 10;
OLAP
4
SELECT WatchID, ClientIP, COUNT(*) AS c, SUM(IsRefresh), AVG(ResolutionWidth) FROM hits GROUP BY WatchID, ClientIP ORDER BY c DESC LIMIT 10;
OLAP
5
SELECT EventTime, WatchID FROM hits WHERE CounterID = 38 AND EventDate = '2013-07-15' AND UserID = '1387668437822950552' AND WatchID = '8899477221003616239';
OLTP
6
SELECT Title, URL, Referer FROM hits WHERE CounterID = 38 AND EventDate = '2013-07-15' AND UserID = '1387668437822950552' AND WatchID = '8899477221003616239';
OLTP
7
UPDATE hits SET Title='my title', URL='my url', Referer='my referer' WHERE CounterID = 38 AND EventDate = '2013-07-15' AND UserID = '1387668437822950552' AND WatchID = '8899477221003616239';
OLTP
Results
I'll study the results under four categories:
Dataset Load
Table Size
Read Queries Execution
Update Query Execution: I've discussed the update query (query number 7) separately since it needs more discussion and attention.
Dataset Load
ClickHouse
MySQL
Ratio
65s
11m35s
x10.7
Thanks to the LSM and sparse indexes, ClickHouse load time is much faster than MySQL, which uses BTree. However, ClickHouse inserts efficiency is observable in bulk inserts instead of many individual inserts. This behavior comes from the fact that it creates immutable parts for each insert and is unwilling to change, remove or create its data for a few rows.
Table Size
ClickHouse (GiB)
MySQL (GiB)
Ratio
1.3
6.32
x4.86
The column-oriented structure gives the ability of Data Compression, something that is not available in row-oriented databases. That is why ClickHouse can do a practical favor to the teams storing a high amount of data, reducing the storage cost.
Read Queries Execution
Query Number
ClickHouse (s)
MySQL (s)
Ratio
1
0.005
7.79
x1558
2
0.030
16.0
x533.3
3
0.193
4.35
x22.5
4
2.600
180.93
x69.58
5
0.01
0.00
x0
6
0.011
0.00
x0
ClickHouse's sparse index and column-oriented structure have outperformed MySQL in all OLAP queries (numbers 1 to 4). That's why BI and Data Analysts would be more than happy with ClickHouse for their daily reports.
However, MySQL wins the battle when it comes to OLTP queries (numbers 5 and 6). Btree (equipped by MySQL) indeed performs better for pointy queries in which you demand short transactions requiring few rows.
Update Query Execution
For the update query (number 7), we should execute a different query in ClickHouse as it doesn't support updates in a naive way, and the Alter command has to be used:
Additionally, ClickHouse applies the update asynchronously. To have the result immediately, you've to perform an optimize command:
OPTIMIZETABLEhitsFINAL;
By performing query number 7 statement shown in the queries table for MySQL and the two above SQL statements for ClickHouse, we achieve the below results:
Query Number
ClickHouse (s)
MySQL (s)
Ratio
7
26
0.00
0
Again, ClickHouse mutation hatred makes it a loser for real-time updates (and similarly deletes) compared to MySQL. Consequently, other methods like deduplication using ReplacingMergeTree can be utilized to handle updates. You can find valuable resources in the below links:
In this post, I benchmarked MySQL and ClickHouse databases to study some of their strengths and weaknesses that may help us choose a suitable solution. To summarize:
MySQL performs better on pointy and OLTP queries.
ClickHoues performs better on OLAP queries.
ClickHouse is not designed for frequent updates and deletes. You have to handle them with deduplication methods.
ClickHouse reduces the storage cost thanks to its column-oriented structure.
ClickHouse bulk inserts load time operates far better than MySQL.