ClickHouse Advanced Tutorial: Performance Comparison with MySQL

hoptical

Hamed Karbasi

Posted on June 8, 2023

ClickHouse Advanced Tutorial: Performance Comparison with MySQL

Table of Contents

 1. Introduction

       1.1. OLTP

       1.2. OLAP

       1.3. MySQL

       1.4. ClickHouse
 2. Comparison Case Study

       2.5. System Specification

       2.6. Benchmark Flow

       2.7. Queries
 3. Results

       3.8. Dataset Load

       3.9. Table Size

       3.10. Read Queries Execution

       3.11. Update Query Execution
 4. Conclusion

Introduction

Nothing is perfect. In terms of databases, you can't expect the best performance for every task and query from your deployed database. However, the vital step as a software developer is to know their strengths and weaknesses and how to deal with them.

In this post, I will compare Clickhouse as a representative of the OLAP database and MySQL of the OLTP. This will help us to choose better solutions for our challenges according to our conditions and desires. Before jumping into the main context, let's discuss OLTP, OLAP, MySQL, and ClickHouse.

OLTP

OLTP stands for Online Transaction Processing and is used for day-to-day operations, such as processing orders and updating customer information. OLTP is best for short, fast transactions and is optimized for quick response times. It is essential to ensure data accuracy and consistency and provide an efficient way to access data.

OLAP

OLAP stands for Online Analytical Processing and is used for data mining and analysis. It enables organizations to analyze large amounts of data from multiple perspectives and identify trends and patterns. OLAP is best for complex queries and data mining and can provide impossible insights with traditional reporting tools.

MySQL

MySQL is a popular open-source database management system. It is used to store and manage data and is utilized by websites and applications to store and manage information. MySQL is a relational database management system that holds data in tables and allows users to query the data. It also provides features such as triggers, stored procedures, and views. MySQL is easy to use and has a wide range of features that can be used to create powerful and efficient applications.

ClickHouse

ClickHouse is an open-source column-oriented database management system developed by Yandex. It is designed to provide high performance for analytical queries.
ClickHouse uses a SQL-like query language for querying data and supports different data types, including integers, strings, dates, and floats. It offers various features such as clustering, distributed query processing, and fault tolerance. It also supports replication and data sharding. You can know more about this database by visiting the first part of this series:

Now we can talk about the performance comparison.

Comparison Case Study

I've followed the Clickbench repository methodology for the case study. It uses the hits dataset obtained from the actual traffic recording of one of the world's largest web analytics platforms. hits contain about 100M rows as a single flat table. This repository studies more than 20 databases regarding dataset load time, elapsed time for 43 OLAP queries, and occupied storage. You can access their visualized results here.

GitHub logo ClickHouse / ClickBench

ClickBench: a Benchmark For Analytical Databases

ClickBench: a Benchmark For Analytical Databases

https://benchmark.clickhouse.com/

Discussion: https://news.ycombinator.com/item?id=32084571

Overview

This benchmark represents typical workload in the following areas: clickstream and traffic analysis, web analytics, machine-generated data, structured logs, and events data. It covers the typical queries in ad-hoc analytics and real-time dashboards.

The dataset from this benchmark was obtained from the actual traffic recording of one of the world's largest web analytics platforms. It is anonymized while keeping all the essential distributions of the data. The set of queries was improvised to reflect the realistic workloads, while the queries are not directly from production.

Goals

The main goals of this benchmark are:

Reproducibility

You can quickly reproduce every test in as little as 20 minutes (although some systems may take several hours) in a semi-automated way. The test setup is documented and uses inexpensive cloud VMs. The test process is documented in the form of a shell script, covering…

To investigate ClickHouse and MySQL performance specifically, I separated 10M rows of the table and chose some of the predefined queries that can make our point more clear. Those queries are mainly in OLAP manner, so they only show ClickHouse strengths compared to MySQL (i.e., MySQL loses in all those queries). Hence, I added other queries showing the opposite (OLTP queries). Although I've limited the benchmark to these two databases, you can generalize the concept to other row-oriented and column-oriented DBMSs.

Disclaimer: This benchmark only clarifies the main difference between column-oriented and row-oriented databases regarding their performance and use cases. It should not be considered a reference for your use cases. Hence, you should perform your benchmarks with your queries to achieve the best decision.

System Specification

Databases are installed on Ubuntu 22.04 LTS on a system with the below specifications:

  • CPU: Intel® Core™ i7-10510U CPU @ 1.80GHz × 8
  • RAM: 16 GiB
  • Storage: 256 GiB SSD

Benchmark Flow

  1. The database is created.
  2. The table is created with the defined DDL.
  3. Data (hits.tsv) is loaded into the table, and its time is measured.
  4. Queries are run, and each query's elapsed time is measured.

Queries

Query Number Statement Type
1 SELECT COUNT(*) FROM hits; OLAP
2 SELECT SUM(AdvEngineID), COUNT(*), AVG(ResolutionWidth) FROM hits; OLAP
3 SELECT URL, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate >= '2013-07-01' AND EventDate <= '2013-07-31' AND DontCountHits = 0 AND IsRefresh = 0 AND URL <> '' GROUP BY URL ORDER BY PageViews DESC LIMIT 10; OLAP
4 SELECT WatchID, ClientIP, COUNT(*) AS c, SUM(IsRefresh), AVG(ResolutionWidth) FROM hits GROUP BY WatchID, ClientIP ORDER BY c DESC LIMIT 10; OLAP
5 SELECT EventTime, WatchID FROM hits WHERE CounterID = 38 AND EventDate = '2013-07-15' AND UserID = '1387668437822950552' AND WatchID = '8899477221003616239'; OLTP
6 SELECT Title, URL, Referer FROM hits WHERE CounterID = 38 AND EventDate = '2013-07-15' AND UserID = '1387668437822950552' AND WatchID = '8899477221003616239'; OLTP
7 UPDATE hits SET Title='my title', URL='my url', Referer='my referer' WHERE CounterID = 38 AND EventDate = '2013-07-15' AND UserID = '1387668437822950552' AND WatchID = '8899477221003616239'; OLTP

Results

I'll study the results under four categories:

  • Dataset Load
  • Table Size
  • Read Queries Execution
  • Update Query Execution: I've discussed the update query (query number 7) separately since it needs more discussion and attention.

Dataset Load

ClickHouse MySQL Ratio
65s 11m35s x10.7

Thanks to the LSM and sparse indexes, ClickHouse load time is much faster than MySQL, which uses BTree. However, ClickHouse inserts efficiency is observable in bulk inserts instead of many individual inserts. This behavior comes from the fact that it creates immutable parts for each insert and is unwilling to change, remove or create its data for a few rows.

Table Size

ClickHouse (GiB) MySQL (GiB) Ratio
1.3 6.32 x4.86

The column-oriented structure gives the ability of Data Compression, something that is not available in row-oriented databases. That is why ClickHouse can do a practical favor to the teams storing a high amount of data, reducing the storage cost.

Read Queries Execution

Query Number ClickHouse (s) MySQL (s) Ratio
1 0.005 7.79 x1558
2 0.030 16.0 x533.3
3 0.193 4.35 x22.5
4 2.600 180.93 x69.58
5 0.01 0.00 x0
6 0.011 0.00 x0

ClickHouse's sparse index and column-oriented structure have outperformed MySQL in all OLAP queries (numbers 1 to 4). That's why BI and Data Analysts would be more than happy with ClickHouse for their daily reports.

However, MySQL wins the battle when it comes to OLTP queries (numbers 5 and 6). Btree (equipped by MySQL) indeed performs better for pointy queries in which you demand short transactions requiring few rows.

Update Query Execution

For the update query (number 7), we should execute a different query in ClickHouse as it doesn't support updates in a naive way, and the Alter command has to be used:



ALTER TABLE hits UPDATE JavaEnable=0 WHERE CounterID = 38 AND EventDate = '2013-07-15' AND UserID = '1387668437822950552' AND WatchID = '8899477221003616239';


Enter fullscreen mode Exit fullscreen mode

Additionally, ClickHouse applies the update asynchronously. To have the result immediately, you've to perform an optimize command:



OPTIMIZE TABLE hits FINAL;


Enter fullscreen mode Exit fullscreen mode

By performing query number 7 statement shown in the queries table for MySQL and the two above SQL statements for ClickHouse, we achieve the below results:

Query Number ClickHouse (s) MySQL (s) Ratio
7 26 0.00 0

Again, ClickHouse mutation hatred makes it a loser for real-time updates (and similarly deletes) compared to MySQL. Consequently, other methods like deduplication using ReplacingMergeTree can be utilized to handle updates. You can find valuable resources in the below links:

Conclusion

In this post, I benchmarked MySQL and ClickHouse databases to study some of their strengths and weaknesses that may help us choose a suitable solution. To summarize:

  • MySQL performs better on pointy and OLTP queries.
  • ClickHoues performs better on OLAP queries.
  • ClickHouse is not designed for frequent updates and deletes. You have to handle them with deduplication methods.
  • ClickHouse reduces the storage cost thanks to its column-oriented structure.
  • ClickHouse bulk inserts load time operates far better than MySQL.
💖 💪 🙅 🚩
hoptical
Hamed Karbasi

Posted on June 8, 2023

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related