Introduction to ClickHouse at scale
Mohammad Arab Anvari
Posted on May 17, 2023
Introduction
When a service has to handle more load, we usually prefer scaling out to scaling up, because scaling out tends to cost less.
Scaling out = adding more components in parallel to spread out a load. Scaling up = making a component bigger or faster so that it can handle more load. Source
In ClickHouse terminology, scaling out means sharding, and replicas are used to ensure availability. In this article, we will learn about clusters, replicas, and sharding in ClickHouse.
Replication is used for data integrity and automatic failover. Sharding is used for horizontal scaling of the cluster. (From Reference 1)
Sharding
Suppose we have a table on a ClickHouse node (the yellow table on host1). After a while the data grows and the request rate increases, so we decide to scale out the node, and host2 comes onto the scene. We shard the table and place the second shard on host2.
- How to make a select query to the table?
  - We can query the table directly on each host. In this case, we must know which data lives in each shard.
  - We can create a distributed table. It can be created on each node, and when we query the distributed table it reads the data from the proper shards. It stores no data itself, only the information needed to reach the data in each shard.
- How to insert data into the table?
  - We can insert data directly into the shards if we have a predefined placement scheme or something similar.
  - We can also insert data into the distributed table, and it routes the rows according to the defined sharding key. You can read more about the sharding key in the ClickHouse docs.
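The two paths above (inserting through a sharding key and selecting through a distributed table) can be sketched with a toy model. This is not ClickHouse code: the class, its method names and the plain integer key are assumptions made for illustration. It only mirrors the documented idea that a row goes to the shard whose weight interval contains the remainder of the sharding-key value divided by the total weight.

```python
from bisect import bisect_right
from itertools import accumulate


class DistributedTable:
    """Toy stand-in for a distributed table: it owns no rows, only shard references."""

    def __init__(self, shards, weights=None):
        self.shards = shards                          # each shard is a plain list of rows
        self.weights = weights or [1] * len(shards)
        # Upper bound of each shard's remainder interval, e.g. weights [1, 1] -> [1, 2]
        self.bounds = list(accumulate(self.weights))

    def shard_index(self, key_value):
        # Remainder of the sharding-key value divided by the total weight
        remainder = key_value % self.bounds[-1]
        return bisect_right(self.bounds, remainder)

    def insert(self, row, key):
        # Route the row to exactly one shard
        self.shards[self.shard_index(row[key])].append(row)

    def select_count(self):
        # Fan out to every shard and merge the partial results
        return sum(len(shard) for shard in self.shards)


# Hypothetical usage: two shards standing in for host1 and host2
host1_rows, host2_rows = [], []
table = DistributedTable([host1_rows, host2_rows])
for user_id in range(10):
    table.insert({"user_id": user_id}, key="user_id")
```

With equal weights, even keys land in the first list and odd keys in the second, while the count is computed across both. In practice the sharding key is often a hash or random expression rather than a raw id, so that rows spread evenly.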
Replication
Suppose we run an application that depends on ClickHouse, or we need online analytics backed by ClickHouse. What happens if one node goes down or a hardware issue arises? We certainly don't want to lose the service, so we need replication for each table.
Replication is only available for tables in the *MergeTree* family.
ClickHouse Keeper coordinates replication in ClickHouse. It is compatible with ZooKeeper and is the alternative to ZooKeeper that ClickHouse provides.
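To give a feel for what a coordination service does here, the following is a deliberately simplified toy model, not the real replication protocol. It assumes a shared log per table path and replicas that catch up from it. The class names and the path are hypothetical, and the log holds the data itself only to keep the sketch short.

```python
class Keeper:
    """Toy coordination service: one shared log of inserted parts per table path."""

    def __init__(self):
        self.logs = {}

    def log_for(self, path):
        return self.logs.setdefault(path, [])


class Replica:
    def __init__(self, keeper, path, name):
        self.name = name
        self.log = keeper.log_for(path)   # replicas with the same path share one log
        self.parts = []                   # data parts stored locally
        self.applied = 0                  # number of log entries already applied

    def insert(self, rows):
        self.sync()                       # catch up before appending
        self.log.append(list(rows))
        self.sync()

    def sync(self):
        # Apply every log entry this replica has not seen yet
        for part in self.log[self.applied:]:
            self.parts.append(part)
        self.applied = len(self.log)


# Hypothetical usage: two replicas of the same shard
keeper = Keeper()
r1 = Replica(keeper, "/tables/shard1/my_table", "replica1")
r2 = Replica(keeper, "/tables/shard1/my_table", "replica2")
r1.insert([1, 2])
r2.insert([3])
r1.sync()
```

After the final sync both replicas hold the same parts, regardless of which one received each insert. That is the property that lets one node fail without losing the table.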
Cluster
A cluster is a way to manage sharding and replication across a set of nodes. We can define many cluster topologies over the same nodes. Every time we add a table to a cluster, it is sharded and replicated in the way that cluster defines.
How to apply Sharding/Replication/Clustering in ClickHouse?
Set up ClickHouse Cluster
In the picture below, we have 4 nodes. We define a cluster named cluster1 in the ClickHouse config file. Each table associated with this cluster will have 2 shards with 2 replicas each (but they must be Replicated* tables).
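As a rough sketch of what the cluster1 topology means, the snippet below models 2 shards with 2 replicas each as a small data structure. Which host belongs to which shard is an assumption (the picture decides that), and pick_hosts is a hypothetical helper, not a ClickHouse API.

```python
from dataclasses import dataclass


@dataclass
class Shard:
    replicas: list  # host names that each hold a full copy of this shard


# Assumed layout for cluster1: the host-to-shard mapping is a guess
cluster1 = [
    Shard(replicas=["host1", "host2"]),
    Shard(replicas=["host3", "host4"]),
]


def pick_hosts(cluster, down=frozenset()):
    """Choose one live replica per shard; return None if a whole shard is down."""
    chosen = []
    for shard in cluster:
        live = [host for host in shard.replicas if host not in down]
        if not live:
            return None
        chosen.append(live[0])
    return chosen
```

A query over the whole table needs one reachable replica in every shard. Under this layout, losing host1 is harmless, while losing both host3 and host4 makes half of the data unreachable.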
Set up ClickHouse Keeper
See this part of Reference 1.
Create tables
Every time we want to create a table with the cluster1 topology, we should use the ON CLUSTER clause. For example, if we want to create the table:
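A statement of this kind could look like the sketch below. The column list, the ORDER BY key, the Keeper path and the per-node macro values are all assumptions made for illustration. The Python wrapper only substitutes each node's macros into the statement to show how the engine arguments would resolve on that node.

```python
# Hypothetical statement: columns, ORDER BY key and Keeper path are assumptions.
CREATE_TABLE = """
CREATE TABLE mydb.my_table ON CLUSTER cluster1
(
    id UInt64,
    created_at DateTime
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/mydb/my_table', '{replica}')
ORDER BY id
"""

# Assumed per-node macro values (normally defined in each node's server config).
MACROS = {
    "host1": {"shard": "01", "replica": "host1"},
    "host2": {"shard": "01", "replica": "host2"},
    "host3": {"shard": "02", "replica": "host3"},
    "host4": {"shard": "02", "replica": "host4"},
}


def expand(sql, macros):
    """Substitute {name} placeholders with one node's macro values."""
    for name, value in macros.items():
        sql = sql.replace("{" + name + "}", value)
    return sql


def engine_line(host):
    """Return the ENGINE line of the statement as the given host would resolve it."""
    resolved = expand(CREATE_TABLE, MACROS[host])
    return next(line for line in resolved.splitlines() if line.startswith("ENGINE"))


for host in MACROS:
    print(host, engine_line(host))
```

Under these assumed macros, host1 and host2 resolve to the same path with different replica names, so they form one replicated shard. host3 and host4 resolve to a different path and form the other.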
The first parameter of the ReplicatedMergeTree engine is the ClickHouse Keeper path of the table, and the second one is the replica name. Tables with the same path and different replica names are replicas of each other (ClickHouse Keeper coordinates this).
We should define {shard} and {replica} as macros on each node.
Once mydb.my_table is created on any one of host1, host2, host3, or host4, it is created on all the other nodes as well.
Finally, we create a distributed table so that we can query mydb.my_table, which has 2 shards, through a single table.
References
- Great video by ClickHouse on YouTube