Apache Kafka: What is and how it works
joaosczip
Posted on March 2, 2021
Introduction
When we start to study microservices, many different concepts, patterns, protocols, and tools are presented to us, like Messaging, AMQP, RabbitMQ, Event-sourcing, gRPC, CQRS, and many others. Among so many different words, there is one that always caught my attention and I always wanted to start studying it, the Apache Kafka.
The problem is that people always have a misunderstanding about what Kafka really is and is very common to see people talking about it like it as if it were just another messaging system. In fact, you can do that, but use Kafka only for that may be a waste of resources.
For that, my goal with this article is to explain how Kafka works and why you may need to use it in your projects. I’ll start presenting the principal concepts, explain why it’s not just a messaging system, I hope that it may help people that are entering this world, as I’m.
What is Apache Kafka?
If you enter the Kafka website, you’ll find the definition of it right on the first page:
A distributed streaming platform
What is an “A distributed streaming platform”? First, we need to define what is a stream. For that, I have a definition that made me really understand it: Streams are just infinite data, data that never end. It just keeping arriving, and you can process it in real-time.
And distributed? Distributed means that Kafka works in a cluster, each node in the cluster is called Broker. Those brokers are just servers executing a copy of apache Kafka.
So, basically, Kafka is a set of machines working together to be able to handle and process real-time infinite data.
His distributed architecture is one of the reasons that made Kafka so popular. The Brokers is what makes it so resilient, reliable, scalable, and fault-tolerant. That’s why Kafka is so performer and secure.
But why there is this misconception that Kafka is just another messaging system? To respond to that answer, we need first to explain how messaging works.
Messaging
Messaging, very briefly, it’s just the act to send a message from one place to another. It has three principal actors:
- Producer: Who produces and send the messages to one or more queues;
- Queue: A buffer data structure, that receives (from the producers) and delivers messages (to the consumers) in a FIFO (First-In-First-Out) way. When a message is delivered, it’s removed forever from the queue, there’s no chance to get it back;
- Consumer: Who is subscribed to one or more queues and receives their messages when published.
And that is it, this is how the messaging works (very briefly, there’s a lot more). As you can see there’s nothing about streams, real-time, cluster (depending on what tool you chose, you can use a cluster too, but it’s not native, like Kafka).
Kafka Architecture
Now that we know how messaging works, let’s get dive into the Kafka world. In Kafka, we also have the Producers and Consumers, they work in a very similar way that they work in the messaging, both producing and consuming messages.
https://www.cloudkarafka.com/blog/2016-11-30-part1-kafka-for-beginners-what-is-apache-kafka.html
As you can see it is very similar to what we’ve discussed about messaging, but here we don’t have the Queue concept, instead, we have the concept of Topics.
The Topic is a very specific type of data stream, it’s very similar to a Queue, it receives and delivers messages as well, but there are some concepts that we need to understand about topics:
A topic is divided into partitions, each topic can have one or more partitions and we need to specify that number when we're creating the topic. You can imagine the topic as a folder in the operation system and each folder inside her as a partition;
If we don’t give any key to a message when we’re producing it, by default, the producers will send the message in a round-robin way, each partition will receive a message (even if they are sent by the same producer). Because of that, we aren’t able to guarantee the delivery order at the partition level, if we want to send a message always to the same partition, we need to give a key to our messages. This will ensure that the message will always be sent to the same partition;
Each message will be stored in the broker disk and will receive an offset (unique identifier). This offset is unique at the partition level, each partition has its owns offsets. That is one more reason that makes Kafka so special, it stores the messages in the disk (like a database, and in fact, Kafka is a database too) to be able to recover them later if necessary. Different from a messaging system, that the message is deleted after being consumed;
The producers use the offset to read the messages, they read from the oldest to the newest. In case of consumer failure, when it recovers, will start reading from the last offset.
The partitions of a topic — https://d2h0cx97tjks2p.cloudfront.net/blogs/wp-content/uploads/sites/2/2018/07/Kafka-Topic-Partitions-Layout.png
Brokers
As said before, Kafka works in a distributed way. A Kafka cluster may contain many brokers as needed.
Each broker in a cluster is identified by an ID and contains at least one partition of a topic. To configure the number of the partitions in each broker, we need to configure something called Replication Factor when creating a topic. Let’s say that we have three brokers in our cluster, a topic with three partitions and a Replication Factor of three, in that case, each broker will be responsible for one partition of the topic.
The Replication Factor — https://fullcycle.com.br/apache-kafka-trabalhando-com-mensageria-e-real-time/
As you can see in the above image, Topic_1 has three partitions, each broker is responsible for a partition of the topic, so, the Replication Factor of the Topic_1 is three.
It’s very important that the number of the partitions match the number of the brokers, in this way, each broker will be responsible for a single partition of the topic.
To ensure the reliability of the cluster, Kafka enters with the concept of the Partition Leader. Each partition of a topic in a broker is the leader of the partition and can exist only one leader per partition. The leader is the only one that receives the messages, their replicas will just sync the data (they need to be in-sync to that). It will ensure that even if a broker goes down, his data won’t be lost, because of the replicas.
When a leader goes down, a replica will be automatically elected as a new leader by Zookeeper.
Partition Leader — https://www.educba.com/kafka-replication/
In the above image, Broker 1 is the leader of Partition 1 of Topic 1 and has a replica in Broker 2. Let’s say that Broker 1 dies, when it happens, Zookeeper will detect that change and will make Broker 2 the leader of Partition 1. This is what makes the distributed architecture of Kafka so powerful.
Producers
Just like in the messaging world, Producers in Kafka are the ones who produce and send the messages to the topics.
As said before, the messages are sent in a round-robin way. Ex: Message 01 goes to partition 0 of Topic 1, and message 02 to partition 1 of the same topic. It means that we can’t guarantee that messages produced by the same producer will always be delivered to the same topic. We need to specify a key when sending the message, Kafka will generate a hash based on that key and will know what partition to deliver that message.
That hash takes into consideration the number of the partitions of the topic, that’s why that number cannot be changed when the topic is already created.
When we are working with the concept of messages, there’s something called Acknowledgment (ack). The ack is basically a confirmation that the message was delivered. In Kafka, we can configure this ack when producing the messages. There are three different levels of configuration for that:
ack = 0: When we configure the ack = 0, we’re saying that we don’t want to receive the ack from Kafka. In case of broker failure, the message will be lost;
ack = 1: This is the default configuration, with that we’re saying that we want to receive an ack from the leader of the partition. The data will only be lost if the leader goes down (still there’s a chance);
ack = all: This is the most reliable configuration. We are saying that we want to not only receive a confirmation from the leader but from their replicas as well. This is the most secure configuration since there’s no data loss. Remembering that the replicas need to be in-sync (ISR). If a single replica isn’t, Kafka will wait for the sync to send back de ack.
ack = all — https://static.javatpoint.com/tutorial/kafka/images/apache-kafka-producer5.png
Consumers and Consumers Groups
Consumers are applications subscribed to one or more topics that will read messages from there. They can read from one or more partitions.
When a consumer reads from just one partition, we can ensure the order of the reading, but when a single consumer reads from two or more partitions, it will read in parallel, so, there’s no guarantee of the reading order. A message that came later can be read before another that came earlier, for example.
That’s why we need to be careful when choosing the number of partitions and when producing the messages.
Another important concept of Kafka is the Consumer Groups. It’s really important when we need to scale the messages reading.
It becomes very costly when a single consumer needs to read from many partitions, so, we need o load-balancing this charge between our consumers, this is when the consumer groups enter.
The data from a single topic will be load-balancing between the consumers, with that, we can guarantee that our consumers will be able to handle and process the data.
The ideal is to have the same number of consumers in a group that we have as partitions in a topic, in this way, every consumer read from only one. When adding consumers to a group, you need to be careful, if the number of consumers is greater than the number of partitions, some consumers will not read from any topic and will stay idle.
Consumer Groups — https://docs.cloudera.com/cdp-private-cloud-base/latest/kafka-developing-applications/topics/kafka-develop-groups-fetching.html
Conclusion
Kafka is one of the most completes and performer tools that we have today. After studying it and get their principal concepts, I realized that is much more than just a messaging system. Your distributed architecture and real-time processing are what make it so adaptable to very different scenarios and use cases.
With this article, I tried to dive deep into Kafka, presenting their principal concepts to make it easier to understand and bring their differences from the traditional messaging systems.
Because of the time, I left some important concepts out of this article, like idempotent producers, compression, transactions, etc, but I really hope that this helps you to study and explore more this world.
References
Posted on March 2, 2021
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.