[BTY#3] 24/11/2023 Kafka Fundamentals

jaysonnguyen

Posted on November 24, 2023

Kafka Tutorial (P1)

I. What is Apache Kafka?

  • Apache Kafka is an open-source distributed system consisting of servers and clients.
  • Apache Kafka is used primarily to build real-time data streaming pipelines.

Challenges of Data Integration

Kafka to the rescue

Data streams in Apache Kafka

Use cases of Apache Kafka:
  • Messaging systems
  • Activity Tracking
  • Gather metrics from many different locations, for example, IoT devices
  • Application logs analysis
  • De-coupling of system dependencies
  • Integration with Big Data technologies like Spark, Flink, Storm, Hadoop.
  • Event-sourcing store

II. Kafka Fundamentals

1. Kafka Topics

What is a Kafka Topic?
  • Kafka uses the concept of topics to organize related messages.
  • A topic is identified by its name.

Kafka Topics

  • Kafka topics can contain any kind of message in any format, and the sequence of all these messages is called a data stream.
  • Data in Kafka topics is deleted after one week by default (also called the default message retention period), and this value is configurable.
What are Kafka Partitions?
  • Topics are broken down into a number of partitions. A single topic may have more than one partition.
  • The number of partitions of a topic is specified at the time of topic creation.
  • Partitions are numbered starting from 0 to N-1, where N is the number of partitions.
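To make this concrete, here is a minimal sketch of creating a topic with a fixed partition count, using the AdminClient from the kafka-clients Java library (the topic name, partition count, replication factor, and bootstrap address are illustrative, not from the article):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.List;
import java.util.Properties;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative address
        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions (numbered 0..2), replication factor 1
            NewTopic topic = new NewTopic("trucks_gps", 3, (short) 1);
            admin.createTopics(List.of(topic)).all().get(); // block until created
        }
    }
}
```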

Topic partitions

  • The offset is an integer value that Kafka adds to each message as it is written into a partition. Each message in a given partition has a unique offset.
Kafka Topic example

Topic example

  • We can create a topic named trucks_gps to which the trucks publish their positions.
  • Each truck may send a message to Kafka every 20 seconds; each message contains the truck ID and the truck's position (latitude and longitude).
What are Kafka Offsets?
  • Apache Kafka offsets represent the position of a message within a Kafka Partition.
  • Offset numbering for every partition starts at 0 and is incremented for each message sent to a specific Kafka partition.
  • Even though messages in Kafka topics are deleted over time, the offsets are not re-used; they are continually incremented in a never-ending sequence.

2. Kafka Producers

What are Kafka Producers?
  • Kafka producers send data into Kafka topics, and messages are distributed to partitions according to a mechanism such as key hashing (more on this below).
  • For a message to be successfully written into a Kafka topic, a producer must specify a level of acknowledgment (acks).
Message Keys

Message key

  • Kafka message keys are commonly used when there is a need for message ordering for all messages sharing the same field.
  • Each event message contains an optional key and a value.
  • If key=null, messages are sent in a round-robin fashion (p0 -> p1 -> p2 -> ... -> p0 -> p1 -> ...).
  • If key!=null, all messages that share the same key will always be sent and stored in the same Kafka partition.
  • A key can be anything to identify a message - a string, numeric value, binary value, etc.
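A minimal producer sketch tying this to the trucks_gps example from earlier (bootstrap address and payloads are illustrative): sending with the truck ID as the key keeps each truck's messages in one partition, and therefore in order.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class TruckPositionProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key ("truck-42") => same partition => per-truck ordering.
            producer.send(new ProducerRecord<>("trucks_gps", "truck-42", "{\"lat\":10.76,\"lon\":106.66}"));
            // key = null => messages are spread across partitions round-robin instead.
            producer.send(new ProducerRecord<>("trucks_gps", null, "{\"lat\":21.02,\"lon\":105.85}"));
        } // close() flushes any pending messages
    }
}
```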
Kafka Message Anatomy

Message structure

  • Key. The key is optional in the Kafka message and can be null. A key may be a string, number, or any object, and it is serialized into binary format.
  • Value. The value represents the content of the message and can also be null. The value format is arbitrary and is likewise serialized into binary format.
  • Compression Type. Kafka messages may be compressed. The compression type can be specified as part of the message. Options are none, gzip, lz4, snappy, and zstd.
  • Headers. There can be an optional list of Kafka message headers in the form of key-value pairs. It is common to add headers to specify metadata about the message, especially for tracing.
  • Partition + Offset. Once a message is sent to a Kafka topic, it receives a partition number and an offset id. The combination of topic + partition + offset uniquely identifies the message.
  • Timestamp. A timestamp is added to the message, either by the user or by the system.
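As a sketch of how these fields appear in the Java client (topic, key, value, and header names here are illustrative): ProducerRecord exposes partition, timestamp, key, value, and headers directly, while compression is normally configured producer-wide via the compression.type setting rather than per record.

```java
import org.apache.kafka.clients.producer.ProducerRecord;
import java.nio.charset.StandardCharsets;

public class RecordAnatomy {
    public static void main(String[] args) {
        // Arguments: topic, partition (null = let the partitioner choose),
        // timestamp (null = current time), key, value
        ProducerRecord<String, String> record = new ProducerRecord<>(
                "trucks_gps", null, null, "truck-42", "{\"lat\":10.76,\"lon\":106.66}");
        // Optional headers: key-value metadata, commonly used for tracing
        record.headers().add("trace-id", "abc-123".getBytes(StandardCharsets.UTF_8));
        System.out.println(record);
    }
}
```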
Kafka Message Serializers

Message serializers
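Kafka itself only ever stores bytes, so producers must be configured with serializers that turn keys and values into byte arrays (for example the built-in StringSerializer used in the producer sketch above). As a toy illustration of the Serializer interface, not anything Kafka ships:

```java
import org.apache.kafka.common.serialization.Serializer;
import java.nio.charset.StandardCharsets;

// A toy serializer: the producer calls serialize() on each key and value
// to obtain the byte[] that is actually sent to the broker.
public class UpperCaseStringSerializer implements Serializer<String> {
    @Override
    public byte[] serialize(String topic, String data) {
        return data == null ? null : data.toUpperCase().getBytes(StandardCharsets.UTF_8);
    }
}
```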

Kafka Message Key Hashing

Default Partitioner

  • In the default Kafka partitioner, the keys are hashed using the murmur2 algorithm.
targetPartition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions

3. Kafka Consumers

What are Kafka Consumers?
  • Kafka Consumers pull data from Kafka topics.
  • Consumers can read from one or more partitions at a time in Apache Kafka, and data is read in order within each partition.

Kafka Consumers

  • A consumer always reads data from a lower offset to a higher offset and cannot read data backwards.
  • If the consumer consumes data from more than one partition, the message order is not guaranteed across multiple partitions because they are consumed simultaneously, but the message read order is still guaranteed within each individual partition.
  • By default, Kafka consumers only consume data that was produced after they first connect to Kafka.
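A minimal consumer sketch under the same illustrative assumptions as the producer example (topic trucks_gps, local broker); the auto.offset.reset setting is what controls the "only new data by default" behavior mentioned above:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class TruckPositionConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative address
        props.put("group.id", "gps-dashboard");           // illustrative group name
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // "latest" (the default) means a brand-new group only sees data produced
        // after it connects; "earliest" would read the topic from the beginning.
        props.put("auto.offset.reset", "latest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("trucks_gps"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> r : records) {
                    // Order is guaranteed within r.partition(), not across partitions.
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            r.partition(), r.offset(), r.key(), r.value());
                }
            }
        }
    }
}
```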
Kafka Message Deserializers

Deserializers

4. Kafka Consumer Groups & Offsets

Kafka Consumer Groups

Kafka Consumer Groups

  • Consumers that are part of the same application and therefore performing the same "logical job" can be grouped together as a Kafka consumer group.
  • The benefit of leveraging a Kafka consumer group is that the consumers within the group will coordinate to split the work of reading from different partitions.

More consumers than partitions

  • If there are more consumers than the number of partitions of a topic, then some of the consumers will remain inactive.
Kafka Consumer Offsets
  • Kafka brokers use an internal topic named __consumer_offsets that keeps track of what messages a given consumer group last successfully processed.
  • Therefore, in order to "checkpoint" how far a consumer has read into a topic partition, the consumer regularly commits the latest processed message, also known as the consumer offset.
  • In the figure below, a consumer from the consumer group has consumed messages up to offset 4262, so the consumer offset is set to 4262.

Consumer Offset

  • The process of committing offsets is not done for every message consumed (because this would be inefficient), and instead is a periodic process.
  • This also means that when a specific offset is committed, all previous messages that have a lower offset are also considered to be committed.
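By default the Java consumer commits offsets automatically at a regular interval. Here is a hedged sketch of committing manually instead, so offsets are only checkpointed after a batch is fully processed (group and topic names carried over from the illustrative examples above):

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative address
        props.put("group.id", "gps-dashboard");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false"); // turn off periodic auto-commit

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("trucks_gps"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> process(r.value())); // process the whole batch first...
                // ...then checkpoint: commits the highest polled offset per partition,
                // which implicitly marks every earlier offset as processed too.
                consumer.commitSync();
            }
        }
    }

    static void process(String value) { /* application logic */ }
}
```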
Why use Consumer Offsets?
  • If a Kafka client crashes, a rebalance occurs, and the latest committed offset helps the remaining Kafka consumers know where to restart reading and processing messages.
  • In case a new consumer is added to a group, another consumer group rebalance happens and consumer offsets are yet again leveraged to notify consumers where to start reading data from.

5. Kafka Brokers

What is a Kafka Broker?
  • A single Kafka server is called a Kafka broker.
  • A Kafka broker is a program that runs on the Java Virtual Machine (Java version 11+).
What is a Kafka Cluster?
  • An ensemble of Kafka brokers working together is called a Kafka cluster.
  • A broker in a cluster is identified by a unique numeric ID.

Kafka Cluster

Kafka Brokers and Topics
  • Kafka brokers store data in a directory on the server disk they run on. Each topic-partition gets its own sub-directory named after the topic.
  • To achieve high throughput and scalability on topics, Kafka topics are partitioned. If there are multiple Kafka brokers in a cluster, then partitions for a given topic will be distributed among the brokers evenly, to achieve load balancing and scalability.

Kafka Topic Partitions

How do clients connect to a Kafka Cluster (bootstrap server)?
  • A client that wants to send or receive messages from the Kafka cluster may connect to any broker in the cluster.
  • Every broker in the cluster has metadata about all the other brokers and will help the client connect to them as well, and therefore any broker in the cluster is also called a bootstrap server.
  • The bootstrap server returns metadata to the client consisting of a list of all the brokers in the cluster. Then, when required, the client knows exactly which broker to connect to in order to send or receive data, and can accurately find which brokers hold the relevant topic-partitions.

Connecting to a Kafka Cluster

  • In practice, it is common for a Kafka client to reference at least two bootstrap servers in its connection string, so that if one of them is unavailable, the other can still respond to the connection request. This means Kafka clients (and the developers/DevOps behind them) do not need to know the hostname of every single broker in the cluster; referencing two or three brokers in the connection string is enough.
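For example, a client configuration referencing more than one bootstrap server might look like this (hostnames hypothetical):

```java
import java.util.Properties;

public class BootstrapConfig {
    static Properties clientProps() {
        Properties props = new Properties();
        // Listing two or three brokers is enough: whichever one answers first
        // returns metadata for the whole cluster. Hostnames are hypothetical.
        props.put("bootstrap.servers", "kafka-1.example.com:9092,kafka-2.example.com:9092");
        return props;
    }
}
```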

6. Kafka Topic Replication

  • One of the main reasons for Kafka's popularity is the resilience it offers in the face of broker failures. Machines fail, and we often cannot predict or prevent when that will happen. Kafka is designed with replication as a core feature, so it can withstand these failures while maintaining uptime and data accuracy.
Kafka Topic Replication Factor
  • Data Replication helps prevent data loss by writing the same data to more than one broker.
  • In Kafka, replication means that data is written not just to one broker, but to many.

Kafka Topic Replication

  • With a replication factor of 2, we can withstand the failure of one broker: if Broker 102 failed, as shown below, Brokers 101 and 103 would still have the data.
What are Kafka Partitions Leader and Replicas?
  • For a given topic-partition, one Kafka broker is designated by the cluster to be responsible for sending data to and receiving data from clients; that broker is known as the partition leader.
  • Each partition has one leader and may have multiple replicas.
What are In-Sync Replicas (ISR)?
  • An ISR is a replica that is up to date with the leader broker for a partition.

Leaders & In-Sync Replicas

  • Here, Broker 101 is the leader of Partition 0 and Broker 102 is the leader of Partition 1, while Broker 102 holds a replica of Partition 0 and Broker 103 a replica of Partition 1. If a leader broker were to fail, one of the in-sync replicas would be elected as the new partition leader.
Kafka producers acks setting
  • acks=0: producers consider messages "written successfully" the moment the message is sent, without waiting for the broker to accept it at all. If the broker goes offline or an exception occurs, we won't know and will lose data. This setting is useful for data where losing the occasional message is acceptable, such as metrics collection, and it yields the highest throughput because network overhead is minimized.

acks = 0

  • acks=1: producers consider messages "written successfully" when the message is acknowledged by the leader only. If an ack is not received, the producer may retry the request. If the leader broker goes offline before the replicas have replicated the data, data is lost.

acks = 1

  • acks=all: producers consider messages "written successfully" when the message is accepted by all in-sync replicas (ISR). The partition leader checks whether there are enough in-sync replicas to safely write the message (controlled by the broker setting min.insync.replicas). The request is buffered until the leader observes that the follower replicas have replicated the message, at which point a successful acknowledgement is sent back to the client.

acks = all
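A sketch of the corresponding producer configuration (hostnames hypothetical; note that acks=all is the default in Kafka 3.0+):

```java
import java.util.Properties;

public class DurableProducerConfig {
    static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-1.example.com:9092,kafka-2.example.com:9092"); // hypothetical
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all"); // wait for all in-sync replicas ("0" and "1" trade safety for latency)
        // Pair with the broker/topic setting min.insync.replicas=2 (on a replication
        // factor 3 topic) so that "all" guarantees at least two copies on disk.
        return props;
    }
}
```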

Kafka Topic Replication, ISR & Message Safety

Kafka Topic Durability & Availability
  • As a general rule, for a replication factor of N, you can permanently lose up to N-1 brokers and still recover your data.
Preferred leader
  • The preferred leader is the broker designated as a partition's leader at topic creation time (as opposed to being a replica).
  • If the preferred leader goes down, any in-sync replica (ISR) is eligible to become the new leader (though it does not become the preferred leader). Once the preferred leader broker recovers and its partition data is back in sync, it regains leadership of that partition.

7. Zookeeper with Kafka

What is Zookeeper in Kafka and what does Zookeeper do?
  • Zookeeper keeps track of which brokers are part of the Kafka cluster.
  • Zookeeper is used by Kafka brokers to determine which broker is the leader of a given partition and topic, and to perform leader elections.
  • Zookeeper stores configurations for topics and permissions.
  • Zookeeper sends notifications to Kafka in case of changes (e.g. a new topic, a broker dying, a broker coming up, topic deletion, etc.).
  • A Zookeeper cluster is called an ensemble. It is recommended to operate the ensemble with an odd number of servers, e.g. 3, 5, or 7, as a strict majority of ensemble members (a quorum) must be working for Zookeeper to respond to requests.
  • Zookeeper has a leader to handle writes; the rest of the servers are followers that handle reads.

Zookeeper in Kafka

Should you use Zookeeper with Kafka brokers?
  • As long as Kafka without Zookeeper (KRaft mode) is not production ready for your deployment, you must use Zookeeper in your production deployments of Apache Kafka.

Should you use Zookeeper with Kafka clients?
  • To be a great modern-day Kafka developer, never use Zookeeper as a configuration in your Kafka clients or other programs that connect to Kafka.

8. Kafka KRaft Mode

Why remove Zookeeper from Kafka?

Kafka scaling has hit performance bottlenecks with Zookeeper, leading to the following limitations:

  • Kafka clusters can only support a limited number of partitions (up to 200,000)
  • When a Kafka broker joins or leaves a cluster, a high number of leader elections must happen, which can overload Zookeeper and slow down the cluster temporarily
  • Kafka cluster setup is difficult and depends on another component to set up
  • Kafka cluster metadata can get out of sync with Zookeeper
  • Zookeeper security lags behind Kafka security

Kafka KRaft Mode

Removing Zookeeper means that Kafka must still form a quorum to perform controller elections; therefore the Kafka brokers implement the Raft protocol, giving the name KRaft to the new Kafka Metadata Quorum mode.

KRaft

Without Zookeeper, the following benefits are observed in Kafka:

  • Ability to scale to millions of partitions, easier to maintain and set up
  • Improved stability, easier to monitor, support, and administer
  • Single process to start Kafka
  • Single security model for the whole system
  • Faster controller shutdown and recovery time
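For reference, a minimal single-node KRaft configuration sketch (server.properties); the node id, hostnames, and paths are illustrative:

```properties
# Minimal single-node KRaft sketch: one process acting as broker + controller.
process.roles=broker,controller
node.id=1
controller.quorum.voters=1@localhost:9093
listeners=PLAINTEXT://localhost:9092,CONTROLLER://localhost:9093
controller.listener.names=CONTROLLER
log.dirs=/tmp/kraft-combined-logs
```

In KRaft mode the storage directory must also be formatted once before the first startup (see bin/kafka-storage.sh format in the Kafka quickstart).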

Ref: benefits of KRaft - Confluent blog
