Data Engineering Zoomcamp Week 6 - Streaming with Kafka
Okibaba
Posted on April 8, 2024
Over the past couple of weeks I spent some time learning about Kafka in week 6 of my Data Engineering Zoomcamp.
Apache Kafka is a distributed streaming platform that has gained immense popularity in recent years due to its ability to handle large-scale, real-time data feeds. It provides a reliable and scalable solution for building streaming data pipelines and applications.
Grokking Kafka requires getting familiar with some of its key architectural abstractions.
Kafka Architecture:
Kafka follows a publish-subscribe (pub-sub) model, where producers send messages to topics and consumers read messages from those topics. The architecture consists of the following main components:
Producers:
- Producers are responsible for publishing messages to Kafka topics.
- They can choose to send messages to specific partitions within a topic.
- Producers have the ability to control the partition assignment using keys.
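Key-based partition assignment can be illustrated with a toy partitioner. This is a minimal sketch, not Kafka's actual implementation (Kafka's default partitioner uses a murmur2 hash of the key bytes); the point is that hashing the key and taking it modulo the partition count sends every message with the same key to the same partition, preserving per-key ordering:

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Toy partitioner: hash the key and map it onto a partition.
    # (Kafka's default partitioner uses murmur2; CRC32 keeps this sketch simple.)
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# The same key always maps to the same partition, so all messages
# for "rider-42" (a made-up key) stay in order relative to each other.
assert partition_for("rider-42", 3) == partition_for("rider-42", 3)
```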
Consumers:
- Consumers are the subscribers who read messages from Kafka topics.
- They are organized into consumer groups, identified by a unique consumer group ID.
- Within a group, each partition of a topic is assigned to exactly one consumer, though a single consumer may read from multiple partitions.
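How partitions get spread across a consumer group can be sketched with a toy round-robin assignment (Kafka's real assignors, such as range and cooperative-sticky, are more involved, but the invariant is the same: one consumer per partition within a group):

```python
def assign_partitions(partitions, consumers):
    # Toy round-robin assignment: each partition goes to exactly one
    # consumer in the group; a consumer may own several partitions.
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 4 partitions, 2 consumers: each consumer ends up owning 2 partitions.
assignment = assign_partitions([0, 1, 2, 3], ["c1", "c2"])
```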
Topics:
- Topics are the fundamental unit of organization in Kafka.
- They are used to categorize and store streams of records.
- Topics are partitioned, allowing multiple consumers to read from different partitions simultaneously.
Partitions:
- Topics are divided into partitions, which are the smallest storage units in Kafka.
- Each partition is an ordered, immutable sequence of records.
- Partitions enable parallel processing and horizontal scalability.
Cluster:
- Kafka runs as a cluster of one or more servers called brokers.
- The cluster is responsible for storing and managing the topics and their partitions.
- Kafka ensures fault tolerance and high availability through replication.
Kafka Configuration:
Kafka provides various configuration options to control its behavior and performance:
Replication Factor:
- The replication factor determines the number of copies of each partition across the Kafka cluster.
- It ensures fault tolerance and data durability.
- A higher replication factor provides better reliability but increases storage overhead.
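The reliability-versus-storage trade-off is simple arithmetic, sketched below: a replication factor of N means N full copies of every partition, and the data survives the loss of up to N - 1 brokers:

```python
def replication_overhead(partition_size_gb: float, replication_factor: int) -> float:
    # Total storage a partition consumes across the cluster.
    return partition_size_gb * replication_factor

def max_broker_failures(replication_factor: int) -> int:
    # With N copies, data survives the loss of N - 1 brokers holding replicas.
    return replication_factor - 1
```

For example, a 10 GB partition with replication factor 3 occupies 30 GB of cluster storage but tolerates two broker failures.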
Retention:
- Retention refers to how long Kafka retains messages within a topic.
- It can be configured based on time (e.g., retaining messages for a specific number of days) or size (e.g., retaining a certain amount of data).
- Retention policies help manage storage space and comply with data retention requirements.
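Time-based retention can be pictured as periodically dropping records older than the configured window. This is a per-record toy simulation only; real Kafka applies `retention.ms` to whole log segments, not individual records:

```python
def purge_expired(records, retention_seconds, now):
    # Toy time-based retention: keep only records inside the window.
    # Each record is a dict with a "timestamp" (seconds since epoch).
    return [r for r in records if now - r["timestamp"] <= retention_seconds]

records = [
    {"timestamp": 100, "value": "old"},
    {"timestamp": 190, "value": "recent"},
]
# With a 30-second window at now=200, only the recent record survives.
kept = purge_expired(records, retention_seconds=30, now=200)
```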
Offsets:
- Offsets represent the position of a consumer within a partition.
- Consumers keep track of the offsets to know which messages they have already processed.
- Kafka provides different offset management strategies, such as automatic offset commits or manual offset control.
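The distinction between a consumer's read position and its committed offset can be sketched with a toy tracker (in real Kafka, commits are stored broker-side in the `__consumer_offsets` topic; the class and method names here are illustrative):

```python
class OffsetTracker:
    """Toy consumer-side offset tracking with manual commits."""

    def __init__(self):
        self.position = 0   # next offset this consumer will read
        self.committed = 0  # last offset durably committed

    def poll(self, log):
        # Return everything from the current position to the end of the log.
        records = log[self.position:]
        self.position = len(log)
        return records

    def commit(self):
        # Until commit() runs, a crash would cause these records to be re-read.
        self.committed = self.position
```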
Auto Offset Reset:
- The auto offset reset configuration determines the behavior when a consumer starts reading from a topic without a committed offset.
- It can be set to "earliest" (start from the beginning) or "latest" (start from the most recent message).
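The decision a consumer makes on startup can be summarized in a few lines. This sketch mirrors the semantics of the `auto.offset.reset` setting under the assumption that a `None` committed offset means no commit exists:

```python
def starting_offset(committed, log_end_offset, auto_offset_reset="latest"):
    # Pick where a consumer starts reading when it joins.
    if committed is not None:
        return committed               # resume from the committed offset
    if auto_offset_reset == "earliest":
        return 0                       # replay the partition from the beginning
    if auto_offset_reset == "latest":
        return log_end_offset          # only read messages produced from now on
    raise ValueError(f"unsupported auto_offset_reset: {auto_offset_reset!r}")
```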
Acknowledgment (ACK):
- Acknowledgment settings control the reliability of message delivery.
- Producers can wait for acknowledgments from the Kafka brokers to ensure that messages are persisted.
- The "acks" configuration allows trade-offs between latency and durability.
Conclusion:
Apache Kafka's distributed architecture, pub-sub model, and configurable options make it a powerful tool for building scalable, fault-tolerant streaming applications. Its ability to process and analyze real-time data streams efficiently explains why it is heavily used in real-time data engineering and machine learning workflows.