Beginner Kafka tutorial: Get started with distributed systems

erineducative

Erin Schaffer

Posted on August 4, 2021

Distributed systems are collections of computers that work together and appear to end users as a single system. They allow us to scale horizontally by adding machines, and well-designed ones can handle billions of requests and be upgraded without downtime. Apache Kafka has become one of the most widely used distributed systems on the market today.

According to the official Kafka site, Apache Kafka is an “open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.” Kafka is used by most Fortune 100 companies, including big tech names like LinkedIn, Netflix, and Microsoft.

In this Apache Kafka tutorial, we’ll discuss the uses, key features, and architectural components of the distributed streaming platform. Let’s get started!

What is Kafka?

Apache Kafka is an open-source software platform written in the Scala and Java programming languages. Kafka originated at LinkedIn as a messaging system and was open-sourced in 2011; it has since grown into a popular distributed event streaming platform. The platform is capable of handling trillions of records per day.

Kafka is a distributed system composed of servers and clients that communicate through a TCP network protocol. The system allows us to read, write, store, and process events. We can think of an event as an independent piece of information that needs to be relayed from a producer to a consumer. Some relevant examples include Amazon payment transactions, iPhone location updates, and FedEx shipping orders. Kafka is primarily used for building data pipelines and implementing streaming solutions.

Kafka allows us to build apps that can constantly and accurately consume and process multiple streams at very high speeds. It works with streaming data from thousands of different data sources. With Kafka, we can:

  • process records as they occur
  • store records accurately and consistently
  • publish or subscribe to data or event streams

The Kafka publish-subscribe messaging system is extremely popular in the Big Data scene and integrates well with Apache Spark and Apache Storm.

Kafka use cases

You can use Kafka in many different ways, but here are some examples of different use cases shared on the official Kafka site:

  • Processing financial transactions in real-time
  • Tracking and monitoring transportation vehicles in real-time
  • Capturing and analyzing sensor data
  • Collecting and reacting to customer interactions
  • Monitoring hospital patients
  • Providing a foundation for data platforms, event-driven architectures, and microservices
  • Performing large-scale messaging
  • Serving as a commit-log for distributed systems
  • And much more

Key features of Kafka

Let’s take a look at some of the key features that make Kafka so popular:

  • Scalability: Kafka manages scalability in event connectors, consumers, producers, and processors.

  • Fault tolerance: Kafka is fault-tolerant. Partitions are replicated across brokers, so the cluster keeps serving data when an individual broker fails.

  • Consistency: Kafka can scale across many different servers and still maintain the ordering of your data within each partition.

  • High performance: Kafka has high throughput and low latency. It remains stable even when working with large volumes of data.

  • Extensibility: Many different applications have integrations with Kafka.

  • Replication capabilities: Kafka uses ingest pipelines and can easily replicate events.

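  • Availability: Kafka can stretch clusters over availability zones or connect different clusters across different regions. Kafka has traditionally used ZooKeeper to manage cluster metadata; newer releases can instead run in KRaft mode, which removes the ZooKeeper dependency.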

  • Connectivity: The Kafka Connect interface allows you to integrate with many different event sources such as JMS and AWS S3.

  • Community: Kafka is one of the most active projects in the Apache Software Foundation. The community holds events like the Kafka Summit by Confluent.

Components of Kafka architecture

Before we dive into some of the components of the Kafka architecture, let's take a look at some of the key concepts that will help us understand it:

Kafka Consumer Groups

Consumer groups consist of related consumers that cooperate on a task, such as processing a topic's messages and forwarding them to a service. The consumers in a group can run as multiple processes at the same time. Kafka divides the partitions of a topic among the consumers in the group, so each partition is read by exactly one consumer within the group.

Kafka Partitions

Kafka topics are divided into partitions, which are distributed and replicated across different brokers. Because a topic is split into partitions, multiple consumers can read from it simultaneously, each from its own partition.
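
To make the idea of partitions concrete, here is a minimal in-memory sketch in Python. It is not Kafka code: the Topic class, the "orders" topic, and the use of zlib.crc32 to pick a partition are all assumptions made for illustration (Kafka's own default partitioner uses a different hash function). It shows a topic as a set of append-only logs, where records with the same key land in the same partition and each record gets an offset within that partition.

```python
import zlib


class Topic:
    """A toy in-memory topic: a fixed number of append-only partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stand-in hash for illustration; Kafka's default partitioner
        # uses a different hash function.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        """Append a record; return (partition index, offset in that partition)."""
        index = self.partition_for(key)
        log = self.partitions[index]
        log.append((key, value))
        return index, len(log) - 1


topic = Topic("orders", num_partitions=3)
for key, value in [("alice", "created"), ("bob", "created"), ("alice", "paid")]:
    partition, offset = topic.append(key, value)
    print(f"key={key} value={value} -> partition {partition}, offset {offset}")
```

Because both "alice" records hash to the same partition, they keep their relative order there, which is why ordering is usually discussed per partition rather than per topic.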

Topic Replication Factor

The topic replication factor is the number of copies of each partition that Kafka keeps on different brokers. It ensures that data remains accessible and that deployment runs smoothly and efficiently. If a broker goes down, the replicas on the remaining brokers make sure we can still access our data.
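
The effect of the replication factor can be sketched with a small simulation. The code below is a simplified model, not Kafka's actual placement algorithm: the function names, the broker names, and the round-robin placement are assumptions chosen for illustration. It compares which partitions stay reachable after one broker fails, for a replication factor of 2 versus 1.

```python
def place_replicas(num_partitions, brokers, replication_factor):
    """Assign each partition to `replication_factor` distinct brokers, round-robin."""
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    placement = {}
    for partition in range(num_partitions):
        placement[partition] = [
            brokers[(partition + i) % len(brokers)]
            for i in range(replication_factor)
        ]
    return placement


def available_partitions(placement, failed_brokers):
    """Return the partitions that still have a replica on a live broker."""
    return [
        partition
        for partition, replicas in placement.items()
        if any(broker not in failed_brokers for broker in replicas)
    ]


brokers = ["broker-1", "broker-2", "broker-3"]
placement = place_replicas(num_partitions=3, brokers=brokers, replication_factor=2)
unreplicated = place_replicas(num_partitions=3, brokers=brokers, replication_factor=1)

print(available_partitions(placement, {"broker-1"}))
print(available_partitions(unreplicated, {"broker-1"}))
```

With two replicas per partition, every partition survives the loss of broker-1; with a single replica, the partition that lived only on broker-1 becomes unavailable.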

Kafka Topics

Topics help us organize our messages. We can think of them as channels that our data goes through. Kafka producers can publish messages to topics, and Kafka consumers can read messages from topics that they are subscribed to.

Now that we’ve covered some foundational concepts, we’re ready to get into the architectural components!

Kafka APIs

Kafka has four essential APIs within its architecture. Let’s take a look at them!

Kafka Producer API

The Producer API allows apps to publish streams of records to Kafka topics.

Kafka Consumer API

The Consumer API allows apps to subscribe to Kafka topics. This API also allows the app to process streams of records.

Kafka Connector API

The Connector API connects apps or data systems to topics. This API helps us build and run reusable producers and consumers, so the same connector can be shared across different solutions.

Kafka Streams API

The Streams API allows apps to process data using stream processing. This API enables apps to take in input streams from different topics and process them with a stream processor. Then, the app can produce output streams and send them out to different topics.
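
The input-process-output flow described above can be sketched in plain Python. This is not the Kafka Streams API, which is a Java library; the process_stream function, the payment records, and the threshold are hypothetical names and data invented for this example. The processor reads records from an input stream one at a time and emits flagged records that an app could write to an output topic.

```python
def process_stream(input_records, threshold):
    """Read records one at a time and yield flagged records for an output topic."""
    for record in input_records:
        if record["amount"] >= threshold:
            yield {"user": record["user"], "amount": record["amount"], "flag": "large"}


# Hypothetical records standing in for an input topic of payments.
payments = [
    {"user": "alice", "amount": 40},
    {"user": "bob", "amount": 1200},
    {"user": "carol", "amount": 5000},
]

# The result stands in for an output topic of large payments.
large_payments = list(process_stream(payments, threshold=1000))
for record in large_payments:
    print(record)
```

A generator is used here so that records are handled one by one as they arrive, which mirrors how a stream processor works on unbounded input rather than on a finished batch.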

Kafka Brokers

A single Kafka server is called a broker. Typically, multiple brokers operate as one Kafka cluster. The cluster is controlled by one of the brokers, called the controller. The controller is responsible for administrative actions like assigning partitions to other brokers and monitoring for failures and downtime.

Partitions can be assigned to multiple brokers. If this happens, the partition is replicated. This creates redundancy in case one of the brokers fails. A broker is responsible for receiving messages from producers and committing them to disk. Brokers also receive requests from consumers and respond with messages taken from partitions.
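
In Kafka, one replica of each partition acts as the leader that handles reads and writes, and leadership moves to another replica when the leader's broker fails. The sketch below is a simplified model of that failover, not the controller's real logic: the elect_leaders function and the broker IDs are assumptions, and real Kafka chooses among replicas that are in sync with the old leader rather than simply taking the first live one.

```python
def elect_leaders(replicas_by_partition, live_brokers):
    """Pick the first live replica of each partition as leader (None if none is live)."""
    leaders = {}
    for partition, replicas in replicas_by_partition.items():
        live = [broker for broker in replicas if broker in live_brokers]
        leaders[partition] = live[0] if live else None
    return leaders


# Partition -> broker IDs holding a replica (hypothetical cluster layout).
replicas_by_partition = {0: [1, 2], 1: [2, 3], 2: [3, 1]}

before = elect_leaders(replicas_by_partition, live_brokers={1, 2, 3})
after = elect_leaders(replicas_by_partition, live_brokers={2, 3})

print("all brokers up:", before)
print("broker 1 down: ", after)
```

Only partition 0 changes leader when broker 1 goes down, because it was the only partition that broker was leading.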

[Figure: a broker hosting several topic partitions]

Kafka Consumers

Consumers receive messages from Kafka topics. They subscribe to topics, then receive messages that producers write to a topic. Normally, each consumer belongs to a consumer group. In a consumer group, multiple consumers work together to read messages from a topic.

Let’s take a look at some of the different configurations for consumers and partitions in a topic:

Number of consumers and partitions in a topic are equal

In this scenario, each consumer reads from one partition.

Number of partitions in a topic is greater than the number of consumers in a group

In this scenario, some or all of the consumers read from more than one partition.

Single consumer with multiple partitions

In this scenario, all partitions are consumed by a single consumer.

Number of partitions in a topic is less than the number of consumers in a group

In this scenario, some of the consumers will be idle.
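
The four scenarios above can be reproduced with a short simulation. This is a sketch under stated assumptions: assign_partitions and the consumer names are hypothetical, and the simple round-robin strategy shown here is only one possible way to divide partitions, not necessarily the assignment strategy a given Kafka consumer is configured with.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over consumers round-robin; each partition gets one consumer."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3]
scenarios = [
    ["c1", "c2", "c3", "c4"],        # consumers equal to partitions
    ["c1", "c2"],                    # more partitions than consumers
    ["c1"],                          # a single consumer
    ["c1", "c2", "c3", "c4", "c5"],  # more consumers than partitions
]
for consumers in scenarios:
    print(len(consumers), "consumer(s):", assign_partitions(partitions, consumers))
```

In the last scenario the fifth consumer receives an empty list, which is the idle consumer described above.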

Kafka Producers

Producers write messages to Kafka topics, where consumers can read them.

Advanced concepts to explore next

Congrats on taking your first steps with Apache Kafka! Kafka is an efficient and powerful distributed system. Kafka's scaling capabilities allow it to handle large workloads. It's often the preferred choice over other message queues for real-time data pipelines. Overall, it's a versatile platform that can support many use cases. You're now ready to move on to some more advanced Kafka topics such as:

  • Producer serialization
  • Consumer configurations
  • Partition allocation

To get started learning these topics and a lot more, check out Educative's curated course Building Scalable Data Pipelines with Kafka. In this course, we'll introduce you to Kafka theory and provide you with a hands-on, interactive browser terminal to execute Kafka commands against a running Kafka broker. You'll learn more about the concepts we covered in this article, along with other important topics.

By the end, you'll have a stronger understanding of how to build scalable data pipelines with Apache Kafka.

Happy learning!
