Apache Kafka: A brief beginner's guide part 1
Sumit Negi
Posted on February 13, 2024
Kafka: An Open-Source Distributed Event Streaming Platform
Kafka is an open-source distributed event streaming platform that lets you publish, subscribe to, store, and process streams of records in real time. It serves as a highly scalable, fault-tolerant, and durable messaging system, commonly used for building real-time data pipelines, streaming analytics, and event-driven architectures.
Core Concepts of Kafka
At its core, Kafka is a distributed publish-subscribe messaging system. Data is written to Kafka topics by producers and consumed from those topics by consumers. Kafka topics can be partitioned, enabling parallel data processing, and can be replicated across multiple brokers for fault tolerance.
Understanding Events and Event Streaming in Kafka
Apache Kafka serves as a distributed event streaming platform used for building real-time data pipelines and streaming applications. Events and event streaming are fundamental concepts in Kafka:
Events
In Kafka, an event represents a piece of data, often in the form of a message, produced by a publisher (producer) and consumed by one or more subscribers (consumers). Events are typically small units of data, such as log entries or transactions. They are immutable, meaning once produced, they cannot be modified. Instead, new events can be produced to reflect updates or changes.
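For instance, an order placed in an online shop could be captured as an event. Below is a minimal sketch of how such an event might be shaped as a kafkajs message; the key, payload fields, and values are purely illustrative:

// Hypothetical "order placed" event, shaped as a kafkajs message
const orderPlacedEvent = {
  key: 'order-1042',            // optional key, used to pick a partition
  value: JSON.stringify({       // the payload itself, sent as a string/bytes
    orderId: 1042,
    amount: 49.99,
    placedAt: new Date().toISOString(),
  }),
}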
Event Streaming
Event streaming involves continuously capturing, processing, and distributing events in real time. Kafka provides a distributed, fault-tolerant, and scalable infrastructure for event streaming. It allows applications to publish events to topics and consume events from topics asynchronously, facilitating use cases such as real-time analytics, log aggregation, and messaging systems.
Key Concepts Related to Events and Event Streaming in Kafka
Brokers
Brokers are servers in the Kafka storage layer that store event streams from one or more sources. A Kafka cluster typically consists of several brokers. Any broker can act as a bootstrap server: a client that connects to one broker can discover and reach every other broker in the cluster.
Producers
Producers are applications that generate events and publish them to Kafka topics. They determine the topics to which events will be sent and how events are partitioned among different partitions within a topic.
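As a rough sketch of how a producer can influence partitioning, kafkajs lets you attach a key to each message, and messages with the same key are routed to the same partition. The topic and key below are illustrative, and the client comes from the kafkaClient.js file created later in this post:

import { kafka } from "./kafkaClient.js"

const producer = kafka.producer()
await producer.connect()

// Messages that share a key end up in the same partition,
// so all events for user-42 keep their relative order.
await producer.send({
  topic: 'user-events',
  messages: [
    { key: 'user-42', value: 'logged-in' },
    { key: 'user-42', value: 'viewed-cart' },
  ],
})

await producer.disconnect()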
Consumers
Consumers retrieve events from Kafka topics. Each consumer maintains its position within a topic using metadata called the offset. Consumers can read events sequentially or move to specific offsets to reprocess past data.
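As a hedged sketch of jumping to a specific offset with kafkajs (the group ID and offset are illustrative, and the client again comes from the kafkaClient.js file shown later):

import { kafka } from "./kafkaClient.js"

const consumer = kafka.consumer({ groupId: 'replay-group' })
await consumer.connect()
await consumer.subscribe({ topic: 'test-topic' })

await consumer.run({
  eachMessage: async ({ message }) => console.log(message.value.toString()),
})

// Rewind partition 0 to offset 0 to reprocess past events.
// In kafkajs, seek must be called after consumer.run() has started.
consumer.seek({ topic: 'test-topic', partition: 0, offset: '0' })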
Topics
Topics are categories or channels to which events are published. Each topic consists of partitions, allowing parallel processing and scalability. Events within topics are immutable and organized for efficient storage and retrieval.
Partitions
Partitions are segments of a topic where events are stored in an ordered sequence. Each partition can be replicated across multiple Kafka brokers for fault tolerance, depending on the topic's replication factor.
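As a small sketch, the kafkajs admin client can create a topic with an explicit partition count and replication factor (the values below are illustrative; the single-broker setup installed later in this post can only support a replication factor of 1):

import { kafka } from "./kafkaClient.js"

const admin = kafka.admin()
await admin.connect()

// Create a topic with 3 partitions, each stored on a single broker
await admin.createTopics({
  topics: [
    { topic: 'test-topic', numPartitions: 3, replicationFactor: 1 },
  ],
})

await admin.disconnect()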
Offsets
Offsets are unique identifiers assigned to each event within a partition. They track the position of consumers within a partition.
Consumer Groups
Consumer groups collectively consume events from one or more topics. Within a group, each partition is read by only one consumer at a time, and the group tracks its offset for each partition it consumes from, so adding consumers to a group spreads partitions across them and enables parallel event processing.
Replication
Replication ensures data redundancy and fault tolerance. Kafka replicates topics across brokers, ensuring multiple copies of data for resilience.
Installation Steps
- Install Docker and Docker Compose.
- Clone the kafka-stack-docker-compose repository: git clone https://github.com/conduktor/kafka-stack-docker-compose.git
- cd kafka-stack-docker-compose
- Start a single-node Zookeeper and Kafka setup: docker-compose -f zk-single-kafka-single.yml up -d
- Check that both services (Zookeeper and Kafka) are running: docker-compose -f zk-single-kafka-single.yml ps
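Optionally, you can confirm that the broker is reachable by listing topics from inside the Kafka container. The container name kafka1 is an assumption based on that compose file; adjust it if yours differs:

docker exec -it kafka1 kafka-topics --bootstrap-server localhost:9092 --list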
Node.js Coding
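The examples below assume a small Node.js project with kafkajs installed and ES modules enabled, since the import syntax and top-level await used here require "type": "module" in package.json (or .mjs files). One way to set that up:

npm init -y
npm install kafkajs
npm pkg set type=module   # enables import / top-level await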
Create a kafkaClient.js file:
import { Kafka } from "kafkajs"

export const kafka = new Kafka({
  clientId: 'my-app',
  brokers: ['localhost:9092'],
})
This code initializes a Kafka client using the kafkajs library, specifying a client ID and broker address for connecting to a Kafka cluster.
producer.js:
import { kafka } from "./kafkaClient.js"
import { Partitioners } from "kafkajs"

// Producer using the legacy partitioner (the default before kafkajs 2.0)
const producer = kafka.producer({
  createPartitioner: Partitioners.LegacyPartitioner,
})

await producer.connect()

// Publish a timestamped message to 'test-topic' every 3 seconds
setInterval(async () => {
  await producer.send({
    topic: 'test-topic',
    messages: [
      { value: `Hello, what's up ${new Date()}` },
    ],
  })
}, 3000)
This code imports the shared Kafka client, sets up a producer with the legacy partitioner, connects it to Kafka, and then sends a timestamped message to the 'test-topic' topic every 3 seconds.
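With the Docker setup from above running, the producer can be started directly. If the broker has topic auto-creation enabled (the default in most setups), 'test-topic' is created on the first send; otherwise create the topic first:

node producer.js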
consumer.js:
import { kafka } from "./kafkaClient.js"

// Consumers that share a groupId split the topic's partitions between them
const consumer = kafka.consumer({ groupId: 'test-group' })

await consumer.connect()
await consumer.subscribe({ topic: 'test-topic', fromBeginning: true })

// Log the value of every message received from 'test-topic'
await consumer.run({
  eachMessage: async ({ topic, partition, message }) => {
    console.log({
      value: message.value.toString(),
    })
  },
})
This code imports the shared Kafka client, sets up a consumer with a group ID, connects it to Kafka, subscribes to the 'test-topic' topic from the beginning, and then starts the consumer, which handles each incoming message by logging its value to the console.
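Running the two scripts in separate terminals should make the consumer print a new line every few seconds; the output below is only a rough sketch of what to expect, and the timestamps will of course differ:

# terminal 1
node producer.js

# terminal 2
node consumer.js
# { value: "Hello, what's up Tue Feb 13 2024 ..." }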
Thank you for reading. Have a wonderful day!