Learning Kafka Part Four (II): Kafka Streams

Muhammad Ibrahim Anis

Posted on January 12, 2023

While moving streaming data from a source system to a target system is important, sometimes we might want to process, transform or react to the data as soon as it becomes available. This is referred to as stream processing. We can of course use a dedicated stream processing platform like Apache Storm or Apache Flink, but remember the problem with Zookeeper? Right, we have to set up, integrate and maintain two different systems. What we need is a way to process the data in Kafka itself without added overhead or complexity. Enter Kafka Streams.

Kafka Streams

Kafka Streams is an easy, lightweight, yet powerful data processing Java library within Kafka. With Kafka Streams, we don't just move data to and from Kafka but process that data in real time.

Kafka Streams functions as a mix of a producer and a consumer: we read data from a topic, process or enrich the data, then write the processed data back to Kafka (i.e., to make the data available to downstream consumers).

Image of kafka streams

Features

  • Small and lightweight
  • Already integrated with Kafka
  • Applications built with it are just normal Java applications

How it Works

Topology

A Kafka Streams application is structured as a DAG (Directed Acyclic Graph), called a topology.

Image of a DAG

A Kafka Streams application can be made up of one or more topologies.
A topology defines the computational logic to be applied to a stream of data. A node in a topology represents a single processor and the edges represent streams of data.

There are three types of processors in a topology (a code sketch follows the list):

  • Source processor: a processor that does not have any upstream processor. It produces an input stream for its topology by consuming messages from one or more topics and passing them to downstream processors.
  • Stream processor: represents a processing step that transforms the data in the stream. It receives events from its upstream processor(s), which can be either source or stream processors, applies its transformation, and passes the result on to its downstream processor, which can be another stream processor or a sink processor.
  • Sink processor: another special type of processor that does not have a downstream processor. It sends events received from upstream processors to a specified Kafka topic.
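
To make these processor types concrete, here is a rough sketch using Kafka Streams' lower-level Processor API (covered later in this post). The topic names (orders, processed-orders), the processor names and the uppercase transformation are made up purely for illustration; serdes are assumed to come from the application's default configuration.

```java
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;

public class OrderTopologyExample {

    static Topology buildTopology() {
        Topology topology = new Topology();

        // Source processor: consumes from the "orders" topic and feeds the topology
        topology.addSource("OrderSource", "orders");

        // Stream processor: transforms each record and forwards it downstream
        topology.addProcessor("UppercaseOrder", () -> new Processor<String, String, String, String>() {
            private ProcessorContext<String, String> context;

            @Override
            public void init(ProcessorContext<String, String> context) {
                this.context = context;
            }

            @Override
            public void process(Record<String, String> record) {
                context.forward(record.withValue(record.value().toUpperCase()));
            }
        }, "OrderSource");

        // Sink processor: writes whatever it receives to the "processed-orders" topic
        topology.addSink("ProcessedOrders", "processed-orders", "UppercaseOrder");
        return topology;
    }
}
```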

Image of a topology

Scenario:
We have a producer that sends the details of orders placed by customers to a Kafka topic. We process these orders to check whether they are valid or not. Our topology might look like this:

Image of order topology

Note:

  • Order placed is a source processor; its job is to read stream data from a Kafka topic.
  • Is valid is a stream processor, responsible for checking whether the order is valid or not.
  • Valid is a sink processor that writes all the valid orders to a Kafka topic.
  • Not Valid is also a sink processor that writes the error responses back to a different Kafka topic, where a consumer will read and forward the error messages to customers.

Sub-topology

Kafka Streams also has the concept of a sub-topology.
Scenario continued:
The above topology reads from a single topic, processes the data and, based on a condition, writes back to two different topics. But what if we want to build a topology that further reads from the valid orders topic and applies additional business logic to the data, like checking whether the items ordered are available? Our topology might now look like this:

Image of order subtopology

Note:

  • Valid, which is a sink processor for the topology, is now the source processor for our sub-topology.
  • Items Available is a stream processor that checks whether the items are available.
  • Yes and No are both sink processors that write to different Kafka topics.

In Kafka Streams, records are processed one at a time, in the order they arrive. Only a single record is routed through a topology at a time; a record passes through every processor in the topology before the next one is let in.

Image of routed data

Kafka Streams provides two ways to define a stream processing topology:

  • Kafka Streams Processor API: this is the lower-level API that provides us with flexibility and fine control over how we define our processing topology, but comes with more manual coding.
  • Kafka Streams DSL (Domain Specific Language) API: this is built on top of the Streams Processor API and provides more abstraction. It is the recommended API for most users because most data processing use cases can be expressed with it. It is very similar to the Java Streams API; we typically use built-in operations like map, filter and mapValues to process data. A minimal sketch follows this list.
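
As a minimal sketch of the DSL, here is roughly how the order-validation topology from the earlier scenario could be expressed. The topic names (orders, valid-orders, invalid-orders), the application id and the isValid check are placeholders, not names taken from the figures above:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class OrderValidationApp {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-validation-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Source processor: read the stream of orders
        KStream<String, String> orders = builder.stream("orders");

        // Stream processors: route each order based on a validity check
        orders.filter((orderId, order) -> isValid(order)).to("valid-orders");       // sink: valid orders
        orders.filterNot((orderId, order) -> isValid(order)).to("invalid-orders");  // sink: error responses

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }

    // Placeholder for whatever business rule decides validity
    private static boolean isValid(String order) {
        return order != null && !order.isEmpty();
    }
}
```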

Stream Processing Concepts

When talking about Kafka Streams, or any other stream processing framework, there are some important concepts we need to be aware of, like stateless processing, stateful processing, time and windowing.

Stateless processing

In stateless processing, each event is processed independently of other events. The streams application requires no memory of previous events; it only needs to look at the current event and perform an operation. An example is the filter operation: it requires no memory of previous events, it checks the current event and determines whether it should be passed on to the next processor or not. If our topology is made up of only stateless processors, then our streams application is considered a stateless application.
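
A small DSL sketch of stateless operations, assuming the same StreamsBuilder setup and imports as in the previous example and a hypothetical orders topic with string keys and values:

```java
// Stateless: each record is handled on its own; no state store is needed.
KStream<String, String> orders = builder.stream("orders");
orders
    .filter((key, value) -> value != null)    // look at the current record only
    .mapValues(value -> value.toUpperCase())  // transform each value independently
    .to("clean-orders");
```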

Stateful processing

Stateful processing, on the other hand, requires memory of previous events. One or more steps in our topology need to remember information (state) about previously seen events. An example is the count operation; it needs to keep track of how many events it has already seen. If our topology has even one stateful processor, regardless of the number of stateless processors, the application is considered a stateful application.
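
A small DSL sketch of a stateful operation, again assuming the builder and serde setup from the earlier example (KTable and Produced come from org.apache.kafka.streams.kstream); the topic names are hypothetical:

```java
// Stateful: count() must remember how many records it has seen per key,
// so Kafka Streams backs it with a local state store (and a changelog topic).
KTable<String, Long> orderCounts = builder
        .<String, String>stream("orders")   // hypothetical topic, keyed by customer id
        .groupByKey()
        .count();

// Publish the running counts for downstream consumers
orderCounts.toStream()
        .to("order-counts", Produced.with(Serdes.String(), Serdes.Long()));
```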

Windowing

We can never hope to get a global view of a data stream, as it is theoretically endless. Windowing allows us to slice the endless stream of events into chunks or segments called windows for processing. Slicing can either be time-based (events from the last five minutes, two hours or three days) or count-based (the last one hundred or five thousand events).

Windowing strategy
There are four types of windowing strategies in Kafka Streams:

Tumbling Window
A tumbling window is a fixed, non-overlapping and gapless window characterized by its size, usually a time interval. We process the stream data after each interval. For example, if we configure a tumbling window with a size of five minutes, all the events within the same five-minute window are grouped together to be processed. Each event belongs to one and only one window.
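
A sketch of a tumbling-window count, reusing the hypothetical orders stream from earlier (TimeWindows and Windowed come from org.apache.kafka.streams.kstream, Duration from java.time; ofSizeWithNoGrace is the Kafka Streams 3.x builder, older releases use TimeWindows.of):

```java
// Tumbling window: fixed, non-overlapping five-minute windows;
// every record falls into exactly one window.
KTable<Windowed<String>, Long> ordersPerFiveMinutes = builder
        .<String, String>stream("orders")
        .groupByKey()
        .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
        .count();
```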

Hopping window
A hopping window is a type of tumbling window with an advance interval added. The advance interval determines how much the window moves forward (i.e., hops) relative to its previous position. This causes overlaps between windows, and events may belong to more than one window.
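
The same idea as the tumbling sketch above, but with a hopping window; only the window definition changes and it is passed to the same windowedBy(...) call:

```java
// Hopping window: five-minute windows that advance (hop) every minute,
// so windows overlap and a single record can belong to several windows.
TimeWindows hoppingWindow = TimeWindows
        .ofSizeWithNoGrace(Duration.ofMinutes(5))
        .advanceBy(Duration.ofMinutes(1));
```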

Sliding window
A sliding window is another time-based window, but it is aligned to the records themselves rather than to fixed points in time: when two events fall within a predetermined time difference of each other, they are included in the same window.
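
A sliding-window definition, again assuming the Kafka Streams 3.x API (SlidingWindows comes from org.apache.kafka.streams.kstream) and passed to windowedBy(...) in the same way:

```java
// Sliding window: two records fall into the same window when their timestamps
// are within five minutes of each other; windows are aligned to the records themselves.
SlidingWindows slidingWindow = SlidingWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5));
```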

Session window
A session window is defined by a period of activity separated by a gap of inactivity, and is used to group events by their key. A session window typically has a timeout, the maximum duration a session stays open: if no events arrive for the key within this duration, the session is closed and processed. The next time an event arrives for that key, a new session is opened. Session windows differ from the previous types because they aren't defined by fixed time boundaries, but by activity. This can be useful for behavioural analysis, like when we want to keep track of a particular user's activity.
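
A session-window sketch; the user-activity topic and the five-minute inactivity gap are made-up values (SessionWindows comes from org.apache.kafka.streams.kstream):

```java
// Session window: a session for a key stays open while events keep arriving;
// five minutes of inactivity closes it, and the next event opens a new session.
KTable<Windowed<String>, Long> eventsPerSession = builder
        .<String, String>stream("user-activity")   // hypothetical topic keyed by user id
        .groupByKey()
        .windowedBy(SessionWindows.ofInactivityGapWithNoGrace(Duration.ofMinutes(5)))
        .count();
```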

Time

Time is a very important factor in Kafka Streams. During processing, each event needs to be associated with a timestamp. Remember when we said slicing can be time-based? Which time did we mean? There are three different notions of time in stream processing:

  • Event time: the time when an event happened at the source. Usually embedded in the event's payload (see the extractor sketch after this list).
  • Ingestion time: when an event is added to a topic on a broker. This always occurs after event time.
  • Processing time: when our Kafka Streams application processes the event. Always happens after event time and ingestion time.
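
By default, Kafka Streams uses the timestamp stored in the record itself, which is event time or ingestion time depending on the broker/topic configuration. To pull event time out of the payload instead, we can plug in a custom TimestampExtractor. A rough sketch, assuming (purely for illustration) that the record value is an epoch-millis string; a real payload would typically be parsed, for example from JSON:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

// Custom extractor that reads the event time embedded in the record value.
public class OrderTimestampExtractor implements TimestampExtractor {

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        try {
            // Hypothetical payload: the value itself is an epoch-millis string
            return Long.parseLong((String) record.value());
        } catch (Exception e) {
            // Fall back to the highest timestamp seen so far on this partition
            return partitionTime;
        }
    }
}
```

It is registered through the streams configuration, e.g. `props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, OrderTimestampExtractor.class);`.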

That’s it. The end of this segment on Kafka Streams.
Up next, connecting Kafka to external systems with Kafka Connect.
