Can Apache Kafka Applications Work Without an Apache Kafka Cluster?
Lena Hall
Posted on April 3, 2020
If you found this article interesting, follow @lenadroid on Twitter. If you prefer video, take a look at this video about the topic.
Apache Kafka and Azure Event Hubs are two different systems for managing events that share the same goal: to provide a distributed, reliable, fault-tolerant, persistent, scalable, and fast system for managing events, decoupling event publishers from subscribers and making it easier to build event-driven architectures.
Many projects already rely on Apache Kafka for event ingestion because it has the richest ecosystem around it: many contributors and a wide variety of open-source libraries, connectors, and projects.
Apache Kafka can run anywhere - in the cloud and on-premises. For example, one can run it on Azure using HDInsight for Apache Kafka, or deploy it on standard VMs.
One of the things we always have to keep in mind is that there is infrastructure behind Apache Kafka that we have to maintain. Apache Kafka assumes a cluster of broker VMs, which we need to manage. Sometimes we want to spend the least amount of time managing the infrastructure but still have a reliable backend for event ingestion. This is exactly why someone might want to take a look at using Event Hubs for Apache Kafka ecosystems: you can keep using your existing Apache Kafka applications unchanged and rely on Azure Event Hubs as the backend for your event ingestion just by swapping the connection information. This lets you keep using Apache Kafka connectors and libraries for hundreds of projects while delegating the complexity to Azure Event Hubs behind the scenes, so you can focus on code instead of maintaining infrastructure.
I'm confused: how can I use Apache Kafka and Event Hubs together? What is Event Hubs for Apache Kafka, and what does it mean?
There are three parts we need to think about:
What is the system we work with on the backend - the one that collects events from producers and distributes them to subscribers? This could be Apache Kafka installed on plain VMs in your data center, Apache Kafka running in the cloud, or Event Hubs - a managed service in Azure.
What is the client application that works with the backend event-ingestion system? This can be an event producer, an event consumer, or a command-line application that connects to the backend event system.
How does the client application talk to the backend event system? When the backend system is Apache Kafka, clients talk to it using the Apache Kafka API. When we decide to use Event Hubs, clients can talk to it using the standard Event Hubs API or using the Apache Kafka API (thanks to the Event Hubs for Kafka ecosystems feature), as the sketch below shows.
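To make the third option concrete, here is a minimal sketch of a plain Kafka client talking to Event Hubs through the Kafka API. It assumes a namespace called EVENTHUBSNAME with a testtopic hub, and uses only standard kafka-clients configuration keys; all values are placeholders:
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
val props = new Properties()
// EVENT HUBS EXPOSES ITS KAFKA ENDPOINT ON PORT 9093
props.put("bootstrap.servers", "EVENTHUBSNAME.servicebus.windows.net:9093")
// EVENT HUBS AUTHENTICATES KAFKA CLIENTS WITH SASL PLAIN OVER TLS;
// THE LITERAL USERNAME $ConnectionString TELLS IT TO TREAT THE PASSWORD AS A CONNECTION STRING
props.put("security.protocol", "SASL_SSL")
props.put("sasl.mechanism", "PLAIN")
props.put("sasl.jaas.config",
  "org.apache.kafka.common.security.plain.PlainLoginModule required " +
  "username=\"$ConnectionString\" password=\"<YOUR EVENT HUBS CONNECTION STRING>\";")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("testtopic", "sensor-1", "42"))
producer.close()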
This way, if you are already working with Apache Kafka, it is easy to simplify the management of your event infrastructure: as described above, swap the connection information and Event Hubs becomes your event-ingestion backend, while your applications, connectors, and libraries stay unchanged.
A Real-World Example
Using Apache Kafka for event streaming scenarios is a very common use case. Frequently it is used together with Apache Spark - a distributed computing and data processing system.
There are many advantages to using Apache Kafka. One of them is the availability of so many useful libraries and connectors that let you send and receive events to and from a wide variety of sources. The Kafka ecosystem is incredibly rich and the community is very active.
Let’s take a look at a common architecture with Apache Kafka.
Apache Kafka acts as the data ingestion component that receives data from some data producer - for example, data sent from sensors or other applications.
Apache Spark is a data processing system that receives the data and applies processing logic to it in real time.
There is nothing wrong with this architecture, and it’s very common. However, it can get complicated to run and manage your own Kafka clusters. Managing a Kafka cluster can become a full-time job. To make sure your Kafka cluster operates correctly and performs well, you usually have to tune and configure the virtual machines Kafka runs on, called brokers. When the cluster is scaled up or new topics are added, you need to perform partition rebalancing. There are many similar things you’ll need to take care of.
Can We Simplify This?
One of the things you can do to simplify your architecture is to use a managed service that eliminates the need for cluster maintenance. Event Hubs is a completely managed service in Azure that can ingest millions of events per second and costs about 3 cents an hour. Its goal is very similar to Apache Kafka’s, though there are some differences in how the two work. For example, with Event Hubs you can use the Auto-Inflate feature to automatically adjust throughput to workload spikes, among many other useful features.
In most cases, nobody wants to rewrite code just to move to another service, and this is exactly the point of Event Hubs: because the Event Hubs protocol is binary-compatible with Apache Kafka’s, the code you already wrote against Apache Kafka will work with Event Hubs as well. This means you can still use your favorite Apache Kafka libraries, such as the Spark-Kafka connector, with Event Hubs as the backend for event ingestion - and never think about cluster management again.
To start using Event Hubs with your existing Apache Kafka logic, all you need to do is change the configuration to point to Event Hubs instead of Kafka. Where with Apache Kafka we connect via bootstrap servers, with Event Hubs we use the public endpoint URL and a connection string.
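Conceptually, the diff is tiny. Here is a sketch of the changed options in the Spark connector configuration (the addresses and names are placeholders; the full code appears later in this article):
// BEFORE: CONNECTING TO KAFKA BROKERS DIRECTLY
.option("kafka.bootstrap.servers", "172.16.0.4:9092,172.16.0.5:9092")
// AFTER: CONNECTING TO THE EVENT HUBS KAFKA ENDPOINT WITH SASL AUTHENTICATION
.option("kafka.bootstrap.servers", "EVENTHUBSNAME.servicebus.windows.net:9093")
.option("kafka.security.protocol", "SASL_SSL")
.option("kafka.sasl.mechanism", "PLAIN")
.option("kafka.sasl.jaas.config", EH_SASL)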
As a result, we don’t change any producer or consumer logic, and we don’t change any libraries. With only a change to the connection configuration, we get a seamless migration to a completely managed service for event ingestion, and it works with many Apache Kafka clients, libraries, and existing applications.
Can You Show Us The Code?
Why, yes. Let’s take a look at how this can be done in action!
Let's say a company is already using Apache Kafka on HDInsight for their events, together with Apache Spark to process them.
They have an HDInsight Kafka cluster and an Azure Databricks workspace. The Spark cluster lives in the same virtual network as the Kafka cluster (using the VNet Injection feature) and has a Spark-Kafka connector library attached.
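As an aside, the connector is typically attached to the cluster as a Maven library; the artifact is spark-sql-kafka-0-10, with the Scala and Spark versions matching your runtime (the versions below are illustrative):
org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5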
Let's generate the data for this use case. We can think of it as sensor data with a timestamp and some numerical indicator, which we generate using the rate stream source in Spark and send to Kafka using the Spark-Kafka connector. Let's run it in an Azure Databricks notebook:
// MESSAGE PRODUCER LOGIC
val rates =
  spark
    .readStream
    .format("rate")
    .option("rowsPerSecond", 10)
    .load()

// MESSAGE PRODUCER CONFIGURATION
// CAN BE READ PURELY FROM CONFIGURATION FILES
// REPLACE TOPIC NAME AND BOOTSTRAP SERVERS WITH CORRECT VALUES
val TOPIC = "testtopic"
val BOOTSTRAP_SERVERS = "172.16.0.4:9092,172.16.0.6:9092,172.16.0.5:9092"

rates
  .select($"timestamp".alias("key"), $"value")
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("topic", TOPIC)
  .option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS)
  .option("checkpointLocation", "/ratecheckpoint")
  .start()
The real-time event stream of sensor data is consumed by a separate notebook:
// CONSUMER CONFIGURATION
// REPLACE TOPIC NAME AND BOOTSTRAP SERVERS WITH CORRECT VALUES
val TOPIC = "testtopic"
val BOOTSTRAP_SERVERS = "172.16.0.4:9092,172.16.0.6:9092,172.16.0.5:9092"

val rates = spark.readStream
  .format("kafka")
  .option("subscribe", TOPIC)
  .option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS)
  .option("kafka.request.timeout.ms", "60000")
  .option("kafka.session.timeout.ms", "60000")
  .option("failOnDataLoss", "false")
  .option("startingOffsets", "latest")
  .load()

// PROCESSING LOGIC
rates
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .outputMode("append")
  .format("console")
  .option("truncate", false)
  .start()
  .awaitTermination()
We use bootstrap servers to connect to the Apache Kafka cluster brokers in both the producer and the consumer, and work with the testtopic topic. For instructions on creating a topic in HDInsight Kafka and getting the Kafka broker addresses, take a look at this document.
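If you'd rather create the topic programmatically than with the shell scripts described in that document, here is a hedged sketch using Kafka's AdminClient; the broker addresses, partition count, and replication factor are illustrative:
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, NewTopic}
val props = new Properties()
props.put("bootstrap.servers", "172.16.0.4:9092,172.16.0.6:9092,172.16.0.5:9092")
val admin = AdminClient.create(props)
// CREATE testtopic WITH 8 PARTITIONS AND A REPLICATION FACTOR OF 3
admin.createTopics(Collections.singletonList(new NewTopic("testtopic", 8, 3.toShort))).all().get()
admin.close()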
Swapping the Apache Kafka backend for Event Hubs while leaving the code and libraries as-is
Now we want to start using Event Hubs, so we create a new Event Hubs namespace with the Apache Kafka feature enabled, and add a new testtopic hub.
To make the same code work with the new event backend, we only need to change the connection configuration in both the producer and the consumer. Instead of using IP addresses for the brokers in bootstrap servers, we use the Event Hubs endpoint, and we also specify the Event Hubs connection string. Because Event Hubs is a managed service, there is no cluster we need to manage; the namespace (the counterpart of the cluster in Apache Kafka terms) is just a container for topics. Scalability is managed using throughput units (1 TU = 1 MB/sec of ingress, or 1,000 events/sec - so, for example, a workload ingesting 5,000 one-kilobyte events per second needs about 5 TUs) and can be adjusted automatically according to workload spikes.
Producer code:
// UNCHANGED MESSAGE PRODUCER LOGIC
val rates =
  spark
    .readStream
    .format("rate")
    .option("rowsPerSecond", 10)
    .load()

// THE KAFKA CLIENT CLASSES ARE SHADED ON DATABRICKS, HENCE THE kafkashaded PREFIX
import kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule

val TOPIC = "testtopic"
// NEW VALUE, REPLACE EVENTHUBSNAME WITH YOUR OWN VALUE
val BOOTSTRAP_SERVERS = "EVENTHUBSNAME.servicebus.windows.net:9093"
// NEW VALUE, REPLACE EVENTHUBSNAME, SECRETKEYNAME, SECRETKEYVALUE WITH YOUR OWN VALUES
// NOTE: ON DATABRICKS RUNTIMES THAT SHADE THE KAFKA CLIENT, THE LOGIN MODULE CLASS
// IN THIS STRING MAY ALSO NEED THE kafkashaded. PREFIX
val EH_SASL = "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"$ConnectionString\" password=\"Endpoint=sb://EVENTHUBSNAME.servicebus.windows.net/;SharedAccessKeyName=SECRETKEYNAME;SharedAccessKey=SECRETKEYVALUE\";"

rates
  .select($"timestamp".alias("key"), $"value")
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("topic", TOPIC)
  .option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS)
  .option("kafka.sasl.mechanism", "PLAIN")
  .option("kafka.security.protocol", "SASL_SSL")
  .option("kafka.sasl.jaas.config", EH_SASL)
  .option("checkpointLocation", "/ratecheckpoint")
  .start()
Consumer code:
// CONSUMER CONFIGURATION
import org.apache.kafka.common.security.plain.PlainLoginModule

val TOPIC = "testtopic"
// NEW VALUE, REPLACE EVENTHUBSNAME WITH YOUR OWN VALUE
val BOOTSTRAP_SERVERS = "EVENTHUBSNAME.servicebus.windows.net:9093"
// NEW VALUE, REPLACE EVENTHUBSNAME, SECRETKEYNAME, SECRETKEYVALUE WITH YOUR OWN VALUES
val EH_SASL = "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"$ConnectionString\" password=\"Endpoint=sb://EVENTHUBSNAME.servicebus.windows.net/;SharedAccessKeyName=SECRETKEYNAME;SharedAccessKey=SECRETKEYVALUE\";"

// READ STREAM USING SPARK'S KAFKA CONNECTOR
val rates = spark.readStream
  .format("kafka")
  .option("subscribe", TOPIC)
  .option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS)
  .option("kafka.sasl.mechanism", "PLAIN")
  .option("kafka.security.protocol", "SASL_SSL")
  .option("kafka.sasl.jaas.config", EH_SASL)
  .option("kafka.request.timeout.ms", "60000")
  .option("kafka.session.timeout.ms", "60000")
  .option("failOnDataLoss", "false")
  .option("startingOffsets", "latest")
  .option("kafka.group.id", "$Default") // $Default IS THE DEFAULT EVENT HUBS CONSUMER GROUP
  .load()

// UNCHANGED PROCESSING LOGIC
rates
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .outputMode("append")
  .format("console")
  .option("truncate", false)
  .start()
  .awaitTermination()
When Should I Still Use an Apache Kafka Cluster?
For example, when you want or need to manage your own cluster, when you want to run Apache Kafka on-premises, or when there are certain features not yet supported by the Event Hubs for Kafka feature.
When Should I Use Event Hubs With Apache Kafka Clients?
When you are happy with what Event Hubs provides and want to reduce the time you spend managing clusters. Event Hubs offers better integration with existing Azure services. You can also mix and match Apache Kafka and Event Hubs clients, as the sketch below shows! Event Hubs supports many automation features, like Auto-Inflate, to scale the system and adjust it to the workload.
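For example, here is a rough sketch of a producer written against the native Event Hubs SDK (azure-messaging-eventhubs) publishing to the same testtopic hub that the Kafka-based Spark consumer above reads from; the connection string is a placeholder, and this assumes the SDK is attached to your cluster or classpath:
import com.azure.messaging.eventhubs.{EventData, EventHubClientBuilder}
val producer = new EventHubClientBuilder()
  .connectionString(
    "Endpoint=sb://EVENTHUBSNAME.servicebus.windows.net/;SharedAccessKeyName=SECRETKEYNAME;SharedAccessKey=SECRETKEYVALUE",
    "testtopic")
  .buildProducerClient()
// EVENTS SENT HERE ARE READABLE BY KAFKA CONSUMERS OF THE SAME HUB, AND VICE VERSA
val batch = producer.createBatch()
batch.tryAdd(new EventData("hello from the native Event Hubs client"))
producer.send(batch)
producer.close()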
Take a look at other interesting examples of technologies that Event Hubs for Apache Kafka can be used with.
Summary
We can keep using Apache Kafka libraries and connectors while using Event Hubs as the backend event-ingestion system, which opens the door to an incredible number of integrations.