Using Kafka as a Data Backbone: Part 1

dorner

Daniel Orner

Posted on November 22, 2019

Using Kafka as a Data Backbone: Part 1

Apache Kafka, a distributed, asynchronous streaming platform, has exploded in popularity over the last few years. It boasts a number of advantages, including fault-tolerance, availability, reliability and scalability, and is being used by hundreds of companies ranging from tiny startups to enormous companies like PayPal and Twitter.

In this article, I’ll describe how Kafka can be used to act as the data backbone for your microservice architecture, and provide a host of advantages while being able to solve many of the inherent disadvantages that come with that pattern. I’ll also cruelly hint at an awesome open-source toolkit we built that solves a big part of this but which will have an entire Part 2 dedicated to it. 😈

Event-Driven Microservices

First off — the use case we are describing here is for an ecosystem largely consisting of event-driven microservices. It’s beyond the scope of this article to dive deep into this pattern, but in a nutshell, we want to:

  • Decrease coupling between features and services.

  • Allow different teams to work independently on unrelated features.

  • Allow services to scale independently.

  • Allow services to be deployed independently.

  • Enforce a strong contract via technology (instead of convention) so we can be largely confident in contract testing instead of integration testing.

There are a number of challenges with this paradigm, though — in this article I’ll describe one way to use Apache Kafka to solve some of those challenges.

The Data Backbone

In order for our services to be able to do their jobs, they will have to do one of two things:

  1. Have all the data it needs to do its job, OR

  2. Be able to ask some other service in real-time for the data it needs to do its job.

Option 2 is the one mostly used by HTTP REST services. Unfortunately, this violates some of our goals above, notably decreased coupling and independence, since service B needs to know how to reach service A, and if service A goes down, service B now needs to handle that case.

Instead, our pattern is to copy all necessary data from service A into service B. This way, each service has immediate use of all the data it needs to do its job.

Just put all your data on the backbone!

Just put all your data on the backbone!

Ignoring the increased use of disk space (which is really cheap nowadays), there are a couple of problems with this approach:

  • How does service B get service A’s data? Having it talk directly to service A’s database is a no-no, since that violates independence yet again.

  • How can we ensure real-time (or close to real-time) updates of that data?

  • How can we make sure we don’t miss any updates?

  • How do we communicate the structure of the data (and changes to that structure) without being strongly coupled to the original service?

Turns out, Apache Kafka and Apache Avro have some cool features that will help us achieve our goals and neatly bypass most if not all of our challenges.

Upsert Messages

Every message in Kafka has a key associated with it. This key by convention does not indicate the ID of the message, but of the entity **represented by that message. This allows Kafka to contain “upsert” messages, where each message represents the **current state of an entity.

Kafka itself uses this idea in its compaction algorithms — if compaction is turned on, it will automatically delete all messages but the last one which share the same key. This allows new consumers to simply skip all previous updates of a particular entity and jump straight to its current state. A deletion can be indicated by sending the entity’s key with a null message.

For example, if you send a message on a Characters topic that says name: "Luke" with a key of 105, then a second message that says name: "Darth" with a key of 105, Kafka will eventually delete the first message and only leave the last, most updated one, giving Mr. 105 the name “Darth”. If you then send a null message with a key of 105, the last message will eventually be replaced with null, indicating Mr. 105 has been deleted (this is known as a “tombstone” message).

Key compaction — old data is removed and only latest state remains.

Key compaction — old data is removed and only latest state remains.

It’s a small feature, but it has huge ramifications. It means you can use Kafka as **the data backbone of your system — as a distributed, subscribable source of truth for any type of data. **You can send the current state of any object and downstream systems will know to create, update or delete it as necessary (I explore this fully below).

This means you can use Kafka to tie together any number of smaller services, each with their own database, as long as those services send the updates of their data to Kafka individually (and Kafka is set to keep those messages around as appropriate).

Structuring Your Data

This only solves part of the problem, though. We want to have these messages represent our data for anyone who wants to read it. We don’t want to get into API hell where every downstream system needs to know about the system that generated that data so it can understand its structure. We want true independence.

What we really want is to impose some kind of schema on our messages so that not only are we sending updated state, we are also enforcing a contract between producers and consumers so that state is reliable and understandable. Also important is that the contract needs to be flexible so that we can update the producer in specific ways without immediately causing all our downstream producers to crash.

Enter Apache Avro. This specification not only defines a schema language, but also a binary encoding of that language so that downstream systems can deterministically decode the message to the schema.

Even better, it introduces the concept of “compatible” schemas, meaning that the schemas on the upstream and downstream systems don’t have to match exactly as long as they are compatible. Upstream systems are free to add fields and make changes, as long as those changes are done in a compatible way, and downstream systems don’t need to move to the new schema unless they need the new changes.

Example of a compatible change — adding a new field with a default value.

Example of a compatible change — adding a new field with a default value.

The combination of Kafka and Avro provides a strongly structured, distributed data backbone for a set of microservices.

Duplicating Your Data

But if data is “owned” by Service A, but is needed by Service B, how does it get there? The answer is duplication of data. Service B contains some kind of materialization (in a SQL/NoSQL database, in memory, etc.) of the contents of a Kafka topic. This materialization is by definition ephemeral. At any time, a service should be able to blow away its materialization and reconstruct it from the Kafka topic. The topic is the source of truth of data — all materializations are a view into it.

This allows each service to truly be independent. Because data gets into a service asynchronously, no service has a direct link to any other service. If your Orders service has to know about Users, and your Users service is currently down, the Orders service doesn’t care. Its data might be a bit out of date, but it reads it from its own materialization of that data. In fact, it doesn’t even know that a Users service exists —all it knows is that there is a Users topic.

Consuming services get information from their data stores, not directly from the producing service.

Consuming services get information from their data stores, not directly from the producing service.

And because every message in the topic is encoded with the same Avro schema, it has a guarantee of what the contents of that topic look like — and because it provides its own reader schema, another guarantee that it’ll never crash when reading a message (assuming we don’t break the requirement that all messages in a topic are encoded with the same schema).

Now that’s all well and good. But how do we implement this “source of truth” in Kafka? What does it look like in the real world? More specifically, how does the data get into Kafka in a safe way?

Kafka and State — Uneasy Allies

The “ideal” state is one where all data is written to the topic first, and then read back into the service’s materialization of that topic. This means that even for producing services, there is no primary database or other state — the topic is everything.

“Ideal” workflow — write to topic, then read it back to the data store.

“Ideal” workflow — write to topic, then read it back to the data store.

In practice, this is hard to swallow. Confluent, who is the leading company in the Kafka business, provides some helpful tooling for JVM languages around this pattern (e.g. KTables and KSQL), but it all masks the point that Kafka is always asynchronous — and sometimes synchronous things need to happen. Particularly, anything that’s user-facing will often need to render the most up-to-date state back to the user. There are a number of tactics to allow the pattern to still work in this case (add a throbber / Ajax request, use some kind of cache, etc.) but they all introduce a whole whack of complexity which honestly isn’t really needed.

What’s more, using Kafka as your single state is simply not always the best option. Relational databases are decades old, and there are hundreds if not thousands of tools and frameworks built around them. They are well understood, extremely performant for most use cases, and incredibly good at what they do. Most web frameworks assume some kind of transaction-enabled, SQL-based database to do what they need to do. We should be able to leverage this standardized technology and the new hotness of Kafka at the same time.

Databases and Kafka

There are two current leaders in the field of marrying databases and Kafka: Confluent’s Kafka Connect (which polls tables in the database and turns them into Avro-encoded messages), and Debezium (which tails the binary log of your database). Both are used pretty extensively, but they each have downsides. Kafka Connect has no way of sending deletions, since it polls the database, and deleted rows aren’t in the database any more. Debezium is heavily tied to your internal database schema (so is Kafka Connect by default, but you can write custom connectors to get around that).

So unfortunately, neither of these solutions gives you the whole picture without a bunch of pain points.

In an ideal world, your database schema is perfect! Everyone will want every column on your table and nothing in there is confusing in any way.

In the real world, you need your database schema to be internal and shielded from your consumers. You’re going to have implementation details that just aren’t relevant to anyone else. You may also want to combine multiple tables into a single event without having to write a separate joiner service downstream or being forced to write views for every table. You want to publish external-facing messages which use your data but not your schema.

That’s problem 1.

The Transaction Gotcha

We’ve been talking about sending your database state around — but let’s not forget that Kafka can and should still be used to send events. Most services that have state will still want to send events down the Kafka pipe. Keeping this in mind introduces one more sneaky issue: Transactions.

When writing data to a database, the write can fail (due to deadlocks, timeouts, etc.). Writes can also fail to Kafka (downtime, timeouts, broker issues). If one succeeds and not the other, you’ve now broken your guarantees of “single source of truth”.

Let’s say you’re writing your Character to a table, and sending a “Character Requested” event to a separate Kafka topic (perhaps with user and timestamp information). If you do this all in one transaction, the Character write can succeed but the sending the event can fail (meaning you know about the new character but the event didn’t go out), or the event can succeed but the database write can fail (meaning the event went out but it’s referencing a character that was never written).

Write works to Kafka but not the database — Luke is in the topic but not the service’s own DB.

Write works to Kafka but not the database — Luke is in the topic but not the service’s own DB.

Write works to the DB but not Kafka — downstream systems will not get the new write.

Write works to the DB but not Kafka — downstream systems will not get the new write.

That’s problem 2.

So how do we do it? How do we marry our database and Kafka in a way that keeps our data eventually consistent, allows us to use all our lovely relational database tools for our web apps, provides the advantages of an Event-Driven Microservices architecture, and even lets us break up long-running monoliths?

Go forward and read Part 2!

💖 💪 🙅 🚩
dorner
Daniel Orner

Posted on November 22, 2019

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related

Using Kafka as a Data Backbone: Part 1
microservices Using Kafka as a Data Backbone: Part 1

November 22, 2019