Analytics don't want duplicated data, so get it exactly-once with Flink/Kafka

A data engineer's main task is to deliver data from multiple sources (a database, a Kafka cluster, or something else) to a destination, applying a defined transformation along the way.

An important requirement here is that the output matches the input: data should be neither lost nor duplicated.
Either problem lowers data quality and accuracy, and leads to incorrect analysis results.

Rules of message delivery

A data pipeline is usually composed of multiple components. It should include a message delivery cluster (such as Kafka or Pulsar) and a data processing system (such as Spark or Flink), and more components can be added depending on its purpose. In its basic form, data flows from the sources through a 'data delivery platform' into a 'data processing platform', and from there to the destination.

Now think of a case where an outage occurs in the 'data processing platform'. It stops consuming data from the delivery platform until it recovers, and unless there is some additional workaround, the data that was in flight is lost, because the 'data delivery platform' considers those messages already sent.

To control this, Kafka offers 3 delivery options.

At most once: Messages are delivered at most once. If there is a system failure, messages may be lost and are not redelivered.
At least once: This means messages are delivered one or more times. If there is a system failure, messages are never lost, but they may be delivered more than once.
Exactly once: This is the preferred behavior in that each message is delivered once and only once. Messages are never lost or read twice even if some part of the system fails.
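The difference between the first two options can be sketched with a toy consumer. This is a minimal simulation, not Kafka client code: the `run` function, the in-memory offset, and the single simulated crash are all assumptions made for illustration. The only thing that changes between the two modes is whether the offset is committed before or after the message is processed.

```python
class CrashOnce(Exception):
    """Simulated outage in the processing platform (hypothetical)."""


def run(messages, mode, crash_at):
    """Consume `messages`, crashing once while handling index `crash_at`.

    mode="at-most-once":  commit the offset BEFORE processing.
    mode="at-least-once": commit the offset AFTER processing.
    Returns the list of messages that reached the destination.
    """
    committed = 0      # offset stored on the delivery platform
    delivered = []     # what the destination received
    crashed = False

    while committed < len(messages):
        offset = committed  # after a restart, resume from the committed offset
        try:
            while offset < len(messages):
                msg = messages[offset]
                if mode == "at-most-once":
                    committed = offset + 1      # mark as sent first
                    if offset == crash_at and not crashed:
                        crashed = True
                        raise CrashOnce()       # crash before processing
                    delivered.append(msg)
                else:  # at-least-once
                    delivered.append(msg)       # process first
                    if offset == crash_at and not crashed:
                        crashed = True
                        raise CrashOnce()       # crash before the commit
                    committed = offset + 1
                offset += 1
        except CrashOnce:
            continue  # restart the consumer
    return delivered


print(run(["a", "b", "c"], "at-most-once", crash_at=1))   # "b" is lost
print(run(["a", "b", "c"], "at-least-once", crash_at=1))  # "b" arrives twice
```

With at-most-once the crashed message never reaches the destination, and with at-least-once it is processed again after the restart. Exactly-once needs the extra machinery described in the next section.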

Exactly-once processing for Kafka -> Flink

As the name suggests, it is the logic that guarantees the data pipeline delivers each record to the destination only once, without data loss.

Of course this is the ideal result, but it requires more complicated logic than the other 2. The following process is for Kafka -> Flink (though other popular services have similar mechanisms).

For Flink

Flink uses a checkpointing mechanism to achieve exactly-once processing within its own ecosystem.

Here's how it works:

Checkpointing: Flink periodically takes snapshots of the entire application state, including the position in the input streams.
Distributed Snapshots: These checkpoints are consistent across all parallel instances of the application.
Failure Recovery: If a failure occurs, Flink restores the entire application state from the last successful checkpoint.
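The key idea in the steps above is that the state and the input position are snapshotted and restored together. Here is a minimal single-process sketch of that idea. It is not Flink's implementation (there is no distributed snapshot here), and `CountingJob`, `run_with_failure`, and their parameters are hypothetical names chosen for illustration.

```python
import copy


class CountingJob:
    """Toy stateful job: counts words read from an in-memory 'stream'."""

    def __init__(self, stream):
        self.stream = stream
        self.offset = 0             # position in the input stream
        self.counts = {}            # operator state
        self.checkpoint = (0, {})   # last successful snapshot

    def process_one(self):
        word = self.stream[self.offset]
        self.counts[word] = self.counts.get(word, 0) + 1
        self.offset += 1

    def take_checkpoint(self):
        # Snapshot state and input position together, as one consistent unit.
        self.checkpoint = (self.offset, copy.deepcopy(self.counts))

    def restore(self):
        # Roll BOTH the offset and the state back to the last checkpoint.
        self.offset, counts = self.checkpoint
        self.counts = copy.deepcopy(counts)


def run_with_failure(stream, checkpoint_every, fail_at):
    """Process the stream, simulating one failure just before index `fail_at`."""
    job = CountingJob(stream)
    failed = False
    while job.offset < len(stream):
        if job.offset == fail_at and not failed:
            failed = True
            job.restore()   # state written since the checkpoint is discarded
            continue
        job.process_one()
        if job.offset % checkpoint_every == 0:
            job.take_checkpoint()
    return job.counts


words = ["a", "b", "a", "c", "a"]
print(run_with_failure(words, checkpoint_every=2, fail_at=3))
```

Because the records between the checkpoint and the failure are replayed against the restored state, each record affects the final counts exactly once, even though one of them was read twice.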

For Kafka

To achieve exactly-once semantics when reading from and writing to Kafka, Flink combines checkpointed offsets on the read side with a two-phase commit protocol on the write side:

Reading from Kafka: Flink's Kafka consumer stores offsets in checkpoints, ensuring that after a failure, it resumes from the correct position.
Writing to Kafka: Flink uses Kafka's transaction feature to ensure that writes are atomic and tied to Flink's checkpoints.

Two-Phase Commit Protocol:

The two-phase commit protocol ensures that the state changes in Flink and the writes to Kafka are atomically committed.

Prepare Phase: During checkpointing, Flink prepares the Kafka transaction but doesn't commit it yet.
Commit Phase: After the checkpoint is successfully completed, Flink commits the Kafka transaction.
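The two phases can be sketched with a toy in-memory log standing in for a transactional Kafka topic. This is only an illustration under simplifying assumptions: `TransactionalLog` and `run_pipeline` are hypothetical, the failure is only ever injected while a transaction is still open, and a failure between the prepare and commit phases (which a real system must also handle on recovery) is not modelled.

```python
class TransactionalLog:
    """Toy stand-in for a transactional output topic (hypothetical, in-memory)."""

    def __init__(self):
        self.committed = []    # records visible to downstream readers
        self.open_txn = []     # records written but not yet visible
        self.prepared = None   # pre-committed records awaiting the commit phase

    def write(self, record):
        self.open_txn.append(record)

    def pre_commit(self):
        # Prepare phase: close the current transaction, open a fresh one.
        self.prepared, self.open_txn = self.open_txn, []

    def commit(self):
        # Commit phase: make the pre-committed records visible.
        if self.prepared is not None:
            self.committed.extend(self.prepared)
            self.prepared = None

    def abort(self):
        # Discard records of the open transaction; they were never visible.
        self.open_txn = []


def run_pipeline(source, checkpoint_every, fail_at):
    """Uppercase each record, simulating one failure just before index `fail_at`."""
    log = TransactionalLog()
    offset = 0
    checkpointed_offset = 0
    failed = False
    while offset < len(source):
        if offset == fail_at and not failed:
            failed = True
            log.abort()                    # drop uncommitted output
            offset = checkpointed_offset   # rewind the input position
            continue
        log.write(source[offset].upper())
        offset += 1
        if offset % checkpoint_every == 0:
            log.pre_commit()               # prepare phase, during the checkpoint
            checkpointed_offset = offset   # checkpoint completes successfully
            log.commit()                   # commit phase, after the checkpoint
    log.pre_commit()                       # final checkpoint at end of input
    log.commit()
    return log.committed


print(run_pipeline(["a", "b", "c", "d", "e"], checkpoint_every=2, fail_at=3))
```

The record written just before the failure is discarded with the aborted transaction and written again after the rewind, so the committed output contains every record once.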

TwoPhaseCommitSinkFunction:

Flink provides a TwoPhaseCommitSinkFunction that abstracts the common logic of the two-phase commit protocol. This makes it easier for developers to implement exactly-once sinks for various external systems.
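To give a feel for what such an abstraction asks of a sink author, here is a Python analogue built around five hooks: begin a transaction, write a value, pre-commit, commit, and abort. It is a sketch of the pattern, not Flink's Java API. The class names, method names, and the file-based sink (write to a temporary file, then atomically rename on commit) are assumptions made for illustration.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class TwoPhaseCommitSink(ABC):
    """Python analogue of a two-phase commit sink (not Flink's actual API)."""

    @abstractmethod
    def begin_transaction(self): ...

    @abstractmethod
    def invoke(self, txn, value): ...

    @abstractmethod
    def pre_commit(self, txn): ...

    @abstractmethod
    def commit(self, txn): ...

    @abstractmethod
    def abort(self, txn): ...


class FileSink(TwoPhaseCommitSink):
    """Writes each transaction to a temp file and renames it on commit."""

    def __init__(self, target_dir):
        self.target_dir = target_dir
        self.seq = 0

    def begin_transaction(self):
        # Temp file lives in the target directory so the final rename is atomic.
        fd, tmp = tempfile.mkstemp(dir=self.target_dir, suffix=".inprogress")
        final = os.path.join(self.target_dir, f"part-{self.seq}.txt")
        self.seq += 1
        return {"file": os.fdopen(fd, "w"), "tmp": tmp, "final": final}

    def invoke(self, txn, value):
        txn["file"].write(value + "\n")

    def pre_commit(self, txn):
        # Flush everything to disk; no more writes to this transaction.
        txn["file"].flush()
        os.fsync(txn["file"].fileno())
        txn["file"].close()

    def commit(self, txn):
        # Safe to call twice: the rename only happens if the temp file exists.
        if os.path.exists(txn["tmp"]):
            os.replace(txn["tmp"], txn["final"])

    def abort(self, txn):
        if not txn["file"].closed:
            txn["file"].close()
        if os.path.exists(txn["tmp"]):
            os.remove(txn["tmp"])


def committed_lines(target_dir):
    """Read back only the committed part files, in name order."""
    lines = []
    for name in sorted(os.listdir(target_dir)):
        if name.startswith("part-"):
            with open(os.path.join(target_dir, name)) as f:
                lines.extend(f.read().splitlines())
    return lines
```

A short usage example: one transaction is committed, a second is aborted, and only the first is visible afterwards.

```python
out_dir = tempfile.mkdtemp()
sink = FileSink(out_dir)

t1 = sink.begin_transaction()
sink.invoke(t1, "a")
sink.invoke(t1, "b")
sink.pre_commit(t1)
sink.commit(t1)

t2 = sink.begin_transaction()
sink.invoke(t2, "c")
sink.abort(t2)

print(committed_lines(out_dir))
```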

So, is there a reason not to use it?

As you can expect, achieving exactly-once in a distributed streaming system like the one above is a very complex task. And although these processing systems offer mechanisms that are as robust as possible, exactly-once requires more processing resources and can affect performance.

So instead of using it blindly, you can adopt it case by case. If data duplication is acceptable, you can use the 'at-least-once' option, which redelivers data from the last successful point. Or, if some data loss is more acceptable than duplicated data, you can use 'at-most-once', which just delivers the data without checking the status of the endpoint system.

kination