Understanding and Resolving Infinite Consumer Lag Growth on Compacted Kafka Topics
Kleinanzeigen & mobile.de
Posted on June 25, 2024
an article by André Charton
Kleinanzeigen has been using Kafka since 2016 as its distributed streaming platform of choice. We run many real-time data pipelines and streaming applications on top of it. Some of our topics are compacted...
What is a compacted topic?
A compacted topic in Apache Kafka is a topic with Kafka’s log compaction feature enabled. Compaction retains the latest record for each key in the topic while removing older records for the same key. We apply this pattern to the topics in front of our Elasticsearch indices, so each topic serves as a scalable source of truth for regular indexing and also for a full index.
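To make the effect of compaction concrete, here is a minimal sketch in plain Python. It is a toy model, not Kafka's actual implementation: the `compact` function, the record layout, and the `ad-…` keys are assumptions made for illustration. The point it shows is that surviving records keep their original offsets, so gaps appear in the offset sequence.

```python
from typing import Dict, List, Tuple

# A record is modelled as (offset, key, value). This layout is hypothetical.
Record = Tuple[int, str, str]


def compact(log: List[Record]) -> List[Record]:
    """Keep only the latest record per key. Offsets are preserved, not renumbered."""
    latest: Dict[str, Record] = {}
    for record in log:
        _offset, key, _value = record
        latest[key] = record  # a later record for the same key replaces the earlier one
    return sorted(latest.values(), key=lambda r: r[0])


log = [
    (0, "ad-1", "v1"),
    (1, "ad-2", "v1"),
    (2, "ad-1", "v2"),
    (3, "ad-3", "v1"),
    (4, "ad-2", "v2"),
]

compacted = compact(log)
for offset, key, value in compacted:
    print(offset, key, value)
```

In this toy log, offsets 0 and 1 disappear because newer records exist for `ad-1` and `ad-2`, while the remaining records stay at offsets 2, 3 and 4.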
What is consumer lag?
Consumer lag is a metric that measures how far a consumer is behind the latest message in a Kafka topic partition. It represents the number of messages that the consumer still needs to process. We sometimes see lag increase when an application hits a bottleneck, during network issues, and so on.
Monitoring consumer lag is our default way to make sure that consumers keep up with the producers. We expose this metric for our clusters, collect it in Prometheus, and visualise it in Grafana.
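Lag is commonly computed per partition as the log end offset minus the consumer group's committed offset. The following sketch shows that arithmetic under this assumption. The function name `consumer_lag` and the sample offsets are made up for illustration and are not taken from our monitoring setup.

```python
from typing import Dict


def consumer_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Return the lag per partition: log end offset minus committed offset."""
    return {
        partition: max(end_offsets[partition] - committed[partition], 0)
        for partition in end_offsets
    }


# Hypothetical offsets for a topic with two partitions.
end_offsets = {0: 1500, 1: 1200}
committed = {0: 1450, 1: 1200}

lag = consumer_lag(end_offsets, committed)
total_lag = sum(lag.values())
print(lag, total_lag)
```

Note that this is a pure offset difference. It equals the number of outstanding messages only when every offset in between still holds a record.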
What is an offset reset?
In Apache Kafka, an offset reset refers to the operation of changing the current offset position for a consumer group. The offset determines the position from which the consumer will start reading records from a partition. We can use this operation to run a full index on the indices described above.
Why infinite growth?
We have been using Kafka for more than 8 years, so some topics are getting older and older. One example is a compacted topic containing user-posted ads, which we use for the full index of our major search index. Over the years, the lag we see during a full index operation has grown bigger and bigger. Recently we saw numbers above 400M. We wondered, got nervous, and investigated. But it follows from the nature of combining a compacted topic with an offset reset.
Over time, the distance between “now” and the oldest record will grow until the oldest record is gone. We have some user ads from even before 2016, because users can extend an ad’s lifetime again and again. So when we perform an offset reset, a consumer will start at the beginning: [0], or in our sample at [2]. Our lag metric would show a lag of [8], although the consumer only needs to process 3 records. This explains the spike we saw in the Grafana metric, which measures “just” the offset distance.
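The numbers from the sample can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the start offset 2, the lag of 8 and the 3 remaining records come from the text, the log end offset of 10 is inferred from them, and the positions of the other two surviving records (5 and 9) are invented for illustration.

```python
# Offsets that survived compaction. Offset 2 is from the sample in the text;
# offsets 5 and 9 are assumed positions for the other two remaining records.
retained_offsets = [2, 5, 9]

# Inferred from the text: start offset 2 plus a reported lag of 8.
log_end_offset = 10

# After a reset to the beginning, the consumer sits at the earliest retained offset.
position_after_reset = min(retained_offsets)

# What an offset-based lag metric reports.
offset_lag = log_end_offset - position_after_reset

# What the consumer actually has to read.
records_to_process = len([o for o in retained_offsets if o >= position_after_reset])

print("offset-based lag:", offset_lag)
print("records to process:", records_to_process)
```

The gap between the two values is the number of offsets that compaction has already removed between the consumer position and the end of the log.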
Conclusion
Be careful when interpreting lag metrics on compacted topics after an offset reset. In our example of a full index with a lag of 400M, we counted just under 60M records that actually got processed.
Another option could be to rewrite the topic under a new topic name using MirrorMaker. But for us, understanding the behaviour is enough.
Special thanks to my colleague Daniil Roman, who inspired me to write this article.