How to remove duplicate input messages using Kafka Streams

I have a topic wherein I get a burst of events from various devices. There are n devices, each of which emits a weather report every s seconds.

The problem is that these devices emit 5-10 records of the same value every s seconds. So the output in the Kafka topic for a single device looks like this:

For device1: t1, t1, t1, t1 (all in the same moment, then a gap of s seconds), t2, t2, t2, t2 (gap of s seconds), t3, t3, t3, t3

However, I want to remove these duplicate records that arrive as a burst of events, so that I consume the stream as: t1, t2, t3, ...

I was trying to use the windowing and KTable concepts that the Kafka Streams API provides, but it doesn't seem possible. Any ideas?



Solution 1:[1]

You might want to use Kafka's log compaction. But in order to use it, all the duplicated messages must share the same key, and non-duplicate messages must have distinct keys. Have a look at the documentation:

https://kafka.apache.org/documentation/#compaction
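
For reference, a compacted topic can be created programmatically with the Kafka AdminClient. Below is a minimal sketch; the broker address localhost:9092 and the topic name weather-reports-compacted are placeholder assumptions, not details from the answer.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

import java.util.Map;
import java.util.Properties;
import java.util.Set;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed broker address; adjust for your cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // cleanup.policy=compact keeps only the latest record per key,
            // so older records sharing a key are eventually removed.
            NewTopic topic = new NewTopic("weather-reports-compacted", 1, (short) 1)
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                    TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(Set.of(topic)).all().get();
        }
    }
}
```

One caveat worth noting: compaction runs asynchronously in the background, so consumers reading the topic before the log cleaner has run will still see duplicates. It bounds what is retained per key rather than guaranteeing immediate deduplication.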

Solution 2:[2]

Would it be an option to read the topic into a KTable, using t as the key? The duplicated values would then be treated as upserts rather than inserts, which would effectively drop them. You could then write the KTable out to another topic.
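
A minimal sketch of that idea with the Kafka Streams DSL is shown below. The topic names, the String serdes, and treating the reading t as the record key are illustrative assumptions, not details from the original answer.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class DedupWithKTable {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "weather-dedup");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Reading the topic as a KTable treats each record as an upsert:
        // a repeated record with the same key (here, assumed to be the
        // reading t) overwrites the previous entry instead of adding a row.
        KTable<String, String> latest = builder.table(
                "weather-reports",
                Consumed.with(Serdes.String(), Serdes.String()));

        // Emit the table's changelog to an output topic.
        latest.toStream().to("weather-reports-deduped",
                Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```

With record caching enabled (the default), consecutive updates to the same key are collapsed within the commit interval, which drops most in-burst duplicates before they reach the output topic. This is best-effort rather than a strict guarantee, so exact deduplication would need a custom processor backed by a state store.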

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution sources:
Solution 1: Sebas
Solution 2: rocknroll