Kafka Topic - filter and dispatch messages
Background
Our software solution collects data ("events") per customer.
A small fraction of customers (~3%) ask to have this data delivered into "their systems" (they pay for this service).
A target system to which we need to send those events might be:
- AWS S3
- Azure Storage
- Splunk
- DataDog
- More target systems to come in the future.
All of the target systems above have well-known Kafka Connect sink connectors, so the idea is to use those connectors to export the data.
Possible Solution
- All customer events go to one "input" topic
- Custom software consumes messages from the Kafka "input" topic
- The software looks at the message attributes and, based on the value of one of them (let's call it customer_id), decides whether the message should be dropped or published to another Kafka topic named '<customer_id>_topic'.
The destination topic will probably be part of a different cluster. I understand this can be easily done using Kafka Streams.
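Whichever runtime ends up doing the dispatch, the per-message decision itself is simple. A minimal sketch in Python of the drop-or-route logic described above (the function and variable names are illustrative, not part of any Kafka API):

```python
def route_event(event, exporting_customers):
    """Decide where a consumed event should go.

    Returns the destination topic name ('<customer_id>_topic') if the
    customer pays for the export service, or None to drop the event.
    """
    customer_id = event.get("customer_id")
    if customer_id in exporting_customers:
        return f"{customer_id}_topic"
    return None  # the other ~97% of events are dropped


# Example: only customer "acme" has the export service enabled
exporting = {"acme"}
print(route_event({"customer_id": "acme", "payload": "..."}, exporting))   # acme_topic
print(route_event({"customer_id": "other", "payload": "..."}, exporting))  # None
```

In a Kafka Streams topology the same decision would sit in a filter followed by a dynamic topic-name extractor; in a custom consumer it would run per polled record before producing.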
Note that I am aware of the thread Disperse messages in Kafka stream
My question is - can it be done using Kafka Connect and SMT?
I am looking for a "managed" solution: since our Kafka runs in AWS MSK, I would not need to manage the Kafka Connect cluster. With Kafka Streams I would have to install my software on EC2 / ECS - wouldn't I?
Solution 1:[1]
The destination topic will probably be part of a different cluster. I understand this can be easily done using Kafka Streams
Kafka Streams can/should only write to the same cluster. It cannot guarantee delivery to others.
For sending data to other clusters, MirrorMaker would be a starting point.
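As an illustration, a MirrorMaker 2 configuration along these lines could replicate the per-customer export topics to the destination cluster (the cluster aliases, bootstrap servers, and topic pattern are placeholders, not values from the question):

```properties
# connect-mirror-maker.properties (illustrative values)
clusters = ours, customer
ours.bootstrap.servers = ours-broker:9092
customer.bootstrap.servers = customer-broker:9092

# replicate only the per-customer export topics, one direction
ours->customer.enabled = true
ours->customer.topics = .*_topic
```

Note that by default MirrorMaker 2 prefixes replicated topic names with the source cluster alias (e.g. ours.acme_topic); a custom ReplicationPolicy is needed if the destination side expects the original names.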
As you might know, RegexRouter can rename the topic, but it cannot pull dynamic field values out of the record to build the topic name - you'd need to write your own transform for that.
You should also be able to use the Filter transform to inspect/drop events, but out of the box this only works on top-level fields, not nested ones.
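For reference, this is what the static RegexRouter rename looks like in a sink connector configuration (the transform alias and the regex/replacement values are illustrative). It can rewrite an existing topic name, but it has no access to record field values such as customer_id:

```json
"transforms": "route",
"transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.route.regex": "(.*)_topic",
"transforms.route.replacement": "export_$1"
```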
Overall, I find having one topic name "per id" a bad design, assuming you might (eventually) have tens to thousands of ids.
Alternatively, managing tens to thousands of clusters "per customer" (or at least sectioning off clusters with quotas "per client", though it is unclear how multi-tenancy would work with duplicated topic names) might be difficult too - but that is basically what MSK and Confluent Cloud are doing.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |
