'Flink Sorting A Global Window On A Bounded Stream
I've built a flink application to consume data directly from Kafka but in the event of a system failure or a need to re-process this data, I need to instead consume the data from a series of files in S3. The order in which messages are processed is very important, so I'm trying to figure out how I can sort this bounded stream before pushing these messages through my existing application.
I've tried inserting the stream into a temporary table using the Table API but the sort operator always uses a maximum parallelism of 1 despite sorting on two keys. Can I leverage these keys somehow to increase this parallelism ?
I've been thinking of using a keyed global window but I'm not sure how to trigger on a bounded stream and sort the window. Is Flink a good choice for this kind of batch processing and would it be a good idea to write this using the old Dataset API?
Edit
After some experimentation, I've decided that Flink isn't the correct solution and Spark is just more feature rich in this particular use case. Im trying to consume and sort over 1.5tb of data in each job. Unfortunately some of these partitions contain maybe 100G or more and everything must be in order before I can break those groups up further, which makes sorting this data in the operators difficult.
My requirements are simple, ingest the data from S3 and sort by channel ID before flushing it to disk. Having to think about windows and timestamp assignors just complicates a relatively simple task that can be achieved in 4 lines of Spark code.
Solution 1:[1]
Have you considered using the HybridSource for your use case, since this is exactly for what is was designed? https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/hybridsource/
The DataSet API is deprecated and I would not recommend to use it.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Martijn Visser |
