'apache flink to process only x% of events in a tumbling windows

I have an AWS kinesis data stream continuously receiving events and these events are sent to kinesis data analytics to get metrics over a tumbling window using apache flink.

is it possible to dump x% of random data in a tumbling window to s3 bucket? if yes, please share the code snippet.



Solution 1:[1]

Your tumbling window output, if it is not aggregated, I assume will be some kind of Seq[Element]. The simplest approach I can think of would be to sample the elements you want to store in S3 using a flatMap operator and connect the output to a FileSink writing to S3.

The code will look quite different depending on the output format. The simplest example would be:

val windowOutStream: Seq[Element] = ...
val sampledOutStream: Seq[String] = windowOutStream.flatMap(window => {
// iterate over all the elements in the window
// filter the ones you want to store to S3
// encode the element as a string and add the window start/end times so you can later identify to which window they belonged to
})

// writeAsText will write each element in a new line in the same file
sampledOutStream.writeAsText("s3://<bucket>/<endpoint>");

You'll probably want to use a more optimized output format than just a string, and also add a rolling policy so new files are automatically created.

The docs show how to initialize Row or Bulk encoded file sinks that use other output encodings and are more flexible, but the idea would be similar.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1