Spark Structured Streaming Kafka windowing: can't get partial window output to show up, and wondering what I'm doing wrong
My use case is as follows:
I am consuming from a real-time Kafka topic and attempting to perform windowed aggregations.
I want to calculate the sum of a numeric column grouped by an identifier over a 1-hour window in a Spark Structured Streaming job. The PySpark code is something like this:
from pyspark.sql.functions import col, count, sum, window

windowed_df = (
    input_df.withWatermark("timestamp", "10 minutes")
    .groupBy(
        col("identifier"),
        window(col("timestamp"), "1 hour", "30 minutes"),
    )
    .agg(sum("numeric_column"), count("numeric_column"))
)
The catch here is that I want real-time results for the window even if the window hasn't finished yet. For example, if data arrives for an identifier within the first 30 seconds, I want a partial sum emitted right away, before the 1-hour mark, and then a more complete sum emitted later as more data arrives.
My actual goal is to check whether an identifier's numeric column exceeds a threshold within 1 hour, but I don't want to wait the full hour to find out: if the threshold is already exceeded after a few seconds, I want to know right away. So I'd like partial window sums emitted every time a row is updated, not only once the window has closed.
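To make the behavior I want concrete, here is a plain-Python toy (no Spark involved; the names, threshold, and alert format are all made up for illustration) showing the alerting semantics: a running per-identifier sum that fires as soon as the threshold is crossed, without waiting for the window to close.

```python
from collections import defaultdict

THRESHOLD = 100  # hypothetical alert threshold

# Running per-identifier sums for the current 1-hour window (toy model:
# window expiry/advance is ignored here, only the "alert early" part is shown).
partial_sums = defaultdict(float)
alerted = set()

def on_event(identifier, value):
    """Update the partial sum and alert immediately when the threshold is crossed."""
    partial_sums[identifier] += value
    if partial_sums[identifier] > THRESHOLD and identifier not in alerted:
        alerted.add(identifier)
        return f"ALERT: {identifier} exceeded {THRESHOLD}"
    return None

# Two quick events for the same identifier: the second one should alert
# mid-window rather than waiting for the hour to elapse.
print(on_event("a", 60))  # still under threshold
print(on_event("a", 50))  # crosses 100, alert fires immediately
```

This is exactly the "update as rows change" behavior I was hoping the `update` output mode would give me for the windowed aggregation.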
I've already tried the "update" output mode instead of "append", and I thought it would do what I want, but I'm still not getting any output at all in the time frame that I want.
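For completeness, the query is started roughly like this (the console sink and 10-second trigger below are placeholders for debugging, not my real sink configuration):

```python
query = (
    windowed_df.writeStream
    .outputMode("update")                    # emit updated rows each trigger, not only closed windows
    .format("console")                       # debug sink; real sink omitted
    .option("truncate", "false")
    .trigger(processingTime="10 seconds")
    .start()
)
query.awaitTermination()
```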
Also: I'm using Spark 2.3.
I'm guessing this is something I am fundamentally misunderstanding about window functions / Spark.
Any advice?
Sources
Source: Stack Overflow. This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.