Streaming job finishes before writing incremental data
I'm having a problem with a streaming job that uses Trigger.Once. When I run it for the first time, it works fine: it writes all available data from the source path and finishes. But on the next day, when there is new data available in the source path, the stream doesn't see it and finishes the query before writing any data. I'm using Auto Loader with an SQS queue. The checkpoint path is correct and contains the offsets and commits folders.
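For context, a minimal sketch of the kind of setup being described, assuming Databricks Auto Loader (the cloudFiles source) in file-notification mode; every path, option value, and the schema below is a hypothetical placeholder, not the asker's actual configuration:
import org.apache.spark.sql.streaming.Trigger

// Hypothetical Auto Loader read in file-notification mode
val df = spark.readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "json")           // format of the incoming files
  .option("cloudFiles.useNotifications", "true") // consume file events via SQS
  .schema("id LONG, payload STRING")             // placeholder schema
  .load("s3://my-bucket/landing/")               // placeholder source path

// One-shot incremental write; reruns resume from the checkpoint
df.writeStream
  .format("delta")
  .option("checkpointLocation", "s3://my-bucket/_checkpoints/landing/")
  .trigger(Trigger.Once())
  .start("s3://my-bucket/tables/landing/")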
Solution 1:[1]
The behavior you expect (continuously picking up new files as they arrive) requires a continuously running query, i.e., no trigger, as below:
// Default trigger (runs micro-batch as soon as it can)
df.writeStream
.format("console")
.start()
The difference between the available triggers is shown below:
import org.apache.spark.sql.streaming.Trigger

// Default trigger (runs micro-batch as soon as it can)
df.writeStream
.format("console")
.start()
// ProcessingTime trigger with two-seconds micro-batch interval
df.writeStream
.format("console")
.trigger(Trigger.ProcessingTime("2 seconds"))
.start()
// One-time trigger
df.writeStream
.format("console")
.trigger(Trigger.Once())
.start()
// Continuous trigger with one-second checkpointing interval
df.writeStream
.format("console")
.trigger(Trigger.Continuous("1 second"))
.start()
You can refer to https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#triggers for more detail.
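To illustrate the one-time trigger in particular: when every run points at the same checkpoint location, each invocation processes only the data that arrived since the previous run and then stops. A minimal sketch of that pattern, with a hypothetical checkpoint path and reusing df from the snippets above:
import org.apache.spark.sql.streaming.Trigger

// One-shot run that resumes from the checkpoint, processes newly
// arrived data, and stops by itself; the path is a placeholder.
val query = df.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/demo/")
  .trigger(Trigger.Once())
  .start()

query.awaitTermination() // returns once the single micro-batch commits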
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Warren Zhu |
