How to reduce the number of checkpoint files written by Spark Streaming

If a Spark Streaming job involves shuffle and stateful processing, it can easily generate a large number of small checkpoint files per micro-batch. The goal is to decrease the number of files without hurting latency.



Solution 1:[1]

With all default configs, one Spark Streaming micro-batch can leave around 80 K checkpoint files retained. This causes high QPS and latency on HDFS. Changing the configs below reduces the number of checkpoint files.

| Config | Default | Suggested |
| --- | --- | --- |
| spark.sql.streaming.minBatchesToRetain | 100 | 30 |
| spark.sql.streaming.stateStore.minDeltasForSnapshot | 10 | 5 |
| spark.sql.shuffle.partitions | 200 | Depends on micro-batch size; 50 or 100 |
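
One way to apply the suggested values is through `spark-defaults.conf` (or equivalent `--conf` flags on `spark-submit`). A minimal sketch, assuming a partition count of 100 (the right value depends on your micro-batch size, per the table above):

```properties
# spark-defaults.conf — suggested values from the table above
spark.sql.streaming.minBatchesToRetain              30
spark.sql.streaming.stateStore.minDeltasForSnapshot 5
spark.sql.shuffle.partitions                        100
```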

So, total number of files = minBatchesToRetain * 4 state stores per partition (2 for the left join side + 2 for the right) * shuffle partitions * number of stateful operators (each join or aggregation counts as one)

If all configs are at their defaults, this is 100 * 4 * 200 * 1 = 80,000 files (80 K).
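
The formula can be sketched as a quick calculation. This is a minimal illustration; the function name is hypothetical, and the factor of 4 assumes a stream-stream join as described above:

```python
def checkpoint_file_count(min_batches_to_retain, shuffle_partitions,
                          stateful_operators, stores_per_operator=4):
    """Approximate number of retained checkpoint files.

    stores_per_operator=4 reflects a stream-stream join
    (2 state stores for the left side + 2 for the right side).
    """
    return (min_batches_to_retain * stores_per_operator
            * shuffle_partitions * stateful_operators)

# Defaults: 100 batches * 4 * 200 partitions * 1 operator
print(checkpoint_file_count(100, 200, 1))  # 80000

# Suggested values with 100 partitions: 30 * 4 * 100 * 1
print(checkpoint_file_count(30, 100, 1))   # 12000
```

With the suggested values, the retained file count drops from 80 K to roughly 12 K, well under a fifth of the default.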

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

| Solution | Source |
| --- | --- |
| Solution 1 | Warren Zhu |