Memory leak in Spark Structured Streaming driver

I am using Spark 3.1.2 with Hadoop 3.2.0 to run Spark Structured Streaming (SSS) aggregation jobs on Kubernetes (Spark on K8s).

These jobs read files from S3 using the File Source input provided by SSS, and also use S3 for checkpointing (with the directory output committer).
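For context, here is a minimal sketch of the kind of job described above. The bucket names, schema, and the aggregation itself are placeholders, since the actual query is not shown in the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object StreamingAggregation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("s3-file-source-aggregation")
      .getOrCreate()
    import spark.implicits._

    // File Source: Spark lists new files under this prefix on every micro-batch.
    // Bucket paths and schema are hypothetical, not from the original post.
    val events = spark.readStream
      .schema("user_id STRING, amount DOUBLE, event_time TIMESTAMP")
      .json("s3a://my-input-bucket/events/")

    // A simple windowed aggregation; the exact query in the post is unknown.
    val totals = events
      .withWatermark("event_time", "10 minutes")
      .groupBy(window($"event_time", "5 minutes"), $"user_id")
      .agg(sum($"amount").as("total_amount"))

    // Checkpoint and output both live on S3, as described in the question.
    val query = totals.writeStream
      .outputMode("append")
      .format("parquet")
      .option("path", "s3a://my-output-bucket/aggregates/")
      .option("checkpointLocation", "s3a://my-checkpoint-bucket/agg-job/")
      .start()

    query.awaitTermination()
  }
}
```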

What I noticed is that, after a few days of running, the driver runs into memory issues and crashes.

As the driver is not doing much (just calling Spark SQL functions and writing the output to S3), I am wondering how to find the source of these memory issues (is memory leaking from the Hadoop/S3A library?) and how I can fix them.
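One way to narrow down where the memory goes is to take heap dumps from inside the driver JVM and inspect them offline. Below is a minimal sketch using only the standard JDK management APIs (nothing Spark-specific; the dump path and the MiB logging are illustrative):

```scala
import java.lang.management.ManagementFactory
import com.sun.management.HotSpotDiagnosticMXBean

object DriverHeapDiagnostics {
  // Proxy to the HotSpot diagnostic MBean of the current (driver) JVM.
  private val diagBean = ManagementFactory.newPlatformMXBeanProxy(
    ManagementFactory.getPlatformMBeanServer,
    "com.sun.management:type=HotSpotDiagnostic",
    classOf[HotSpotDiagnosticMXBean])

  // Write an .hprof snapshot of live objects; open it with Eclipse MAT or
  // VisualVM to see which classes retain the most heap.
  def dump(path: String): Unit = diagBean.dumpHeap(path, /* live = */ true)

  // Log current heap usage; useful to call periodically (e.g. once per batch)
  // to correlate heap growth with streaming progress.
  def logHeapUsage(): Unit = {
    val heap = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage
    println(s"heap used=${heap.getUsed / (1 << 20)} MiB " +
            s"committed=${heap.getCommitted / (1 << 20)} MiB " +
            s"max=${heap.getMax / (1 << 20)} MiB")
  }
}
```

Alternatively, passing `-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp` to the driver through `spark.driver.extraJavaOptions` at submit time captures a dump automatically at the moment of the crash; comparing two dumps taken days apart should make a leaking class visible.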

*(Screenshot: memory usage of the Spark driver over time)*

As shown in the screenshot, the driver takes some time before using all of its memory, and once it reaches that point, GC seems to run often enough to keep up. But after about a week of running, it crashes, as if GC no longer runs often enough or no longer finds anything to clean.



Source: Stack Overflow, licensed under CC BY-SA 3.0.
