Spark overwrite does not delete files in target path
My goal is to build a daily process that overwrites all partitions under a specific path in S3 with new data from a DataFrame.

I do:

```python
df.write.format(source).mode("overwrite").save(path)
```

(I also tried the dynamic overwrite option.)

However, on some runs the old data is not deleted: I see files from an older date alongside the new files under the same partition. I suspect this is related to runs that failed midway due to memory issues and left behind partial files that the next run did not clean up, but I haven't been able to reproduce it yet.
Solution 1:[1]
Setting

```python
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
```

makes an overwrite replace only the partitions that appear in the incoming DataFrame, keeping all other existing partitions under the path. If you instead want the overwrite to delete everything under the path, leave this configuration unset (the default is "static"). (Tested on Spark 2.4.4.)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Learn Hadoop |
