Writing data overwrites existing partitions
I have a Spark job that writes Parquet data to S3.
val partitionCols = Seq("date", "src")

df.coalesce(10)
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy(partitionCols: _*)
  .parquet(params.outputPathParquet)
When I run the job on EMR, it overwrites all existing partitions under the output path and writes the new data to S3.
e.g. the data looks like this:
s3://foo/date=2021-01-01/src=X
s3://foo/date=2021-11-01/src=X
s3://foo/date=2021-10-01/src=X
where
params.outputPathParquet = s3://foo
When I run the job for another day, e.g. 2021-01-02, it replaces all existing partitions and the data looks like the following:
s3://foo/date=2021-01-02/src=X
Any ideas what might be happening?
Solution 1:[1]
If you just need to append data, you can change the SaveMode:
.mode(SaveMode.Append)
If you need to overwrite specific partitions only, take a look at this question: Overwrite specific partitions in spark dataframe write method
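To explain the behavior in the question: with the default setting spark.sql.sources.partitionOverwriteMode=static, SaveMode.Overwrite truncates the entire output path before writing, not just the partitions present in the DataFrame. Since Spark 2.3 you can switch that setting to dynamic so only the partitions that appear in the current DataFrame are replaced. A minimal sketch, assuming the same df, partitionCols, and params.outputPathParquet from the question and an existing SparkSession named spark:

```scala
// Spark >= 2.3: with dynamic partition overwrite, only the
// date=/src= partitions present in df are replaced; other
// partitions already under s3://foo are left untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df.coalesce(10)
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy(partitionCols: _*)
  .parquet(params.outputPathParquet)
```

The setting can also be supplied at submit time (--conf spark.sql.sources.partitionOverwriteMode=dynamic) if you prefer not to change it in code.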
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Rafael Sakurai |
