Writing data overwrites existing partitions
I have a Spark job that writes Parquet data to S3.
val partitionCols = Seq("date", "src")

df.coalesce(10)
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy(partitionCols: _*)
  .parquet(params.outputPathParquet)
When I run the job on EMR, it overwrites all existing partitions under the output path and writes the new data to S3.
e.g. the data looks like this:
s3://foo/date=2021-01-01/src=X
s3://foo/date=2021-11-01/src=X
s3://foo/date=2021-10-01/src=X
where
params.outputPathParquet = s3://foo
When I run the job for another day, e.g. 2021-01-02, it replaces all existing partitions and the data looks like the following:
s3://foo/date=2021-01-02/src=X
Any ideas what might be happening?
Solution 1:[1]
If you just need to append data, you can change the SaveMode:
.mode(SaveMode.Append)
If you need to overwrite specific partitions only, take a look at this question: Overwrite specific partitions in spark dataframe write method
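To explain the behavior in the question: with the default setting spark.sql.sources.partitionOverwriteMode=static, SaveMode.Overwrite truncates the entire output path before writing, not just the partitions present in the DataFrame. Since Spark 2.3 you can switch that setting to dynamic so only the partitions that appear in the current DataFrame are replaced. A minimal sketch, assuming the same df, partitionCols, and params.outputPathParquet from the question and an existing SparkSession named spark:

```scala
// Spark >= 2.3: with dynamic partition overwrite, only the
// date=/src= partitions present in df are replaced; other
// partitions already under s3://foo are left untouched.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df.coalesce(10)
  .write
  .mode(SaveMode.Overwrite)
  .partitionBy(partitionCols: _*)
  .parquet(params.outputPathParquet)
```

The setting can also be supplied at submit time (--conf spark.sql.sources.partitionOverwriteMode=dynamic) if you prefer not to change it in code.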
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Rafael Sakurai |
