PySpark: writing in 'append' mode and overwrite if certain criteria match

I am appending the following Spark dataframe to an existing Redshift table. I want to use 'month' and 'state' as the criteria to check, and replace data in the Redshift table where month = '2021-12' and state = 'ga'.

The dataframe we want to append, df:

month    state  product
2021-12  ca     ho0
2021-12  ca     ho1
2021-12  ca     ho2
2021-12  ca     ho3
2021-12  ca     ho4
2021-12  ga     ho5
2021-12  ga     ho6
2021-12  ga     ho7
2021-12  ga     ho8
2021-12  ga     ho9

I've tried the following script to append it, but it only appends the df dataframe without replacing (overwriting) the existing records for month '2021-12' and state 'ga'.

df.write \
  .format("xxx") \
  .option("url", "xxx") \
  .option("dbtable", "table1") \
  .option("tempdir", "xxxx") \
  .option("aws_iam_role", "xxxx") \
  .mode("append") \
  .option("replaceWhere", "month == '2021-12' AND state == 'ga'") \
  .save()

I guess .option("replaceWhere", "month == '2021-12' AND state == 'ga'") is not working. How can I make this work? Thanks! (I also tried the following; with it, all existing records are gone, replaced entirely by df.)

  .mode("overwrite") \
  .option("replaceWhere", "month == '2021-12' AND state == 'ga'") \


Solution 1:[1]

The replaceWhere option works much like a dynamic partition overwrite: you are telling Spark to overwrite only the data in the partitions that match the condition, so it must be used with mode("overwrite"), not mode("append"). In addition, the data is saved only if every row of your dataframe matches the replaceWhere condition; if even a single row does not match, the exception "Data written out does not match replaceWhere" is thrown. Note that replaceWhere is a Delta Lake option, so it has no effect on the Redshift writer. If you want to handle the same scenario within a partition, you can use MERGE INTO.
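To make the semantics described above concrete, here is a minimal plain-Python sketch of what replaceWhere does: only rows matching the predicate are overwritten, and the write is rejected if any incoming row falls outside the predicate. This is an illustration of the behaviour, not the Delta Lake implementation; the function and data names are hypothetical.

```python
def replace_where(existing, incoming, predicate):
    """Overwrite only the rows of `existing` that match `predicate`.

    Mirrors the described Delta behaviour: every incoming row must
    satisfy the predicate, otherwise the write is rejected.
    """
    if not all(predicate(row) for row in incoming):
        raise ValueError("Data written out does not match replaceWhere")
    # Keep the rows outside the predicate, replace the rest wholesale.
    kept = [row for row in existing if not predicate(row)]
    return kept + incoming

table = [
    {"month": "2021-12", "state": "ga", "product": "old5"},
    {"month": "2021-12", "state": "ca", "product": "keep"},
]
new_rows = [{"month": "2021-12", "state": "ga", "product": "ho5"}]

pred = lambda r: r["month"] == "2021-12" and r["state"] == "ga"
result = replace_where(table, new_rows, pred)
# the 'ca' row survives; the old 'ga' row is replaced by new_rows
```

If new_rows also contained, say, a '2021-11' row, the call would raise instead of writing, which is exactly the exception the answer warns about.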

Solution 2:[2]

Spark cannot merge data into Redshift directly. Since we are using Databricks, it provides Delta Lake as an environment to do the data upsert/merge; we then write the table from Delta Lake to Redshift. See https://docs.databricks.com/delta/delta-update.html#language-python
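The MERGE the linked page documents is a keyed upsert: source rows whose key matches an existing target row update it, and the rest are inserted. A minimal plain-Python sketch of those semantics, as an illustration only (the helper name and sample data are hypothetical, not the Delta API):

```python
def merge_upsert(target, source, key_cols):
    """Upsert `source` into `target` (lists of dict rows) on `key_cols`."""
    key = lambda row: tuple(row[c] for c in key_cols)
    merged = {key(row): row for row in target}  # existing rows, by key
    for row in source:
        merged[key(row)] = row                  # matched -> update, else insert
    return list(merged.values())

target = [{"month": "2021-12", "state": "ga", "product": "old"},
          {"month": "2021-11", "state": "ca", "product": "keep"}]
source = [{"month": "2021-12", "state": "ga", "product": "ho5"},
          {"month": "2022-01", "state": "ny", "product": "new"}]

result = merge_upsert(target, source, ["month", "state"])
# the matched ga row is updated, the ny row inserted, the ca row untouched
```

In Delta Lake proper this is expressed with MERGE INTO (or the DeltaTable Python API with whenMatched/whenNotMatched clauses); the resulting Delta table is then written out to Redshift as the answer describes.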

Solution 3:[3]

When working with Redshift, follow the link below. It covers a little more than you need, but it is applicable: it shows how to simulate an UPSERT with a staging table. The link is http://www.silota.com/blog/amazon-redshift-upsert-support-staging-table-replace-rows/

Even plain inserting into Redshift is not really a Spark operation; I have posted on this in the past. The standard approach is COPY from S3.
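The staging-table pattern the linked article describes boils down to: load the new data into a staging table, then in one transaction delete the target rows the new data replaces and insert everything from staging. As a hedged, runnable illustration, here is that pattern against SQLite (so it runs anywhere); on Redshift the staging table would be loaded with COPY from S3 rather than INSERTs, and the SQL would run through your Redshift client.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE table1 (month TEXT, state TEXT, product TEXT)")
cur.executemany("INSERT INTO table1 VALUES (?, ?, ?)",
                [("2021-12", "ga", "old5"), ("2021-12", "ca", "ho0")])

# 1. Load the incoming data into a staging table
#    (on Redshift: COPY staging FROM 's3://...' ...).
cur.execute("CREATE TEMP TABLE staging (month TEXT, state TEXT, product TEXT)")
cur.executemany("INSERT INTO staging VALUES (?, ?, ?)",
                [("2021-12", "ga", "ho5"), ("2021-12", "ga", "ho6")])

# 2. Delete the target rows that the staged data replaces,
#    then insert everything from staging, in one transaction.
cur.execute("""
    DELETE FROM table1
    WHERE EXISTS (SELECT 1 FROM staging s
                  WHERE s.month = table1.month AND s.state = table1.state)
""")
cur.execute("INSERT INTO table1 SELECT * FROM staging")
conn.commit()

rows = sorted(cur.execute("SELECT * FROM table1").fetchall())
# the 'ca' row is kept; the old 'ga' row is replaced by the staged rows
```

This matches the asker's goal: only the (month='2021-12', state='ga') rows are replaced, everything else is left alone.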

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Karthikeyan Rasipalay Durairaj
Solution 2: Crubal Chenxi Li
Solution 3: thebluephantom