How to partition and write a DataFrame in Spark without duplicated data?
I have a script that runs every day, and its result DataFrame is partitioned by the script's run date. Is there a way to write each day's results into a Parquet table without duplicating data, regardless of the run date?
For example, on the first day's run I get this result and write it into a table:
+----+----+----------+
|colA|colB| run_date|
+----+----+----------+
| A| 1|2022-01-01|
| B| 2|2022-01-01|
| C| 3|2022-01-01|
+----+----+----------+
df.write.format("hive").mode("append").partitionBy("run_date").saveAsTable({table_name})
On the second day's run I get:
+----+----+----------+
|colA|colB| run_date|
+----+----+----------+
| A| 1|2022-01-02|
| B| 2|2022-01-02|
| D| 4|2022-01-02|
+----+----+----------+
This time I only want to write row 3 into the results table, because rows 1 and 2 already exist in the target table.
Can anyone help with this?
Solution 1:[1]
You can take the difference of the two DataFrames on colA and colB, and then write only the remaining rows.
# df1: rows already in the target table, df2: rows from today's run
df1.createOrReplaceTempView('tmp1')
df2.createOrReplaceTempView('tmp2')

# Keep only the rows of tmp2 that have no match in tmp1 on colA and colB
sql = """
select * from tmp2
where not exists (
    select 1 from tmp1
    where tmp1.colA = tmp2.colA and tmp1.colB = tmp2.colB)
"""
df = spark.sql(sql)
df.show()
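The same idea can be expressed end to end with the DataFrame API. The following is a minimal sketch, not part of the original answer: it assumes the target Hive table is named my_results (a placeholder), that today's results are in a DataFrame called df_today, and that (colA, colB) is the business key that defines a duplicate. A left_anti join is equivalent to the NOT EXISTS query above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Placeholder names: "my_results" is the target table, df_today is today's result.
existing = spark.table("my_results").select("colA", "colB")

# left_anti keeps only rows of df_today that have no match in `existing`
# on colA and colB, i.e. the genuinely new rows.
new_rows = df_today.join(existing, on=["colA", "colB"], how="left_anti")

# Append only the new rows, still partitioned by run_date, as in the question.
(new_rows.write
    .format("hive")
    .mode("append")
    .partitionBy("run_date")
    .saveAsTable("my_results"))

Note that this reads the whole target table to build `existing`; if duplicates can only come from recent runs, filtering the existing table to the relevant run_date partitions before the join keeps the comparison cheap.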
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | 过过招 |
