How to partition and write DataFrame in Spark with no duplicated data?

I have a script that runs every day, and the resulting DataFrame is partitioned by the script's run date. Is there a way to write each day's results into a Parquet table without duplicated data, regardless of the run date?

For example, I got the result of the first day's run and wrote it into a table:

+----+----+----------+
|colA|colB|  run_date|
+----+----+----------+
|   A|   1|2022-01-01|
|   B|   2|2022-01-01|
|   C|   3|2022-01-01|
+----+----+----------+
df.write.format("hive").mode("append").partitionBy("run_date").saveAsTable(table_name)

And this is the result I got from the second day's run:

+----+----+----------+
|colA|colB|  run_date|
+----+----+----------+
|   A|   1|2022-01-02|
|   B|   2|2022-01-02|
|   D|   4|2022-01-02|
+----+----+----------+

This time I only want to write the third row into the results table, because the first and second rows already exist in the target table.
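
That is, after the second run the target table should contain:

+----+----+----------+
|colA|colB|  run_date|
+----+----+----------+
|   A|   1|2022-01-01|
|   B|   2|2022-01-01|
|   C|   3|2022-01-01|
|   D|   4|2022-01-02|
+----+----+----------+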

Can anyone help with this?



Solution 1:[1]

You can take the difference of the two DataFrames based on colA and colB, and then write the result.

# Register the existing data (df1) and the new day's result (df2) as temp views
df1.createOrReplaceTempView('tmp1')
df2.createOrReplaceTempView('tmp2')

# Keep only the rows of tmp2 whose (colA, colB) pair does not already exist in tmp1
sql = """
    select * from tmp2
    where not exists (
        select 1 from tmp1
        where tmp1.colA = tmp2.colA and tmp1.colB = tmp2.colB)
"""
df = spark.sql(sql)
df.show()
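
The filtered result can then be appended the same way as in the question, so only the genuinely new rows land in the table (a minimal sketch; table_name is a placeholder for your target table):

# Append only the new rows to the partitioned Hive table
df.write.format("hive").mode("append").partitionBy("run_date").saveAsTable(table_name)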

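As a side note, the same difference can also be expressed with the DataFrame API using a left anti join. This is only a sketch, assuming df1 holds the rows already written to the table and df2 holds the new day's run:

# Keep only the rows of df2 whose (colA, colB) pair has no match in df1
new_rows = df2.join(df1, on=["colA", "colB"], how="left_anti")
new_rows.show()

A left anti join keeps every column of df2, including run_date, so the result can be appended with the same write call shown above.
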
Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution source: [1] 过过招