I need to compare two CSV files using PySpark

The output Spark DataFrame should contain all the rows from both DataFrames plus a new Boolean column named common_row. This column should be true or false depending on whether the rows are equal.

Csv1
---
Name
John

Csv2
-----
Name
Arun

Df3
------
Name    Name    Common_Row
John    Arun    False

Solution 1:^[1]

Try the logic below:

Input Data

df1 = spark.createDataFrame(data = [('John',)], schema = ['Name',])
df2 = spark.createDataFrame(data = [('Arun',)], schema = ['Name',])

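from pyspark.sql.functions import col, first, last

# A full outer join keeps the rows of both DataFrames. Unmatched rows carry a
# null on the other side, so ignorenulls=True picks the real value from each.
# Note: this collapses the result to one row, which suits the single-row example.
(df2.join(df1, df2['Name'] == df1['Name'], how='full')
    .select(last(df1['Name'], ignorenulls=True).alias('df1_Name'),
            first(df2['Name'], ignorenulls=True).alias('df2_Name'))
    # Compare the two values instead of hard-coding False
    .withColumn('Common_Row', col('df1_Name') == col('df2_Name'))
    .show())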

#Output 

+--------+--------+----------+
|df1_Name|df2_Name|Common_Row|
+--------+--------+----------+
|    John|    Arun|     false|
+--------+--------+----------+

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	DKNY
