I need to compare two CSV files using PySpark.
The output Spark DataFrame should contain all the rows from both DataFrames plus a new Boolean column named common_row. This column is true or false depending on whether the rows are equal.
Csv1
---
Name
John
Csv2
-----
Name
Arun
Df3
------
Name    Name    Common_Row
John    Arun    False
Solution 1:[1]
Try the logic below:
Input Data
df1 = spark.createDataFrame(data = [('John',)], schema = ['Name',])
df2 = spark.createDataFrame(data = [('Arun',)], schema = ['Name',])
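from pyspark.sql.functions import col, first, last
# A full outer join keeps the unmatched row from each side; ignorenulls=True skips the
# NULL padding. The result is a single summary row, so this only suits one-row inputs.
joined = df2.join(df1, df2['Name'] == df1['Name'], how='full')
names = joined.select(last(df1['Name'], ignorenulls=True).alias('df1_Name'), first(df2['Name'], ignorenulls=True).alias('df2_Name'))
names.withColumn('Common_Row', col('df1_Name') == col('df2_Name')).show()  # flag is computed, not hard-coded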
#Output
+--------+--------+----------+
|df1_Name|df2_Name|Common_Row|
+--------+--------+----------+
|    John|    Arun|     false|
+--------+--------+----------+
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | DKNY |
