Compare two dataframes in PySpark and change column value
I have two PySpark dataframes like this:
df1:
+--------+-----------+
|language|users_count|
+--------+-----------+
|    Java|      20000|
|  Python|     100000|
|   Scala|       3000|
+--------+-----------+
df2:
+------------+---+
|src_language|abb|
+------------+---+
|        Java|  J|
|      Python|  P|
|       Scala|  S|
+------------+---+
I want to compare these two dataframes and replace each value in the language column of df1
with the matching abb value from df2. So the output will be:
+--------+-----------+
|language|users_count|
+--------+-----------+
|       J|      20000|
|       P|     100000|
|       S|       3000|
+--------+-----------+
How can I achieve this?
Solution 1:[1]
You can join the two dataframes on the language name, select the abb column together with users_count, and then rename abb to language to get the required output.
# Sample data (here df and df1 play the roles of df1 and df2 from the question)
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
columns = ['language', 'users_count']
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)
columns1 = ['src_language', 'abb']
data1 = [("Java", "J"), ("Python", "P"), ("Scala", "S")]
rdd1 = spark.sparkContext.parallelize(data1)
df1 = rdd1.toDF(columns1)
# Join on the language name, keep only abb and users_count, then rename abb to language
df2 = (
    df.join(df1, df.language == df1.src_language, "inner")
    .select("abb", "users_count")
    .withColumnRenamed("abb", "language")
)
Once you call show() or display() on df2, the language column contains J, P and S next to the original users_count values. Note that a join does not guarantee row order.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Nikunj Kakadiya |