Compare two dataframes in PySpark and change column value

I have two pyspark dataframes like this: df1:

+--------+-----------+
|language|users_count|
+--------+-----------+
|    Java|      20000|
|  Python|     100000|
|   Scala|       3000|
+--------+-----------+

df2:

+------------+---+
|src_language|abb|
+------------+---+
|        Java|  J|
|      Python|  P|
|       Scala|  S|
+------------+---+

I want to compare these two dataframes and replace each language value in df1 with the matching abb value from df2. So the output will be:

+--------+-----------+
|language|users_count|
+--------+-----------+
|       J|      20000|
|       P|     100000|
|       S|       3000|
+--------+-----------+

How can I achieve this?



Solution 1:[1]

You can join the two dataframes on the language name, select abb in place of language, and then rename abb to language to get the required output.

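# Sample data
# Note: df and df1 below correspond to df1 and df2 in the question.

from pyspark.sql import SparkSession

# Reuses the active session if one already exists (e.g. in a notebook)
spark = SparkSession.builder.getOrCreate()

columns = ['language', 'users_count']
data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)

columns1 = ['src_language', 'abb']
data1 = [("Java", "J"), ("Python", "P"), ("Scala", "S")]
rdd1 = spark.sparkContext.parallelize(data1)
df1 = rdd1.toDF(columns1)

# Join the dataframes on the language name, keep only abb and users_count,
# then rename abb to language. An inner join drops rows without a match.

df2 = (
    df.join(df1, df.language == df1.src_language, "inner")
    .select("abb", "users_count")
    .withColumnRenamed("abb", "language")
)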

Calling show() or display() on df2 then prints the abbreviated language column alongside users_count, matching the required output in the question.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Nikunj Kakadiya