PySpark: Query self DataFrame to create a new column
I have the following dataset in PySpark:
| Id | Sub |
|---|---|
| 1 | Mat |
| 1 | Phy |
| 1 | Sci |
| 2 | Bio |
| 2 | Phy |
| 2 | Sci |
I want to create a DataFrame similar to the one below:
| Id | Sub | HasMath |
|---|---|---|
| 1 | Mat | 1 |
| 1 | Phy | 1 |
| 1 | Sci | 1 |
| 2 | Bio | 0 |
| 2 | Phy | 0 |
| 2 | Sci | 0 |
How do I do this in PySpark? I tried:

```python
def hasMath(studentID, df):
    return df.filter(col('Id') == studentID & col('Sub') == 'Mat').count()

df = df.withColumn("hasMath", hasMath(F.col('Id'), df1))
```

But this doesn't seem to work. Is there a better way to achieve this?
Solution 1:[1]
You can use `collect_list` over a window, together with `expr` and the SQL higher-order `filter` function (Spark 2.4+), to collect the list of subjects for each `Id` and keep only the entries equal to `Mat`.
The `size` function then gives the count of that filtered array.
```python
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

w = Window().partitionBy("Id")

df.withColumn("list", F.collect_list(col("Sub")).over(w))\
  .withColumn("hasMath", F.size(F.expr("filter(list, x -> x == 'Mat')")))\
  .drop("list").show()
```
Output:

```
+---+---+-------+
| Id|Sub|hasMath|
+---+---+-------+
|  1|Phy|      1|
|  1|Mat|      1|
|  1|Sci|      1|
|  2|Phy|      0|
|  2|Bio|      0|
|  2|Sci|      0|
+---+---+-------+
```
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Cena |
