PySpark: Query self DataFrame to create a new column

I have the following dataset in pyspark

Id Sub
1 Mat
1 Phy
1 Sci
2 Bio
2 Phy
2 Sci

I want to create a df similar to the one below

Id Sub HaMat
1 Mat 1
1 Phy 1
1 Sci 1
2 Bio 0
2 Phy 0
2 Sci 0

How do I do this in PySpark? I tried:

def hasMath(studentID,df):
    return df.filter(col('Id') == studentID & col('sub') = 'Mat' ).count()

df = df.withColumn("hasMath",hasMath(F.col('id'),df1))

But this doesn't seem to work. Is there a better way to achieve this?



Solution 1:[1]

You can use collect_list over a window together with expr (Spark 2.4+) to collect the list of subjects for each Id and filter that list for 'Mat'.

The size function then returns the length of the filtered array, i.e. the count of 'Mat' entries for that Id.

from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

w = Window().partitionBy("Id")

df.withColumn("list", F.collect_list(col("Sub")).over(w))\
  .withColumn("hasMath", F.size(F.expr("filter(list, x -> x == 'Mat')")))\
  .drop("list").show()

Output:

+---+---+-------+
| Id|Sub|hasMath|
+---+---+-------+
|  1|Phy|      1|
|  1|Mat|      1|
|  1|Sci|      1|
|  2|Phy|      0|
|  2|Bio|      0|
|  2|Sci|      0|
+---+---+-------+

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Cena