PySpark: Query self DataFrame to create a new column
I have the following dataset in PySpark:
| Id | Sub |
|---|---|
| 1 | Mat |
| 1 | Phy |
| 1 | Sci |
| 2 | Bio |
| 2 | Phy |
| 2 | Sci |
I want to create a DataFrame similar to the one below:
| Id | Sub | HasMath |
|---|---|---|
| 1 | Mat | 1 |
| 1 | Phy | 1 |
| 1 | Sci | 1 |
| 2 | Bio | 0 |
| 2 | Phy | 0 |
| 2 | Sci | 0 |
How do I do this in PySpark? I tried:

```python
def hasMath(studentID, df):
    return df.filter(col('Id') == studentID & col('Sub') == 'Mat').count()

df = df.withColumn("hasMath", hasMath(F.col('Id'), df1))
```

But this doesn't seem to work. Is there a better way to achieve this?
Solution 1:[1]
You can use `collect_list` over a window, together with `expr` and the SQL higher-order `filter` function (Spark 2.4+), to collect the list of subjects for each `Id` and keep only the entries equal to `Mat`.
The `size` function then gives the count of that filtered array.
```python
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

w = Window().partitionBy("Id")

df.withColumn("list", F.collect_list(col("Sub")).over(w))\
  .withColumn("hasMath", F.size(F.expr("filter(list, x -> x == 'Mat')")))\
  .drop("list").show()
```
Output:

```
+---+---+-------+
| Id|Sub|hasMath|
+---+---+-------+
|  1|Phy|      1|
|  1|Mat|      1|
|  1|Sci|      1|
|  2|Phy|      0|
|  2|Bio|      0|
|  2|Sci|      0|
+---+---+-------+
```
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Cena |
