Spark Windowing Functions: Refer to values in a previous window to populate columns in the current window
I am trying to find a backup for an employee who is going to be on leave. All my backup options are in the format below. For any given day, an employee might have more than one candidate backup person, and I need to select one of them as the backup. While selecting a backup for another employee on the same day, I need to ensure that an already assigned backup person is not assigned again.
+-----------+--------+----------+
|backup_name|emp_name|leave_date|
+-----------+--------+----------+
|       Noah|    John|09/03/2022|
|      Peter|    John|09/03/2022|
|      Susan|    John|09/03/2022|
|       Noah|    Jane|09/03/2022|
|      Peter|    Jane|09/03/2022|
|      Susan|    Jane|09/03/2022|
|       Noah|    John|10/03/2022|
|    Stephen|    John|10/03/2022|
|      Peter|    Jane|10/03/2022|
|      Susan|    Jane|10/03/2022|
|        Joe|  Ashley|10/03/2022|
+-----------+--------+----------+
My expected result is something like below. (Note: on 09/03/2022, Jane's backup is Peter because Noah is already assigned to John.)
+-----------+--------+----------+--------+
|backup_name|emp_name|leave_date|isBackUp|
+-----------+--------+----------+--------+
|       Noah|    John|09/03/2022|       Y|
|      Peter|    John|09/03/2022|        |
|      Susan|    John|09/03/2022|        |
|       Noah|    Jane|09/03/2022|        |
|      Peter|    Jane|09/03/2022|       Y|
|      Susan|    Jane|09/03/2022|        |
|       Noah|    John|10/03/2022|       Y|
|    Stephen|    John|10/03/2022|        |
|      Peter|    Jane|10/03/2022|       Y|
|      Susan|    Jane|10/03/2022|        |
|        Joe|  Ashley|10/03/2022|       Y|
+-----------+--------+----------+--------+
I tried partitioning the DataFrame by leave_date and emp_name, selecting the first person in each partition with the help of row_number(), and setting the bkup column to null for the remaining rows in the same window.
from pyspark.sql.functions import col, first, row_number, when
from pyspark.sql.window import Window

data = [{"emp_name":"John","leave_date":"09/03/2022","backup_name":"Noah"},{"emp_name":"John","leave_date":"09/03/2022","backup_name":"Peter"},{"emp_name":"John","leave_date":"09/03/2022","backup_name":"Susan"},{"emp_name":"Jane","leave_date":"09/03/2022","backup_name":"Noah"},{"emp_name":"Jane","leave_date":"09/03/2022","backup_name":"Peter"},{"emp_name":"Jane","leave_date":"09/03/2022","backup_name":"Susan"},{"emp_name":"John","leave_date":"10/03/2022","backup_name":"Noah"},{"emp_name":"John","leave_date":"10/03/2022","backup_name":"Stephen"},{"emp_name":"Jane","leave_date":"10/03/2022","backup_name":"Peter"},{"emp_name":"Jane","leave_date":"10/03/2022","backup_name":"Susan"},{"emp_name":"Ashley","leave_date":"10/03/2022","backup_name":"Joe"}]
df = spark.createDataFrame(data)
df.show()

# Rank the candidate backups within each (leave_date, emp_name) partition
overColumnsBackUp = Window.partitionBy("leave_date", "emp_name").orderBy("backup_name")
# Running frame over the same partition, from the first row up to the current row
overColumnsMain = Window.partitionBy("leave_date", "emp_name").orderBy("emp_name").rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Keep the first candidate per partition as the backup; null out the rest
df_new = df.withColumn("backup_rownumber", row_number().over(overColumnsBackUp)) \
           .withColumn("bkup", when(col("backup_rownumber") == 1, first("backup_name", ignorenulls=True).over(overColumnsMain)).otherwise(None))
df_new.show()
+-----------+--------+----------+----------------+-----+
|backup_name|emp_name|leave_date|backup_rownumber| bkup|
+-----------+--------+----------+----------------+-----+
|       Noah|    Jane|09/03/2022|               1| Noah|
|      Peter|    Jane|09/03/2022|               2| null|
|      Susan|    Jane|09/03/2022|               3| null|
|       Noah|    John|09/03/2022|               1| Noah|  <-- Noah wrongly assigned twice on the same day
|      Peter|    John|09/03/2022|               2| null|
|      Susan|    John|09/03/2022|               3| null|
|        Joe|  Ashley|10/03/2022|               1|  Joe|
|      Peter|    Jane|10/03/2022|               1|Peter|
|      Susan|    Jane|10/03/2022|               2| null|
|       Noah|    John|10/03/2022|               1| Noah|
|    Stephen|    John|10/03/2022|               2| null|
+-----------+--------+----------+----------------+-----+
So, in order to restrict Noah from appearing twice in the backup list on the same day, I think I need a new window partitioned only by leave_date plus some way to check dynamically whether Noah has already been assigned. I am looking for better ways to approach this problem; would writing a dedicated UDF be a good option here?
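For reference, here is a minimal sketch of one direction I am weighing: instead of a plain window, process each leave_date group with a grouped pandas function and greedily mark, for each employee, the first candidate who has not been used yet that day. The helper name assign_backups is made up for illustration, and the order in which employees are visited is an open policy choice, so this sketch may pick different names than the expected output above.

import pandas as pd

def assign_backups(pdf: pd.DataFrame) -> pd.DataFrame:
    taken = set()          # backups already assigned on this leave_date
    covered = set()        # employees that already have a backup
    flags = []
    # Visiting order decides who gets which backup; adjust the sort to your tie-breaking rule
    pdf = pdf.sort_values(["emp_name", "backup_name"])
    for _, row in pdf.iterrows():
        if row["emp_name"] not in covered and row["backup_name"] not in taken:
            taken.add(row["backup_name"])
            covered.add(row["emp_name"])
            flags.append("Y")
        else:
            flags.append("")
    pdf["isBackUp"] = flags
    return pdf

# applyInPandas runs the function once per leave_date group (Spark 3.0+)
result = df.groupBy("leave_date").applyInPandas(
    assign_backups,
    schema="backup_name string, emp_name string, leave_date string, isBackUp string",
)
result.show()

The trade-off is that this sidesteps window functions entirely: the dependency between employees on the same day (a backup chosen for one employee removes an option for the next) is sequential, which is awkward to express with a window frame but straightforward inside a per-group loop.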
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
