'Pyspark update record based on last value using timestamp and column value

I'm struggling to figure this out. I need to find the last record with reason backfill and update the non backfill record with the greatest timestamp.

Here is what I've tried -

w = Window.orderBy("idx")
w1 = Window.partitionBy('reason').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df_uahr.withColumn('idx',F.monotonically_increasing_id()).withColumn("app_data_new",F.last(F.lead("app_data").over(w)).over(w1)).orderBy("idx").show()

+----------------------+-------------+-------------------+-------------------+------------+---+------------+
|upstart_application_id|       reason|         created_at|         updated_at|    app_data|idx|app_data_new|
+----------------------+-------------+-------------------+-------------------+------------+---+------------+
|                     2|disqualified |2018-07-12 15:57:26|2018-07-12 15:57:26|  app_data_a|  0|  app_data_c|
|                     2|     backfill|2020-05-29 17:47:09|2021-05-29 17:47:09|  app_data_c|  1|        null|
|                     2|     backfill|2022-03-09 09:47:09|2022-03-09 09:47:09|  app_data_d|  2|        null|
|                     2|         test|2022-04-09 09:47:09|2022-04-09 09:47:09|  app_data_e|  3|  app_data_f|
|                     2|         test|2022-04-19 09:47:09|2022-04-19 09:47:09|app_data_e_a|  4|  app_data_f|
|                     2|     backfill|2022-05-09 09:47:09|2022-05-09 09:47:09|  app_data_f|  5|        null|
|                     2|        after|2023-04-09 09:47:09|2023-04-09 09:47:09|  app_data_g|  6|  app_data_h|
|                     2|     backfill|2023-05-09 09:47:09|2023-05-09 09:47:09|  app_data_h|  7|        null|
+----------------------+-------------+-------------------+-------------------+------------+---+------------+

Expected value

+----------------------+-------------+-------------------+-------------------+------------+---+------------+
|upstart_application_id|       reason|         created_at|         updated_at|    app_data|idx|app_data_new|
+----------------------+-------------+-------------------+-------------------+------------+---+------------+
|                     2|disqualified |2018-07-12 15:57:26|2018-07-12 15:57:26|  app_data_a|  0|  app_data_d|
|                     2|     backfill|2020-05-29 17:47:09|2021-05-29 17:47:09|  app_data_c|  1|        null|
|                     2|     backfill|2022-03-09 09:47:09|2022-03-09 09:47:09|  app_data_d|  2|        null|
|                     2|         test|2022-04-09 09:47:09|2022-04-09 09:47:09|  app_data_e|  3|        null|
|                     2|         test|2022-04-19 09:47:09|2022-04-19 09:47:09|app_data_e_a|  4|  app_data_f|
|                     2|     backfill|2022-05-09 09:47:09|2022-05-09 09:47:09|  app_data_f|  5|        null|
|                     2|        after|2023-04-09 09:47:09|2023-04-09 09:47:09|  app_data_g|  6|  app_data_h|
|                     2|     backfill|2023-05-09 09:47:09|2023-05-09 09:47:09|  app_data_h|  7|        null|
+----------------------+-------------+-------------------+-------------------+------------+---+------------+


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source