'Pyspark update record based on last value using timestamp and column value
I'm struggling to figure this out. I need to find the last record with reason backfill and update the non backfill record with the greatest timestamp.
Here is what I've tried -
w = Window.orderBy("idx")
w1 = Window.partitionBy('reason').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df_uahr.withColumn('idx',F.monotonically_increasing_id()).withColumn("app_data_new",F.last(F.lead("app_data").over(w)).over(w1)).orderBy("idx").show()
+----------------------+-------------+-------------------+-------------------+------------+---+------------+
|upstart_application_id| reason| created_at| updated_at| app_data|idx|app_data_new|
+----------------------+-------------+-------------------+-------------------+------------+---+------------+
| 2|disqualified |2018-07-12 15:57:26|2018-07-12 15:57:26| app_data_a| 0| app_data_c|
| 2| backfill|2020-05-29 17:47:09|2021-05-29 17:47:09| app_data_c| 1| null|
| 2| backfill|2022-03-09 09:47:09|2022-03-09 09:47:09| app_data_d| 2| null|
| 2| test|2022-04-09 09:47:09|2022-04-09 09:47:09| app_data_e| 3| app_data_f|
| 2| test|2022-04-19 09:47:09|2022-04-19 09:47:09|app_data_e_a| 4| app_data_f|
| 2| backfill|2022-05-09 09:47:09|2022-05-09 09:47:09| app_data_f| 5| null|
| 2| after|2023-04-09 09:47:09|2023-04-09 09:47:09| app_data_g| 6| app_data_h|
| 2| backfill|2023-05-09 09:47:09|2023-05-09 09:47:09| app_data_h| 7| null|
+----------------------+-------------+-------------------+-------------------+------------+---+------------+
Expected value
+----------------------+-------------+-------------------+-------------------+------------+---+------------+
|upstart_application_id| reason| created_at| updated_at| app_data|idx|app_data_new|
+----------------------+-------------+-------------------+-------------------+------------+---+------------+
| 2|disqualified |2018-07-12 15:57:26|2018-07-12 15:57:26| app_data_a| 0| app_data_d|
| 2| backfill|2020-05-29 17:47:09|2021-05-29 17:47:09| app_data_c| 1| null|
| 2| backfill|2022-03-09 09:47:09|2022-03-09 09:47:09| app_data_d| 2| null|
| 2| test|2022-04-09 09:47:09|2022-04-09 09:47:09| app_data_e| 3| null|
| 2| test|2022-04-19 09:47:09|2022-04-19 09:47:09|app_data_e_a| 4| app_data_f|
| 2| backfill|2022-05-09 09:47:09|2022-05-09 09:47:09| app_data_f| 5| null|
| 2| after|2023-04-09 09:47:09|2023-04-09 09:47:09| app_data_g| 6| app_data_h|
| 2| backfill|2023-05-09 09:47:09|2023-05-09 09:47:09| app_data_h| 7| null|
+----------------------+-------------+-------------------+-------------------+------------+---+------------+
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
