How to use Window.unboundedPreceding, Window.unboundedFollowing on distinct datetimes

I have data like below

|Id |DateTime                     |products|
|---|-----------------------------|--------|
|1  |2017-08-24T00:00:00.000+0000 |1       |
|1  |2017-08-24T00:00:00.000+0000 |2       |
|1  |2017-08-24T00:00:00.000+0000 |3       |
|1  |2016-05-24T00:00:00.000+0000 |1       |

I am using Window.unboundedPreceding, Window.unboundedFollowing as below to get the second most recent datetime:

sorted_times = Window.partitionBy('Id').orderBy(F.col('DateTime').desc()).rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df3 = data.withColumn("second_recent", F.collect_list(F.col('DateTime')).over(sorted_times).getItem(1))

But I get the results below: the second datetime is taken from the second row, which has the same value as the first row.

|Id |DateTime                     |second_recent                |products|
|---|-----------------------------|-----------------------------|--------|
|1  |2017-08-24T00:00:00.000+0000 |2017-08-24T00:00:00.000+0000 |1       |
|1  |2017-08-24T00:00:00.000+0000 |2017-08-24T00:00:00.000+0000 |2       |
|1  |2017-08-24T00:00:00.000+0000 |2017-08-24T00:00:00.000+0000 |3       |
|1  |2016-05-24T00:00:00.000+0000 |2017-08-24T00:00:00.000+0000 |1       |

Please help me find the second latest datetime among the distinct datetimes. Thanks in advance.



Solution 1:[1]

Use collect_set instead of collect_list so the collected array contains no duplicates:

df3 = data.withColumn(
    "second_recent",
    F.collect_set(F.col('DateTime')).over(sorted_times)[1]
)

df3.show(truncate=False)
#+---+----------------------------+--------+----------------------------+
#|Id |DateTime                    |products|second_recent               |
#+---+----------------------------+--------+----------------------------+
#|1  |2017-08-24T00:00:00.000+0000|1       |2016-05-24T00:00:00.000+0000|
#|1  |2017-08-24T00:00:00.000+0000|2       |2016-05-24T00:00:00.000+0000|
#|1  |2017-08-24T00:00:00.000+0000|3       |2016-05-24T00:00:00.000+0000|
#|1  |2016-05-24T00:00:00.000+0000|1       |2016-05-24T00:00:00.000+0000|
#+---+----------------------------+--------+----------------------------+
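The difference between the two collectors is easy to see in plain Python. This is only a sketch of the semantics, not Spark code; the `times` list mirrors the example data:

```python
# Timestamps from the example data (ISO strings sort correctly as text).
times = [
    "2017-08-24T00:00:00.000+0000",
    "2017-08-24T00:00:00.000+0000",
    "2017-08-24T00:00:00.000+0000",
    "2016-05-24T00:00:00.000+0000",
]

# collect_list keeps duplicates: index 1 is still the most recent value.
as_list = sorted(times, reverse=True)
print(as_list[1])  # 2017-08-24T00:00:00.000+0000

# collect_set drops duplicates: index 1 is the second distinct datetime.
as_set = sorted(set(times), reverse=True)
print(as_set[1])  # 2016-05-24T00:00:00.000+0000
```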

Another way is to use an unordered window and sort the collected array in descending order before taking the second element:

from pyspark.sql import functions as F, Window

df3 = data.withColumn(
    "second_recent",
    F.sort_array(
        F.collect_set(F.col('DateTime')).over(Window.partitionBy('Id')),
        asc=False
    )[1]
)
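One edge case worth noting: if a partition contains only one distinct datetime, element `[1]` does not exist (Spark returns null for an out-of-range array index). A plain-Python sketch of the same lookup with that case handled explicitly (the helper name is illustrative, not from the original answer):

```python
def second_recent(times):
    """Return the second most recent distinct timestamp, or None if absent."""
    distinct_desc = sorted(set(times), reverse=True)
    return distinct_desc[1] if len(distinct_desc) > 1 else None

print(second_recent([
    "2017-08-24T00:00:00.000+0000",
    "2016-05-24T00:00:00.000+0000",
]))  # 2016-05-24T00:00:00.000+0000
print(second_recent(["2017-08-24T00:00:00.000+0000"]))  # None
```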

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 blackbishop