SparkSQL - Difference between two timestamps in minutes, irrespective of seconds

I am trying to convert the SQL version of some code to PySpark. The requirement is to get the difference in minutes between two timestamps.

Inputs:

+-------------------+-------------------+
|date_time_closed   |date_time_opened   |
+-------------------+-------------------+
|2021-11-23 18:30:10|2021-11-23 18:25:46|
+-------------------+-------------------+

The required output is 5.

The SQL version of the code is:

avg(DATEDIFF(minute, date_time_opened, date_time_closed)) as avg_wait_time

Output: avg_wait_time: 5

(T-SQL's DATEDIFF(minute, ...) counts the minute boundaries crossed rather than full 60-second spans, so the seconds are ignored: 18:25:46 to 18:30:10 crosses the 18:26 through 18:30 boundaries, giving 5.)

Spark SQL: I have tried the Spark SQL expressions below to get the value, but they take the seconds into account as well.

  1. CAST(avg((unix_timestamp(date_time_closed) - unix_timestamp(date_time_opened)) / 60) AS INT) as avg_wait_time

Output: avg_wait_time: 4

  2. CAST(avg((bigint(to_timestamp(date_time_closed)) - bigint(to_timestamp(date_time_opened))) / 60) AS INT) as avg_wait_time

Output: avg_wait_time: 4

Is there any way I can get 5, i.e. just the minute difference? The attempts above return 4 because the difference is 264 seconds, 264 / 60 = 4.4, and the cast to INT truncates the fraction.



Solution 1:[1]

You can apply ceil to the result, as below, to round it up to the next integer value.

Data Preparation

from io import StringIO

import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data; the two columns are tab-separated
s = StringIO("""
date_time_closed    date_time_opened
2021-11-23 18:30:10 2021-11-23 18:25:46
""")

df = pd.read_csv(s, delimiter='\t')

sparkDF = spark.createDataFrame(df)

sparkDF.show(truncate=False)

# Parse the timestamp strings into epoch seconds
sparkDF = sparkDF.withColumn('date_time_closed_epoch', F.unix_timestamp(F.col('date_time_closed'), 'yyyy-MM-dd HH:mm:ss'))\
                 .withColumn('date_time_opened_epoch', F.unix_timestamp(F.col('date_time_opened'), 'yyyy-MM-dd HH:mm:ss'))

sparkDF.show()

+-------------------+-------------------+
|date_time_closed   |date_time_opened   |
+-------------------+-------------------+
|2021-11-23 18:30:10|2021-11-23 18:25:46|
+-------------------+-------------------+

+-------------------+-------------------+----------------------+----------------------+
|   date_time_closed|   date_time_opened|date_time_closed_epoch|date_time_opened_epoch|
+-------------------+-------------------+----------------------+----------------------+
|2021-11-23 18:30:10|2021-11-23 18:25:46|            1637672410|            1637672146|
+-------------------+-------------------+----------------------+----------------------+

Ceil

# Raw difference in minutes, and the same value rounded up with ceil
sparkDF = sparkDF.withColumn('avg_wait_time',
                             (F.col('date_time_closed_epoch') - F.col('date_time_opened_epoch')) / 60
                 )\
                 .withColumn('avg_wait_time_ceil',
                             F.ceil((F.col('date_time_closed_epoch') - F.col('date_time_opened_epoch')) / 60)
                 )

sparkDF.show(truncate=False,vertical=True)

-RECORD 0-------------------------------------
 date_time_closed       | 2021-11-23 18:30:10 
 date_time_opened       | 2021-11-23 18:25:46 
 date_time_closed_epoch | 1637672410          
 date_time_opened_epoch | 1637672146          
 avg_wait_time          | 4.4                 
 avg_wait_time_ceil     | 5
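
For reference, the same ceiling can also be written as a single Spark SQL query in the style of the original expression. A minimal sketch, reusing sparkDF and spark from the data preparation above; the view name wait_times is arbitrary:

# Expose the DataFrame to Spark SQL under a temporary view name
sparkDF.createOrReplaceTempView('wait_times')

# Ceil each row's difference in minutes, then average and cast
spark.sql("""
    SELECT CAST(AVG(CEIL((unix_timestamp(date_time_closed)
                          - unix_timestamp(date_time_opened)) / 60)) AS INT) AS avg_wait_time
    FROM wait_times
""").show()

For the sample row this returns avg_wait_time = 5.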

Solution 2:[2]

Don't mess around with fiddly epoch calculations. Just truncate the values to minute precision and subtract those. In PostgreSQL, for example (the same idea carries over to Spark; see the sketch after this example):

=> SELECT date_trunc('minute', current_timestamp);
       date_trunc       
────────────────────────
 2022-05-19 08:44:00+00
(1 row)
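
Spark offers the same building block: F.date_trunc (available since Spark 2.3) drops the seconds before the subtraction, which also matches the minute-boundary behaviour of DATEDIFF(minute, ...). A minimal sketch, reusing sparkDF and F from Solution 1, with the column names from the question:

# Truncate both timestamps to the minute, then take the difference in
# epoch seconds; the seconds component can no longer affect the result
result = sparkDF \
    .withColumn('closed_min', F.date_trunc('minute', F.to_timestamp('date_time_closed'))) \
    .withColumn('opened_min', F.date_trunc('minute', F.to_timestamp('date_time_opened'))) \
    .withColumn('wait_minutes',
                (F.unix_timestamp('closed_min') - F.unix_timestamp('opened_min')) / 60)

result.select('wait_minutes').show()  # 5.0 for the sample row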

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Vaebhav
Solution 2: Richard Huxton