SparkSQL - Difference between two timestamps in minutes, irrespective of seconds
I am trying to convert the SQL version of some code to PySpark. The requirement is to get the difference in minutes between two timestamps.
Inputs:
+-------------------+-------------------+
|date_time_closed   |date_time_opened   |
+-------------------+-------------------+
|2021-11-23 18:30:10|2021-11-23 18:25:46|
+-------------------+-------------------+
The required output is 5.
The SQL version of the code is:
avg(DATEDIFF(minute, date_time_opened, date_time_closed)) as avg_wait_time
Output: avg_wait_time: 5
Spark SQL: I have tried the Spark SQL snippets below to get the value, but they take seconds into account as well.
CAST(avg((unix_timestamp(date_time_closed)-unix_timestamp(date_time_opened))/(60)) as INT) as avg_wait_time
Output: avg_wait_time: 4
CAST(avg((bigint(to_timestamp(date_time_closed)) - bigint(to_timestamp(date_time_opened))) / 60) AS INT) as avg_wait_time
Output: avg_wait_time: 4
Is there any way I can get 5, i.e. just the minute difference?
Solution 1:[1]
You can apply ceil to the result, as below, to round it up to the next integer.
Data Preparation
from io import StringIO

import pandas as pd
from pyspark.sql import SparkSession, functions as F

sql = SparkSession.builder.getOrCreate()

# Note: the two columns must be tab-separated to match delimiter='\t'
s = StringIO("""
date_time_closed	date_time_opened
2021-11-23 18:30:10	2021-11-23 18:25:46
""")
df = pd.read_csv(s, delimiter='\t')
sparkDF = sql.createDataFrame(df)
sparkDF.show(truncate=False)
sparkDF.show(truncate=False)
sparkDF = sparkDF.withColumn('date_time_closed_epoch',F.unix_timestamp(F.col('date_time_closed'),'yyyy-MM-dd HH:mm:ss'))\
.withColumn('date_time_opened_epoch',F.unix_timestamp(F.col('date_time_opened'),'yyyy-MM-dd HH:mm:ss'))
sparkDF.show()
+-------------------+-------------------+
|date_time_closed |date_time_opened |
+-------------------+-------------------+
|2021-11-23 18:30:10|2021-11-23 18:25:46|
+-------------------+-------------------+
+-------------------+-------------------+----------------------+----------------------+
| date_time_closed| date_time_opened|date_time_closed_epoch|date_time_opened_epoch|
+-------------------+-------------------+----------------------+----------------------+
|2021-11-23 18:30:10|2021-11-23 18:25:46| 1637672410| 1637672146|
+-------------------+-------------------+----------------------+----------------------+
Ceil
sparkDF = sparkDF.withColumn('avg_wait_time',
(F.col('date_time_closed_epoch') - F.col('date_time_opened_epoch'))/60
)\
.withColumn('avg_wait_time_ceil',F.ceil(
(F.col('date_time_closed_epoch') - F.col('date_time_opened_epoch'))/60)
)
sparkDF.show(truncate=False,vertical=True)
-RECORD 0-------------------------------------
date_time_closed | 2021-11-23 18:30:10
date_time_opened | 2021-11-23 18:25:46
date_time_closed_epoch | 1637672410
date_time_opened_epoch | 1637672146
avg_wait_time | 4.4
avg_wait_time_ceil | 5
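A caveat worth noting (my own addition, not part of the original answer): T-SQL's DATEDIFF(minute, ...) counts minute boundaries crossed, which is equivalent to truncating both timestamps to the minute before subtracting. ceil happens to match that for the sample row, but the two can disagree, e.g. when the open time falls exactly on a minute. A minimal pure-Python sketch of the two semantics:

```python
from datetime import datetime
from math import ceil

def datediff_minutes(opened, closed):
    # DATEDIFF(minute, ...) semantics: truncate seconds, then subtract.
    trunc = lambda t: t.replace(second=0, microsecond=0)
    return int((trunc(closed) - trunc(opened)).total_seconds() // 60)

opened = datetime(2021, 11, 23, 18, 25, 46)
closed = datetime(2021, 11, 23, 18, 30, 10)

print(datediff_minutes(opened, closed))               # 5
print(ceil((closed - opened).total_seconds() / 60))   # 5 -- agrees here

# The two disagree when the open time falls exactly on a minute:
opened2 = datetime(2021, 11, 23, 18, 25, 0)
print(datediff_minutes(opened2, closed))              # 5
print(ceil((closed - opened2).total_seconds() / 60))  # 6 -- ceil overcounts
```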
Solution 2:[2]
Don't mess around with fiddly epoch calculations. Just truncate the values to minute precision and subtract them.
=> SELECT date_trunc('minute', current_timestamp);
date_trunc
------------------------
2022-05-19 08:44:00+00
(1 row)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Vaebhav |
| Solution 2 | Richard Huxton |
