Databricks 10.2, PySpark 3.2.0: How Do I Add a New Timestamp Column Based on Another Date and Integer (Hours) Column?

In a Databricks notebook using PySpark, I need to add a new timestamp column derived from an existing date column plus an existing hours-bin integer column. This is to support the creation of an event-driven time-series feature set, which in this case requires that the timestamp be limited to date and hour (no minutes, seconds, etc.). I have tried string-based expr(), date_add(), and various formatted-string and cast() combinations, but I get a maddening slew of errors related to column access, parsing issues, and the like. What is the simplest way to accomplish this?



Solution 1:[1]

In my opinion, unix_timestamp is the simplest method:

from pyspark.sql.functions import col, unix_timestamp

dfResult = dfSource.withColumn("yourNewTimestampColName",
   (unix_timestamp(col("yourExistingDateCol")) +
   (col("yourExistingHoursCol") * 3600)).cast("timestamp"))

Here yourNewTimestampColName is the name of the timestamp column you want to add, yourExistingDateCol is a date column that must be present under this name in the dfSource dataframe, and yourExistingHoursCol is an integer hours column that must likewise be present in dfSource.

The unix_timestamp() method converts the date to seconds since the Unix epoch, so the offset is added in seconds: to add hours, multiply yourExistingHoursCol by 3,600; to add minutes, multiply by 60; to add days, multiply by 3,600 * 24 = 86,400; and so on.
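The epoch-seconds arithmetic can be sketched in plain Python (outside Spark) with the standard datetime module; the date and hours values below are hypothetical placeholders, not from the question:

```python
from datetime import datetime, timezone

# Spark's unix_timestamp() yields seconds since the Unix epoch, so adding
# N hours means adding N * 3600 seconds before casting back to a timestamp.
date_val = datetime(2022, 3, 15, tzinfo=timezone.utc)  # hypothetical date-column value
hours_bin = 5                                          # hypothetical hours-bin value

epoch_seconds = int(date_val.timestamp())              # analogous to unix_timestamp(col(...))
shifted = epoch_seconds + hours_bin * 3600             # add 5 hours, expressed in seconds
result = datetime.fromtimestamp(shifted, tz=timezone.utc)  # analogous to .cast("timestamp")

print(result)  # 2022-03-15 05:00:00+00:00
```

The resulting timestamp carries only the date/hour combination, matching the hours-bin requirement in the question.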

Executing display(dfResult) should show the structure and content of the dfSource dataframe with a new column named yourNewTimestampColName containing the requested date/hour combination.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Alex Ott