to_timestamp() function in PySpark returning null values
I have a data set in CSV format as follows:
| ORDER_ID | ORDER_TIME |
|---|---|
| 8799 | 11/11/2008 01:34:16.564000 AM |
| 8854 | 4/18/2008 01:02:36.564000 AM |
| 8884 | 4/20/2008 10:38:44.886000 PM |
| 8899 | 12/16/2008 07:32:59.456000 AM |
| 8681 | 7/29/2008 08:59:06.250000 PM |
We are reading the file with the following code in PySpark:
from pyspark.sql.types import *
from pyspark.sql import functions as F
df = spark.read \
    .option("header", True) \
    .option("nullValue", "null") \
    .option("delimiter", ",") \
    .option("multiLine", True) \
    .csv(csvfile, encoding="utf-8")
df = df.withColumn("ORDER_TIME",
    F.to_timestamp(F.unix_timestamp("ORDER_TIME", 'M/d/yyyy hh:mm:ss.SSSSSS a').cast('timestamp')))
When we run df.show(), the ORDER_TIME column comes back as null,
but I need it in the standard Spark timestamp format, e.g. 2008-11-11 01:34:16.
If I run the same command in the pyspark terminal, the output is correct for the same input.
All our packages are up to date and I have no idea what is causing this. Looking forward to a solution.
Solution 1:[1]
An approach using a UDF:
import datetime
from pyspark.sql.functions import col, udf

df = spark.createDataFrame(
    [(8854, "11/11/2008 01:34:16.564000 AM"), (8799, "4/18/2008 01:02:36.564000 AM")],
    ("ORDER_ID", "ORDER_TIME"))

def standard_date_format(date):
    # %I (12-hour clock) is required for %p (AM/PM) to take effect;
    # %H would silently ignore the AM/PM marker for PM times
    return datetime.datetime.strptime(date, '%m/%d/%Y %I:%M:%S.%f %p') \
        .strftime('%Y-%m-%d %H:%M:%S')

fn1 = udf(standard_date_format)
df = df.withColumn('ORDER_TIME', fn1(col('ORDER_TIME')))
display(df)
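One subtlety worth checking in the UDF's format string: with an AM/PM marker, strptime needs the 12-hour directive %I rather than %H, otherwise %p is silently ignored and PM times parse to the wrong hour. A minimal standard-library sketch (no Spark required) showing the conversion on its own:

```python
import datetime

def standard_date_format(date):
    # %I = 12-hour clock; required so that %p (AM/PM) is honoured.
    # %f consumes the microseconds, which are dropped on output.
    return datetime.datetime.strptime(
        date, "%m/%d/%Y %I:%M:%S.%f %p"
    ).strftime("%Y-%m-%d %H:%M:%S")

print(standard_date_format("7/29/2008 08:59:06.250000 PM"))   # → 2008-07-29 20:59:06
print(standard_date_format("11/11/2008 01:34:16.564000 AM"))  # → 2008-11-11 01:34:16
```

With %H in place of %I, the first call would return 2008-07-29 08:59:06 instead, because %p has no effect on a 24-hour field.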
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Sudhin |

