PySpark extraction - Issue with timezone in timestamp columns
We are trying to extract data from Teradata with PySpark and write it out as Parquet, but we are facing an issue with timestamp columns: the same Parquet file gives different results when read in Spark and in Python.
For example, the data in Teradata is:
| DATE1 | DATETIME2 |
|---|---|
| 2021-01-01 | 2021-01-01 07:14:13 |
| 2021-01-02 | 2021-01-02 12:05:06 |
Extraction code
```python
table_sql = '''select * from table_name'''

df = spark.read.format('jdbc') \
    .option('driver', driver) \
    .option('url', url) \
    .option('dbtable', '({sql}) as src'.format(sql=table_sql)) \
    .option('user', user) \
    .option('password', password) \
    .load()
```
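The Parquet write itself is not shown in the question; assuming it is a plain `df.write.parquet(...)`, a minimal sketch (with a hypothetical output path) would be:

```python
# Hypothetical write step (not shown in the question) -- the output
# path is assumed for illustration only.
output_path = '/tmp/table_name_extract'

# Spark keeps timestamps internally as UTC instants; df.show() renders
# them back in spark.sql.session.timeZone, which is why the values look
# unchanged inside Spark but can appear shifted to other readers.
df.write.mode('overwrite').parquet(output_path)
```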
After extraction, `df.show()` in Spark gives the same result, but when the file is read in Python or loaded into the data warehouse, +4 hrs or +5 hrs are added to different rows. It seems like a timezone offset is being added (see the diagnostic sketch after the table below).
When read in Python:
| DATE1 | DATETIME2 |
|---|---|
| 2021-01-01 | 2021-01-01 12:14:13 |
| 2021-01-02 | 2021-01-02 16:05:06 |
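One way to see where the shift comes from (this is not part of the original post, and the file path is hypothetical) is to inspect the Parquet schema and data with pyarrow/pandas, and compare that against `spark.conf.get('spark.sql.session.timeZone')` in the PySpark session:

```python
# Minimal diagnostic sketch -- the path is assumed for illustration.
import pandas as pd
import pyarrow.parquet as pq

path = '/tmp/table_name_extract'

# Parquet-level schema: shows whether DATETIME2 was written as an
# instant (e.g. INT96 / timestamp adjusted to UTC) or as a naive
# local timestamp.
print(pq.read_schema(path))

# pandas/pyarrow returns the stored values without converting them back
# to the Spark session timezone, which is where a +4h/+5h difference
# (e.g. EDT/EST offsets from UTC) would become visible.
print(pd.read_parquet(path)[['DATE1', 'DATETIME2']].head())
```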
How can this conversion be avoided? Is it happening during the PySpark extraction or in the Parquet file?
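Not part of the original question, but if the shift is caused by the session/JVM timezone differing from the timezone the Teradata values represent, two commonly tried workarounds are pinning the Spark session timezone and pushing the formatting down to Teradata so the value never gets reinterpreted as an instant. A minimal sketch under those assumptions:

```python
# Sketch of two possible mitigations -- these are assumptions to verify
# against your environment, not a confirmed fix from the original post.

# Option 1: pin the session timezone so timestamp conversion uses an
# explicit zone instead of the cluster's local zone. Note the JVM
# default timezone (-Duser.timezone) can also affect JDBC reads.
spark.conf.set('spark.sql.session.timeZone', 'UTC')

# Option 2: cast the timestamp to text in the pushed-down Teradata
# query so it arrives as a plain string (column and length are
# illustrative only), then parse it explicitly downstream.
table_sql = '''
    select DATE1,
           cast(DATETIME2 as varchar(26)) as DATETIME2
    from table_name
'''
```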