PySpark extraction - Issue with timezone in timestamp columns

We are trying to extract data from Teradata with PySpark and write it out as Parquet. We are facing an issue with timestamp columns: the same Parquet file gives different results when read in Spark versus Python.

For example, in Teradata the data is:

DATE1        DATETIME2
2021-01-01   2021-01-01 07:14:13
2021-01-02   2021-01-02 12:05:06

Extraction code

# driver, url, user, and password are defined earlier with the
# Teradata JDBC driver class and connection details
table_sql = '''select * from table_name'''
df = spark.read.format('jdbc') \
    .option('driver', driver) \
    .option('url', url) \
    .option('dbtable', '({sql}) as src'.format(sql=table_sql)) \
    .option('user', user) \
    .option('password', password) \
    .load()
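
The Parquet write step isn't shown above; presumably it is a plain write along these lines (the output path is a placeholder):

# Write the extracted DataFrame out as Parquet (path is a placeholder)
df.write.mode('overwrite').parquet('/data/extracts/table_name')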

After extraction, df.show() in Spark gives the same result, but when the file is read in Python or loaded into the data warehouse, +4 hrs or +5 hrs are added to different rows. It seems like timezone offsets are being added.

When read in Python:

DATE1        DATETIME2
2021-01-01   2021-01-01 12:14:13
2021-01-02   2021-01-02 16:05:06
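
A quick way to localize the shift is to compare the timezone Spark uses when converting timestamps against what a plain Parquet reader shows. Spark stores timestamps in Parquet as UTC instants and renders them through the session timezone, while pandas/pyarrow just returns the stored instant. A small check (the path is a placeholder matching the write step above):

# Timezone Spark uses to render/convert timestamps; defaults to the JVM timezone
print(spark.conf.get('spark.sql.session.timeZone'))

import pandas as pd

# pandas/pyarrow returns the stored UTC instant with no session-timezone rendering
pdf = pd.read_parquet('/data/extracts/table_name')
print(pdf[['DATE1', 'DATETIME2']].head())

If the offsets match the cluster's timezone (e.g. US Eastern for +4/+5 hrs), that points at the conversion happening during the Spark write rather than in the Python read.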

How can we avoid this conversion? Is it happening during the PySpark extraction or in the Parquet file?
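
One common mitigation (a sketch assuming the offset comes from the driver/session timezone, not a confirmed fix for this setup) is to pin both the JVM and the Spark SQL session timezone to UTC before extracting, or to sidestep the conversion entirely by casting the timestamp to a string in the source query:

from pyspark.sql import SparkSession

# Pin the JVM default timezone and the Spark SQL session timezone to UTC.
# Note: extraJavaOptions only take effect before the JVM starts, so set them
# at submit time (spark-submit --conf ...) rather than on a running session.
spark = SparkSession.builder \
    .config('spark.driver.extraJavaOptions', '-Duser.timezone=UTC') \
    .config('spark.executor.extraJavaOptions', '-Duser.timezone=UTC') \
    .config('spark.sql.session.timeZone', 'UTC') \
    .getOrCreate()

# Alternative: preserve Teradata's wall-clock value as text so no timezone math applies
table_sql = '''select DATE1, cast(DATETIME2 as varchar(19)) as DATETIME2 from table_name'''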


