Read data from AWS S3 using PySpark and Python (read all columns: partitioned column also)

I saved a Spark DataFrame to AWS S3 in Parquet format, partitioned by the column "channel_name". The code below is how I saved it to S3:

# Note: the "header" option only applies to CSV; the Parquet writer ignores it
df.write.option("header", True) \
        .partitionBy("channel_name") \
        .mode("overwrite") \
        .parquet("s3://path/")
The DataFrame I wrote looks like this:

channel_name  start_timestamp      value  Outlier
TEMP          2021-07-19 07:27:51  21     false
TEMP          2021-07-19 08:21:05  24     false
Vel           2021-07-19 08:20:18  22     false
Vel           2021-07-19 08:21:54  26     false
TEMP          2021-07-19 08:21:23  25     false
TEMP          2021-07-16 08:22:41  88     false
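Note that partitionBy stores the partition column as directory names rather than inside the Parquet files themselves, so the resulting S3 layout looks roughly like this (an illustrative sketch; the actual part-file names will differ):

s3://path/channel_name=TEMP/part-00000-<uuid>.snappy.parquet
s3://path/channel_name=Vel/part-00000-<uuid>.snappy.parquet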

Since the data was partitioned by "channel_name", that column is missing when I read the same data back from S3. Below is my code for PySpark and for Python.

df = spark.read.parquet("s3://Path/")  # PySpark
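For what it's worth, Spark's partition discovery normally restores "channel_name" when the read targets the dataset's root path; the column only disappears if a partition subdirectory is read directly. In that case the basePath option tells Spark where the partitioning starts (a minimal sketch, reusing the elided path above):

# Reading the root path: partition discovery should add channel_name back
df = spark.read.parquet("s3://Path/")

# Reading one partition directly drops the column unless basePath is set
df_temp = spark.read.option("basePath", "s3://Path/") \
                    .parquet("s3://Path/channel_name=TEMP/")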

For Python, I am using AWS Data Wrangler:

import awswrangler as wr

df = wr.s3.read_parquet(path="s3://Path/")
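As an aside, awswrangler only reconstructs partition columns when the path is read as a dataset; read_parquet accepts a dataset flag for this (a minimal sketch based on the awswrangler API):

import awswrangler as wr

# dataset=True tells wrangler to parse the Hive-style partition
# directories (channel_name=...) back into a regular column
df = wr.s3.read_parquet(path="s3://Path/", dataset=True)

Without dataset=True, the call returns only the columns stored inside the Parquet files.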

This is how df looks without the column "channel_name":

start_timestamp      value  Outlier
2021-07-19 07:27:51  21     false
2021-07-19 08:21:05  24     false
2021-07-19 08:20:18  22     false
2021-07-19 08:21:54  26     false
2021-07-19 08:21:23  25     false
2021-07-16 08:22:41  88     false

How can I read the complete data, including the partition column? Please let me know if there is an alternative.
