Read data from AWS S3 using PySpark and Python (read all columns, including the partition column)
I have saved a Spark DataFrame to AWS S3 in Parquet format, partitioned by the column "channel_name". Below is the code I used to save it:

```python
df.write.option("header", True) \
    .partitionBy("channel_name") \
    .mode("overwrite") \
    .parquet("s3://path/")
```
| channel_name | start_timestamp | value | Outlier |
|---|---|---|---|
| TEMP | 2021-07-19 07:27:51 | 21 | false |
| TEMP | 2021-07-19 08:21:05 | 24 | false |
| Vel | 2021-07-19 08:20:18 | 22 | false |
| Vel | 2021-07-19 08:21:54 | 26 | false |
| TEMP | 2021-07-19 08:21:23 | 25 | false |
| TEMP | 2021-07-16 08:22:41 | 88 | false |
Since the data was partitioned by "channel_name", that column is now missing when I read the same data back from S3. Below is my code. For PySpark:

```python
df = spark.read.parquet("s3://Path/")  # spark
```
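(Not the asker's code; a sketch of the usual fix.) Spark normally rebuilds partition columns from the `channel_name=...` directory names when the read points at the dataset root. The column typically goes missing when only a partition subdirectory is read; the `basePath` option tells Spark where the partitioned layout starts. The helper name below is hypothetical:

```python
def read_with_partitions(spark, root="s3://path/"):
    """Hypothetical helper: read a partitioned Parquet dataset so that
    Spark reconstructs "channel_name" from the channel_name=... directories.

    basePath matters when reading only some partitions, e.g.
    spark.read.option("basePath", root).parquet(root + "channel_name=TEMP/")
    still yields a "channel_name" column.
    """
    return spark.read.option("basePath", root).parquet(root)
```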
and for Python I am using AWS Data Wrangler:

```python
import awswrangler as wr

df = wr.s3.read_parquet(path="s3://Path/")
```
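(Not the asker's code; a sketch of the usual fix for awswrangler.) By default `wr.s3.read_parquet` reads the Parquet files as-is, and the partition column is not stored inside the files themselves, only in the directory names. Passing `dataset=True` makes awswrangler treat the prefix as a partitioned dataset and rebuild those columns. The helper name below is hypothetical:

```python
def read_partitioned(path="s3://Path/"):
    """Hypothetical helper: read a Hive-partitioned Parquet dataset from S3,
    including the partition column "channel_name"."""
    import awswrangler as wr  # imported here so the sketch stays self-contained
    # dataset=True: treat the prefix as a partitioned dataset and
    # reconstruct partition columns from the channel_name=... directories.
    return wr.s3.read_parquet(path=path, dataset=True)
```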
This is how df looks without the column "channel_name":
| start_timestamp | value | Outlier |
|---|---|---|
| 2021-07-19 07:27:51 | 21 | false |
| 2021-07-19 08:21:05 | 24 | false |
| 2021-07-19 08:20:18 | 22 | false |
| 2021-07-19 08:21:54 | 26 | false |
| 2021-07-19 08:21:23 | 25 | false |
| 2021-07-16 08:22:41 | 88 | false |
How can I read the complete data, including the partition column? Please let me know if there is an alternative.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow