How to read partitions from S3 data with multiple folder hierarchies using PySpark
I have tried reading files whose prefixes contain multiple key=value pairs as partitions. This works well for directories with few date partitions, but in my case (552 date partitions) Spark fails to list all of them and eventually times out.
Example layout:

    s3://bucketA/bronze/sc_reports/p_report_type=daily_transaction_reports/p_report_date=20200901/abc.parquet
    s3://bucketA/bronze/sc_reports/p_report_type=daily_transaction_reports/p_report_date=20200902/xyz.parquet
The code I use:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName(appName).getOrCreate()
    df = spark.read.format("parquet").load("s3://bucketA/bronze/sc_reports")
    df.createOrReplaceTempView("table1")
    # `date` is defined earlier; Spark SQL does not expand $date on its own
    df2 = spark.sql(f"select * from table1 where p_report_date = {date}")
What's the optimal way to read multiple partitions with Spark?
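For context, a workaround I have been considering is to point Spark directly at the partition directories I need instead of the table root, and set the basePath option so the p_report_type and p_report_date partition columns are still inferred from the directory names. This is a minimal sketch, assuming a single date supplied by the caller; I have not verified it at the full 552-partition scale:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sc_reports").getOrCreate()

    base = "s3://bucketA/bronze/sc_reports"
    date = "20200901"  # assumed input; replace with the date you need

    # Loading one partition directory avoids listing all 552 date partitions.
    # basePath tells Spark where the partition spec starts, so the key=value
    # directory names still become DataFrame columns.
    df = (
        spark.read
        .option("basePath", base)
        .parquet(f"{base}/p_report_type=daily_transaction_reports/p_report_date={date}")
    )

With this approach the where clause on p_report_date becomes unnecessary, since only the requested partition is ever listed and read.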