How to read partitions from S3 data with multiple folder hierarchies using PySpark
I have tried reading files whose prefixes contain multiple key=value pairs as partitions. This works well for directories with few date partitions, but in my case (552 date partitions) Spark fails to list all of them and eventually times out.
Example layout:

    s3://bucketA/bronze/sc_reports/p_report_type=daily_transaction_reports/p_report_date=20200901/abc.parquet
    s3://bucketA/bronze/sc_reports/p_report_type=daily_transaction_reports/p_report_date=20200902/xyz.parquet
The code I use:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName(appName).getOrCreate()
    df = spark.read.format("parquet").load("s3://bucketA/bronze/sc_reports")
    df.createOrReplaceTempView("table1")
    # `date` is defined earlier; Spark SQL does not expand $date on its own
    df2 = spark.sql(f"select * from table1 where p_report_date = {date}")
What's the optimal way to read multiple partitions with Spark?
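For context, a workaround I have been considering is to point Spark directly at the partition directories I need instead of the table root, and set the basePath option so the p_report_type and p_report_date partition columns are still inferred from the directory names. This is a minimal sketch, assuming a single date supplied by the caller; I have not verified it at the full 552-partition scale:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sc_reports").getOrCreate()

    base = "s3://bucketA/bronze/sc_reports"
    date = "20200901"  # assumed input; replace with the date you need

    # Loading one partition directory avoids listing all 552 date partitions.
    # basePath tells Spark where the partition spec starts, so the key=value
    # directory names still become DataFrame columns.
    df = (
        spark.read
        .option("basePath", base)
        .parquet(f"{base}/p_report_type=daily_transaction_reports/p_report_date={date}")
    )

With this approach the where clause on p_report_date becomes unnecessary, since only the requested partition is ever listed and read.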