Spark glob filter to match a specific nested partition
I'm using PySpark, but I assume this applies to Scala as well.
My data is stored on S3 in the following structure:
main_folder
└── year=2022
└── month=03
├── day=01
│ ├── valid=false
│ │ └── example1.parquet
│ └── valid=true
│ └── example2.parquet
└── day=02
├── valid=false
│ └── example3.parquet
└── valid=true
└── example4.parquet
(For simplicity there is only one file in each folder and only two days; in reality there can be thousands of files and many days/months/years.)
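For context, this layout is what a partitioned write produces. A rough sketch of how such data might be written is below; the rows and schema are purely illustrative and simplified (in my real data the valid=true and valid=false files were written with different schemas):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Purely illustrative rows: writing with partitionBy on year/month/day/valid
# produces the year=/month=/day=/valid= directory layout shown above.
df = spark.createDataFrame(
    [("2022", "03", "01", "false", "example1"),
     ("2022", "03", "01", "true", "example2")],
    ["year", "month", "day", "valid", "payload"],
)
df.write.partitionBy("year", "month", "day", "valid").parquet("s3://main_folder")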
The files under the valid=true and valid=false partitions have completely different schemas, and I only want to read the files in the valid=true partition.
I tried using a glob filter, but it fails with "AnalysisException: Unable to infer schema for Parquet. It must be specified manually.", which is a symptom of no files matching the pattern:

spark.read.parquet('s3://main_folder', pathGlobFilter='*valid=true*')
I noticed that something like this works:

spark.read.parquet('s3://main_folder', pathGlobFilter='*example4*')
However, as soon as I include a slash or try to match anything above the file level, it fails:

spark.read.parquet('s3://main_folder', pathGlobFilter='*/example4*')
spark.read.parquet('s3://main_folder', pathGlobFilter='*valid=true*example4*')
I also tried replacing the * with ** in all locations, but it didn't work.
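For reference, these are the ** variants I mean; the patterns below just mirror the attempts above with every * doubled, and they all fail with the same AnalysisException:

# Illustrative ** variants of the earlier attempts; none of them matched any files for me
spark.read.parquet('s3://main_folder', pathGlobFilter='**valid=true**')
spark.read.parquet('s3://main_folder', pathGlobFilter='**/example4**')
spark.read.parquet('s3://main_folder', pathGlobFilter='**valid=true**example4**')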