Spark glob filter to match a specific nested partition

I'm using PySpark, but I guess this applies to Scala as well.
My data is stored on S3 in the following structure:

main_folder
└── year=2022
    └── month=03
        ├── day=01
        │   ├── valid=false
        │   │   └── example1.parquet
        │   └── valid=true
        │       └── example2.parquet
        └── day=02
            ├── valid=false
            │   └── example3.parquet
            └── valid=true
                └── example4.parquet

(For simplicity there is only one file per folder and only two days; in reality there can be thousands of files and many days/months/years.)

The files under the valid=true and valid=false partitions have completely different schemas, and I only want to read the files under the valid=true partition.
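
For anyone who wants to reproduce this locally, here is a minimal sketch of how such a layout could be written (the column names and values are made up; only the partition structure and the differing schemas matter):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
base = '/tmp/main_folder'  # local stand-in for s3://main_folder

# The two branches are written separately because their schemas differ.
good = spark.createDataFrame(
    [('2022', '03', '01', 'ok'), ('2022', '03', '02', 'ok')],
    ['year', 'month', 'day', 'payload'],
).withColumn('valid', F.lit(True))
bad = spark.createDataFrame(
    [('2022', '03', '01', 42), ('2022', '03', '02', 7)],
    ['year', 'month', 'day', 'error_code'],
).withColumn('valid', F.lit(False))

# partitionBy nests the directories as year/month/day/valid, matching the tree above.
good.write.partitionBy('year', 'month', 'day', 'valid').parquet(base)
bad.write.mode('append').partitionBy('year', 'month', 'day', 'valid').parquet(base)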

I tried using the glob filter, but it fails with "AnalysisException: Unable to infer schema for Parquet. It must be specified manually.", which is a symptom of having no data (no files matched the pattern):

spark.read.parquet('s3://main_folder', pathGlobFilter='*valid=true*')

I noticed that something like this works:

spark.read.parquet('s3://main_folder', pathGlobFilter='*example4*')

However, as soon as I use a slash or try to match anything above the file level, it fails:

spark.read.parquet('s3://main_folder', pathGlobFilter='*/example4*')
spark.read.parquet('s3://main_folder', pathGlobFilter='*valid=true*example4*')

I also tried replacing the * with ** in all positions, but that didn't work either.
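
Based on these observations, it looks like pathGlobFilter is applied only to the leaf file name, not to the directory part of the path. One workaround I can think of (a sketch, not something I have verified against the real bucket) is to put the directory pattern in the load path itself, where glob expansion does handle slashes, and to set the basePath option so year/month/day/valid are still discovered as partition columns:

# Sketch: glob the partition directories in the path instead of using pathGlobFilter.
# basePath keeps year/month/day/valid available as partition columns.
df = (
    spark.read
    .option('basePath', 's3://main_folder/')
    .parquet('s3://main_folder/year=*/month=*/day=*/valid=true/')
)

Still, I'd like to understand whether pathGlobFilter itself can express this.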


