How to make Spark's listing of 25k small files on S3 faster

I have 25,000 small files on MinIO (S3-compatible storage) to parse:

    df = spark.read.text("s3a://bucket/*/*/file*.txt").withColumn("path", input_file_name())
    # parsing
    # writing to parquet

Parsing and writing to Parquet are fast, but listing the files through the S3 API takes about 40 minutes. How can I make the listing faster?

I am using Spark 3.1.1 with Hadoop 3.2.



Solution 1:

This is really fast. Instead of expanding a glob over the directory tree, let Spark do a single recursive listing of the bucket and filter file names with `pathGlobFilter`:

    df = spark.read \
        .option("pathGlobFilter", "file*.txt") \
        .option("recursiveFileLookup", "true") \
        .text("s3a://bucket/") \
        .withColumn("path", input_file_name())
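A rough intuition for why this wins: expanding a glob like `*/*/file*.txt` forces a LIST request per directory at each level, while a recursive lookup can exploit S3's flat key namespace and page through the whole prefix about 1,000 keys at a time. The sketch below is a back-of-the-envelope call count, not a measurement; the directory layout (50 × 50 dirs, 10 files each) is an assumed example, not taken from the question.

```python
import math

# Hypothetical layout: 50 top-level dirs x 50 subdirs x 10 files = 25,000 files.
d1, d2, files_per_dir = 50, 50, 10
total_files = d1 * d2 * files_per_dir

# Glob expansion ("*/*/file*.txt") walks the tree level by level:
# 1 LIST for the bucket root, one per first-level dir, one per second-level dir.
glob_list_calls = 1 + d1 + d1 * d2

# A recursive lookup can issue paginated LIST requests over the whole prefix,
# with up to ~1000 keys returned per response.
flat_list_calls = math.ceil(total_files / 1000)

print(glob_list_calls)  # 2551 LIST requests
print(flat_list_calls)  # 25 LIST requests
```

Under these assumptions the glob needs roughly two orders of magnitude more API round-trips, which is consistent with the 40-minute listing the question describes.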

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: lubom