How to speed up reading (listing) 25k small files from S3 with Spark
I have 25,000 small files on MinIO (S3) to parse:

```python
from pyspark.sql.functions import input_file_name

df = spark.read.text("s3a://bucket/*/*/file*.txt").withColumn("path", input_file_name())
# parsing
# writing to parquet
```

Parsing and writing to Parquet are fast, but listing the files through the S3 API takes about 40 minutes. How can I make the listing faster?

I am using Spark 3.1.1 with Hadoop 3.2.
Solution 1:

This is really fast. Instead of expanding the glob level by level, read the bucket prefix directly with `recursiveFileLookup` and filter file names with `pathGlobFilter`:

```python
df = spark.read.option("pathGlobFilter", "file*.txt") \
    .option("recursiveFileLookup", "true") \
    .text("s3a://bucket/") \
    .withColumn("path", input_file_name())
```
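To see why this helps, here is a toy model of the two listing strategies. It is an illustration only, not Spark's or Hadoop's actual code: the simulated `list_keys` stands in for an S3 LIST request, and the key layout (10 × 10 "directories" of 25 files) is made up. Glob expansion has to issue one LIST per matched directory at every level of the pattern, while a recursive lookup does a single flat LIST of the whole prefix and filters names client-side:

```python
from fnmatch import fnmatch

# Toy flat object store: S3 has no real directories, only keys.
store = sorted(f"{a:02d}/{b:02d}/file{i}.txt"
               for a in range(10) for b in range(10) for i in range(25))

list_calls = 0  # counts simulated S3 LIST requests

def list_keys(prefix, delimiter=None):
    """One simulated LIST request. With a delimiter, return only the next
    'directory level' (common prefixes plus direct keys), as a glob walk does."""
    global list_calls
    list_calls += 1
    matches = [k for k in store if k.startswith(prefix)]
    if delimiter is None:
        return matches  # flat recursive listing of the whole prefix
    out = set()
    for k in matches:
        rest = k[len(prefix):]
        d = rest.find(delimiter)
        out.add(prefix + rest[:d + 1] if d >= 0 else k)
    return sorted(out)

def glob_listing(pattern):
    """Glob expansion: walk the pattern one level at a time,
    issuing one LIST per matched 'directory'."""
    parts = pattern.split("/")
    prefixes = [""]
    for part in parts[:-1]:
        nxt = []
        for p in prefixes:
            nxt += [d for d in list_keys(p, "/")
                    if d.endswith("/") and fnmatch(d[len(p):-1], part)]
        prefixes = nxt
    files = []
    for p in prefixes:
        files += [k for k in list_keys(p, "/")
                  if not k.endswith("/") and fnmatch(k[len(p):], parts[-1])]
    return files

def recursive_listing(glob_filter):
    """recursiveFileLookup + pathGlobFilter: one flat LIST, filtered locally."""
    return [k for k in list_keys("")
            if fnmatch(k.rsplit("/", 1)[-1], glob_filter)]

list_calls = 0
a = glob_listing("*/*/file*.txt")
glob_calls = list_calls  # 1 + 10 + 100 = 111 LIST requests

list_calls = 0
b = recursive_listing("file*.txt")
flat_calls = list_calls  # a single LIST request

print(sorted(a) == sorted(b), glob_calls, flat_calls)
```

Both strategies find the same files, but the glob walk needs two orders of magnitude more LIST round-trips here; against a real S3 endpoint each of those is a network request, which is where the 40 minutes went.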
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | lubom |
