Merging too many small files into single large files in a data lake using Apache Spark
I have the following directory structure in HDFS:
/user/hdfs/landing_zone/year=2021/month=11/day=01/part-1.txt
/user/hdfs/landing_zone/year=2021/month=11/day=01/part-2.txt
/user/hdfs/landing_zone/year=2021/month=11/day=01/part-3.txt
/user/hdfs/landing_zone/year=2021/month=11/day=02/part-1.txt
/user/hdfs/landing_zone/year=2021/month=11/day=02/part-2.txt
/user/hdfs/landing_zone/year=2021/month=11/day=02/part-3.txt
/user/hdfs/landing_zone/year=2021/month=11/day=03/part-1.txt
/user/hdfs/landing_zone/year=2021/month=11/day=03/part-2.txt
/user/hdfs/landing_zone/year=2021/month=11/day=03/part-3.txt
I want to merge the files day-wise, so that each day ends up with a single file:
/user/hdfs/landing_zone/year=2021/month=11/day=01/part-1.txt
/user/hdfs/landing_zone/year=2021/month=11/day=02/part-1.txt
/user/hdfs/landing_zone/year=2021/month=11/day=03/part-1.txt
I have used the code below:
```scala
import org.apache.hadoop.fs.{FileSystem, Path}

val inputDir = "/user/hdfs/landing_zone/year=2021/month=11/"
val sc = spark.sparkContext
val fs: FileSystem = FileSystem.get(sc.hadoopConfiguration)

// List the day=XX sub-folders under the month folder
val baseFolder = new Path(inputDir)
val files = fs.listStatus(baseFolder).map(_.getPath.toString)

for (path <- files) {
  var Folder_Path = fs.listStatus(new Path(path)).map(_.getPath).toList
  for (eachfolder <- Folder_Path) {
    var New_Folder_Path: String = eachfolder.toString
    // Complete paths of the files under the current path
    var FilePath = fs.listStatus(new Path(New_Folder_Path)).filter(_.isFile).map(_.getPath).toList
    // File names only
    var NewFiles = fs.listStatus(new Path(New_Folder_Path)).filter(_.isFile).map(_.getPath.getName).toList
  }
}
```
"FilePath" : Generating the List of Complete Path for all the files recursively.
List(/user/hdfs/landing_zone/year=2021/month=11/day=01/part-1.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=01/part-2.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=01/part-3.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=02/part-1.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=02/part-2.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=02/part-3.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=03/part-1.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=03/part-2.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=03/part-3.txt)
"NewFiles" : - Generating the list of FileNames for all the files recursively
List(part-1.txt)
List(part-2.txt)
List(part-3.txt)
List(part-1.txt)
List(part-2.txt)
List(part-3.txt)
List(part-1.txt)
List(part-2.txt)
List(part-3.txt)
Can someone suggest how I should modify the code so that it generates the files day-wise and merges the three files per day (1 day = 3 files) into a single file (1 day = 1 file) for all the days?
Solution 1:[1]
There are easier ways than getting into low-level file manipulation. I would suggest "picking the table up and putting it back down":
Literally, create a table based on the files and write it out to a new table. This should concatenate the small files without you having to manipulate them yourself.
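For example, here is a minimal sketch of that "read it and write it back" approach in Spark, assuming the landing-zone files are plain text and that writing the compacted copy to a new (hypothetical) /user/hdfs/compacted_zone directory is acceptable:
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CompactSmallFiles").getOrCreate()
import spark.implicits._

// Keep day=01 as the string "01" instead of inferring the integer 1,
// so the rewritten directory names match the original layout.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

// "Pick the table up": Spark discovers year/month/day as partition columns
// from the year=.../month=.../day=... directory names.
val df = spark.read.text("/user/hdfs/landing_zone/")

// "Put it back down": shuffling on the partition columns sends each day's rows
// to a single task, so partitionBy produces one output file per day directory.
df.repartition($"year", $"month", $"day")
  .write
  .partitionBy("year", "month", "day")
  .mode("overwrite")
  .text("/user/hdfs/compacted_zone/")
```
Once the compacted copy is verified, it could replace the original landing_zone directory, or the job could write to a separate output path permanently.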
If you have created a Hive table over these files, you could use Hive to do the work:
```sql
ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])] CONCATENATE;
```
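For example, for a hypothetical Hive table named landing_zone partitioned by year/month/day, compacting a single day would look like this (note that Hive's CONCATENATE is supported for tables stored as RCFile or ORC, so plain text files would first need to be loaded into such a table):
```sql
ALTER TABLE landing_zone PARTITION (year = '2021', month = '11', day = '01') CONCATENATE;
```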
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Matt Andruff |
