Merging too many small files into single large files in a data lake using Apache Spark

I have the following directory structure in HDFS:

/user/hdfs/landing_zone/year=2021/month=11/day=01/part-1.txt
/user/hdfs/landing_zone/year=2021/month=11/day=01/part-2.txt
/user/hdfs/landing_zone/year=2021/month=11/day=01/part-3.txt
/user/hdfs/landing_zone/year=2021/month=11/day=02/part-1.txt
/user/hdfs/landing_zone/year=2021/month=11/day=02/part-2.txt
/user/hdfs/landing_zone/year=2021/month=11/day=02/part-3.txt
/user/hdfs/landing_zone/year=2021/month=11/day=03/part-1.txt
/user/hdfs/landing_zone/year=2021/month=11/day=03/part-2.txt
/user/hdfs/landing_zone/year=2021/month=11/day=03/part-3.txt

I want to merge the files day-wise, so that each day's directory ends up with a single file:

/user/hdfs/landing_zone/year=2021/month=11/day=01/part-1.txt
/user/hdfs/landing_zone/year=2021/month=11/day=02/part-1.txt
/user/hdfs/landing_zone/year=2021/month=11/day=03/part-1.txt

I have used the code below.

import org.apache.hadoop.fs.{FileSystem, Path}

val inputDir = "/user/hdfs/landing_zone/year=2021/month=11/"
val fs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val baseFolder = new Path(inputDir)
val files = fs.listStatus(baseFolder).map(_.getPath.toString)
for (path <- files) {
  val Folder_Path = fs.listStatus(new Path(path)).map(_.getPath).toList
  for (eachfolder <- Folder_Path) {
    val New_Folder_Path: String = eachfolder.toString
    val FilePath = fs.listStatus(new Path(New_Folder_Path)).filter(_.isFile).map(_.getPath).toList
    val NewFiles = fs.listStatus(new Path(New_Folder_Path)).filter(_.isFile).map(_.getPath.getName).toList
    // ... merge logic still missing here
  }
}

"FilePath" generates the list of complete paths for all the files, recursively:

List(/user/hdfs/landing_zone/year=2021/month=11/day=01/part-1.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=01/part-2.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=01/part-3.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=02/part-1.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=02/part-2.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=02/part-3.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=03/part-1.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=03/part-2.txt)
List(/user/hdfs/landing_zone/year=2021/month=11/day=03/part-3.txt)

"NewFiles" generates the list of file names for all the files, recursively:

List(part-1.txt)
List(part-2.txt)
List(part-3.txt)
List(part-1.txt)
List(part-2.txt)
List(part-3.txt)
List(part-1.txt)
List(part-2.txt)
List(part-3.txt)

Can someone suggest or guide me on how to modify the code so that it merges the files day-wise, combining the 3 files per day (1 day = 3 files) into a single file (1 day = 1 file), recursively for all the days?



Solution 1:[1]

There are easier ways than getting into low-level file manipulation. I would suggest "picking the table up and putting it back down."

Literally: create a table based on the files, then write it out to a new table. This should concatenate the small files, without you having to manipulate them yourself.
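As a minimal sketch of that idea (the output path `/user/hdfs/compacted_zone/` is hypothetical, and the files are assumed to be plain text): read the whole partitioned directory back with Spark, repartition on the partition columns so each day's rows land in one task, and write it back out partitioned by day. Each day directory then gets a single output file.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("compact-small-files").getOrCreate()

// Reading the base path lets Spark discover the year=/month=/day=
// partition directories and expose them as columns.
val df = spark.read.text("/user/hdfs/landing_zone/")

// Repartitioning on the partition columns puts all rows for a given day
// into one task, so partitionBy writes one file per day directory.
df.repartition(col("year"), col("month"), col("day"))
  .write
  .partitionBy("year", "month", "day")
  .mode("overwrite")
  .text("/user/hdfs/compacted_zone/")  // hypothetical output path
```

Once the rewritten data is verified, the original small files can be removed and the compacted directory swapped into place.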

If you have created a Hive table over this data, you can let Hive do the work:

ALTER TABLE table_name [PARTITION (partition_key = 'partition_value' [, ...])] CONCATENATE;
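For the layout above, with a hypothetical table name `landing` and string-typed partition columns, that would be one statement per day partition, e.g.:

```sql
ALTER TABLE landing PARTITION (year='2021', month='11', day='01') CONCATENATE;
```

Note that Hive's CONCATENATE only supports tables stored as RCFile or ORC, so it does not apply to plain-text files like the `part-*.txt` files shown in the question unless the data is first loaded into a table in one of those formats.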

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Matt Andruff