Why would I want to merge multiple Parquet files into a single Parquet file?

Let's say I have a CSV file with several hundred million records. I want to convert that CSV into a Parquet file using Python and Pandas to read the CSV and write the Parquet file. But because the file is too big to read into memory and write out as a single Parquet file, I decided to read the CSV in chunks of 5M records and create a Parquet file for every chunk. Why would I want to merge all of those Parquet files into a single Parquet file?
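For reference, a minimal sketch of the chunked conversion described above, assuming a hypothetical input file named data.csv and that pandas plus a Parquet engine (e.g. pyarrow) are installed; the chunk size and output file names are illustrative:

```python
import pandas as pd

# Illustrative chunk size from the question: 5M rows per Parquet file.
chunk_size = 5_000_000

# Stream the CSV in chunks so the whole file never sits in memory,
# writing one Parquet file per chunk (part_00000.parquet, part_00001.parquet, ...).
for i, chunk in enumerate(pd.read_csv("data.csv", chunksize=chunk_size)):
    chunk.to_parquet(f"part_{i:05d}.parquet", index=False)
```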

Thanks in advance.



Solution 1:[1]

In general, this is the "small files problem": for companies working with big data, file-count limits can become an issue if the problem is not kept consistently under control.

It's a problem worth solving because splitting data into many small files brings no read-performance benefit: each Parquet file already consists of multiple row groups, which by itself ensures good parallelism during FileScan operations.
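As a quick illustration (not part of the original answer), pyarrow can report how many row groups a file contains; the file name here is a hypothetical chunk from the question:

```python
import pyarrow.parquet as pq

# Inspect the row-group layout of one chunk file; engines such as Spark
# parallelize FileScan work across these row groups.
meta = pq.ParquetFile("part_00000.parquet").metadata
print(f"{meta.num_row_groups} row groups, {meta.num_rows} rows")
```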

However, jobs tend to gravitate toward the small files problem because there is a write-performance benefit: building an overly large Parquet file with too many row groups before it is flushed to disk can be extremely memory intensive (costly both in provisioned resources and in job duration).
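As a practical follow-up (an illustrative sketch, not part of the original answer), one way to coalesce the chunk files back into a single Parquet file without holding all rows in memory is to stream row groups through a pyarrow ParquetWriter; this assumes the chunk files share an identical schema and match a hypothetical part_*.parquet naming pattern:

```python
import glob
import pyarrow.parquet as pq

# Assumed layout: chunk files named part_*.parquet with identical schemas.
files = sorted(glob.glob("part_*.parquet"))
schema = pq.ParquetFile(files[0]).schema_arrow

with pq.ParquetWriter("merged.parquet", schema) as writer:
    for path in files:
        pf = pq.ParquetFile(path)
        # Copy one row group at a time to keep memory usage bounded.
        for i in range(pf.num_row_groups):
            writer.write_table(pf.read_row_group(i))
```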

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Tony Ng