How can I write to Parquet faster using Java?

I'm trying to write a Dataset object as a Parquet file using Java.

I followed this example to do so, but it is absurdly slow.

It takes ~1.5 minutes to write ~10 MB of data, so it isn't going to scale well when I want to write hundreds of MB. I did some CPU profiling and found that 99% of the time was spent in the ParquetWriter.write() method.

I tried increasing the page size and block size of the ParquetWriter, but that doesn't seem to have any effect on performance. Is there any way to make this process faster, or is it just a limitation of the Parquet library?



Solution 1:[1]

I've had reasonable luck using org.apache.parquet.hadoop.ParquetWriter to write org.apache.parquet.example.data.Group objects created by org.apache.parquet.example.data.simple.SimpleGroupFactory.

https://github.com/apache/parquet-mr/blob/master/parquet-benchmarks/src/main/java/org/apache/parquet/benchmarks/DataGenerator.java
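A minimal sketch of that approach, assuming parquet-hadoop and the Hadoop client libraries are on the classpath. The schema, the output file name, the row count, and the class name GroupWriteSketch are made up for illustration and do not come from the question; it uses ExampleParquetWriter, the ParquetWriter builder that ships alongside the example Group classes.

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class GroupWriteSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical two-column schema; replace with the real one.
        MessageType schema = MessageTypeParser.parseMessageType(
            "message example { required int64 id; required binary name (UTF8); }");

        // The factory creates empty Group records that match the schema.
        SimpleGroupFactory factory = new SimpleGroupFactory(schema);

        // Hypothetical output location on the local file system.
        Path out = new Path("example.parquet");

        // try-with-resources closes the writer, which flushes the last
        // row group and writes the file footer.
        try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(out)
                .withType(schema)
                .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
                .withCompressionCodec(CompressionCodecName.UNCOMPRESSED)
                .build()) {
            for (long i = 0; i < 1_000_000L; i++) {
                // Fill one record field by field, then hand it to the writer.
                Group row = factory.newGroup()
                    .append("id", i)
                    .append("name", "row-" + i);
                writer.write(row);
            }
        }
    }
}
```

The writer is built once and reused for every row. Whether this is faster than the approach in the question depends on the data and the setup, so it is worth measuring both.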

I'd love to know of a faster way (more columns x rows per second per thread).
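To compare approaches on that metric, it helps to put numbers on it. The sketch below uses only the Java standard library; the class and method names are hypothetical. It converts the question's figures (about 10 MB in about 90 seconds, so roughly 0.11 MB/s) into a throughput and shows how to time any write loop on one thread.

```java
public class ThroughputSketch {
    /** Cells (rows x columns) processed per second by a single thread. */
    public static double cellsPerSecond(long rows, int columns, long elapsedNanos) {
        double seconds = elapsedNanos / 1e9;
        return (double) rows * columns / seconds;
    }

    /** Plain data-volume throughput in megabytes per second. */
    public static double megabytesPerSecond(double megabytes, double seconds) {
        return megabytes / seconds;
    }

    /** Runs the given work once and returns the elapsed wall-clock time in nanoseconds. */
    public static long timeNanos(Runnable work) {
        long start = System.nanoTime();
        work.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Figures from the question: ~10 MB written in ~1.5 minutes.
        System.out.println("MB/s: " + megabytesPerSecond(10, 90));

        // Placeholder workload; swap in the real write loop to measure it.
        long rows = 1_000_000L;
        int columns = 10;
        long elapsed = timeNanos(() -> {
            long sink = 0;
            for (long i = 0; i < rows * columns; i++) {
                sink += i;
            }
            if (sink == 42) {
                System.out.println(); // keeps the loop from being optimized away
            }
        });
        System.out.println("cells/s: " + cellsPerSecond(rows, columns, elapsed));
    }
}
```

A single timed run is noisy because of JIT warm-up, so repeating the measurement a few times and discarding the first runs gives a fairer comparison.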

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 wangd