Based on file read and write speeds, which amongst ORC, Parquet & AVRO is best suited for each scenario? [closed]

I have been working with the Spark and Hadoop ecosystem for some years, but I never bothered to ask my architects why a particular file format was chosen; I simply waited for them to explain it to the team and the developers. I am seeing the effect of that dereliction of questioning now.

I have some background on the ORC file format: the data is arranged in stripes, each stripe carries index data and some metadata for each column, and the file footer contains column-level aggregates (count, min, max, and sum). Their Confluence page is well documented and easy to understand.

Based on this minimal knowledge, I can understand that ORC can provide better read speeds.
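To make my current understanding concrete, here is a minimal PySpark sketch of the kind of read I have in mind (the /tmp/events_orc path and the id/status/amount columns are just made-up examples). My understanding is that because ORC is columnar and keeps min/max statistics in stripe metadata and the footer, Spark can prune unused columns and push the filter down so stripes that cannot match are skipped:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("orc-read-demo").getOrCreate()

    # Write a small sample dataset as ORC (illustrative columns: id, status, amount).
    df = spark.createDataFrame(
        [(1, "open", 10.0), (2, "closed", 25.5), (3, "open", 7.25)],
        ["id", "status", "amount"],
    )
    df.write.mode("overwrite").orc("/tmp/events_orc")

    # Read back only one column with a selective filter; the physical plan
    # should list PushedFilters, i.e. the predicate can be checked against
    # ORC stripe statistics instead of scanning every row.
    reread = spark.read.orc("/tmp/events_orc")
    reread.filter(reread.status == "open").select("amount").explain()

Please correct me if this mental model of the read path is wrong.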

In the same way, can anyone explain which file format is best suited for writes?

I have seen many articles that simply say ORC is good for reads, Parquet is good for writes, and if you have nested data, go for AVRO, without explaining the reasons behind it.

Could anyone take on the herculean task of explaining which one amongst ORC, Parquet & AVRO is best suited for reads/writes, and the reasons behind the argument?

Any help is massively appreciated.



Solution 1:[1]

Until you have a real performance need, just be consistent. Parquet and ORC fight for the top spot and each has trade-offs, but using either of them is such a boost over uncompressed or unsplittable formats.

It doesn't matter whether you use Parquet or ORC until you are really looking to squeeze out performance. Getting there involves a lot of optimization work, where you have to identify the bottleneck that is actually the issue. That's when you'll do some reading and find that you need to change formats to use feature 'X'. Until you reach that stage, just pick one and you are good.
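As a rough sketch of what "pick one and be consistent" can look like in Spark (the STORAGE_FORMAT constant and the /tmp/events_table path are illustrative names, not something your project must use), keep the format as a single configuration value so every job writes the same way and switching formats later is a one-line change:

    from pyspark.sql import SparkSession

    # Hypothetical project-wide setting; flip to "orc" if profiling ever says so.
    STORAGE_FORMAT = "parquet"

    spark = SparkSession.builder.appName("format-choice-demo").getOrCreate()

    df = spark.range(1000).withColumnRenamed("id", "event_id")

    # The write/read API is the same either way; only the format string differs.
    df.write.mode("overwrite").format(STORAGE_FORMAT).save("/tmp/events_table")
    spark.read.format(STORAGE_FORMAT).load("/tmp/events_table").show(5)

Because the DataFrame reader/writer API is format-agnostic, being consistent now doesn't lock you in if you later find a bottleneck that one format handles better.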

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Matt Andruff