What is open cost in bytes in Spark?

What does the property spark.sql.files.openCostInBytes do?

This is the definition from the official documentation:

The estimated cost to open a file, measured by the number of bytes could be scanned in the same time. This is used when putting multiple files into a partition. It is better to over-estimated, then the partitions with small files will be faster than partitions with bigger files (which is scheduled first). This configuration is effective only when using file-based sources such as Parquet, JSON and ORC.

But I still don't get it. Can anyone explain, with a small example, why and where it's useful?
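One way to read the definition is as a padding term in a greedy packing loop: when Spark groups files into a read partition, each file counts as its real size plus the open cost, so many tiny files fill a partition sooner than their raw bytes suggest. The sketch below is a simplified pure-Python model of that idea, not Spark's actual code. The function name `pack` is hypothetical, the 128 MB limit stands in for spark.sql.files.maxPartitionBytes, and the model assumes no file is larger than the limit. It also leaves out how Spark may lower the limit based on total input size and parallelism.

```python
# Simplified model of packing files into read partitions.
# Assumption: each file is charged its size plus a fixed open cost,
# and files are packed greedily, largest first. `pack` is a
# hypothetical helper for illustration, not a Spark API.

MB = 1024 * 1024


def pack(file_sizes, max_split_bytes, open_cost_bytes):
    """Group file sizes into partitions, charging open_cost_bytes per file."""
    partitions = []
    current = []
    current_size = 0
    for size in sorted(file_sizes, reverse=True):
        # Close the current partition if this file would overflow it.
        if current and current_size + size > max_split_bytes:
            partitions.append(current)
            current = []
            current_size = 0
        # The open cost is added on top of the real file size.
        current_size += size + open_cost_bytes
        current.append(size)
    if current:
        partitions.append(current)
    return partitions


small_files = [1 * MB] * 100  # 100 files of 1 MB each

print(len(pack(small_files, 128 * MB, 0)))        # 1
print(len(pack(small_files, 128 * MB, 4 * MB)))   # 4
print(len(pack(small_files, 128 * MB, 32 * MB)))  # 25
```

With an open cost of 0, all 100 small files land in a single partition, so one task pays the cost of opening every file. With 4 MB per file, the same input spreads over 4 partitions, and a larger value spreads it further. Under this model, raising the open cost trades fewer file opens per task for more, smaller tasks.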

apache-spark

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
