Best way to benchmark Spark reading time

What is the best way to benchmark the reading time of Spark?

    // Read the raw Avro files as binary streams
    val rdd = spark.sparkContext.binaryFiles(s"$Path//$partitionColumn=$partitionId/*.avro")
    // Kryo encoder needed to build a Dataset of (path, stream) pairs
    implicit val streamEncoder: Encoder[(String, PortableDataStream)] = Encoders.kryo[(String, PortableDataStream)]
    spark.createDataset(rdd)

I am using Spark 2.2.



Solution 1:[1]

I suggest using this library: https://github.com/LucaCanali/sparkMeasure.

Check the examples available in the README file, such as this Databricks notebook.

For instance, you could measure the time taken to read your Avro files using the runAndMeasure function:

taskMetrics.runAndMeasure(spark.createDataset(rdd).count())
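A slightly fuller sketch of the setup, assuming sparkMeasure is on the classpath and reusing the rdd from the question (the TaskMetrics factory call follows sparkMeasure's README; the .count() is there only to force Spark's lazy evaluation so the files are actually read):

    import ch.cern.sparkmeasure.TaskMetrics

    // Instrument the active SparkSession (API as documented in sparkMeasure's README)
    val taskMetrics = TaskMetrics(spark)

    // runAndMeasure executes the closure and reports aggregated task metrics
    // (elapsed time, executor run time, bytes read, etc.)
    taskMetrics.runAndMeasure(spark.createDataset(rdd).count())

Because Spark transformations are lazy, benchmarking anything short of an action (count, collect, write, ...) measures only plan construction, not the read itself.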

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
