How to benchmark PySpark queries?

I have a simple PySpark script and I would like to benchmark each section.

# section 1: prepare data
df = spark.read.option(...).csv(...)
df.registerTempTable("MyData")

# section 2: Dataframe API
avg_earnings = df.agg({"earnings": "avg"}).show()

# section 3: SQL
avg_earnings = spark.sql("""SELECT AVG(earnings)
                            FROM MyData""").show()

To generate reliable measurements, one would need to run each section multiple times. My solution, using the Python time module, looks like this:

import time
for _ in range(iterations):
    t1 = time.time()
    df = spark.read.option(...).csv(...)
    df.registerTempTable("MyData")

    t2 = time.time()
    avg_earnings = df.agg({"earnings": "avg"}).show()

    t3 = time.time()
    avg_earnings = spark.sql("""SELECT AVG(earnings)
                            FROM MyData""").show()
    t4 = time.time()
   
    write_to_csv(t1, t2, t3, t4)

My question is: how would one benchmark each section? Would you use the time module as well? How would one disable caching for PySpark?


Edit: Plotting the first 5 iterations of the benchmark shows that PySpark is doing some form of caching. [plot: benchmark of section 1]

How can I disable this behaviour ?



Solution 1:[1]

First, you can't benchmark using show(); it only computes and returns the top 20 rows.
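As a rough illustration, here is a minimal sketch reusing the df from the question (the "noop" write format is only available in Spark 3.0+):

# show() brings at most 20 rows back to the driver; for plans that allow it,
# Spark can stop reading early, so its wall-clock time can be misleading.
df.show()

# count() is an action that touches every row, so its timing reflects
# the cost of evaluating the whole DataFrame.
df.count()

# Alternative for Spark 3.0+: a "noop" write evaluates every row without
# collecting results or writing any output.
df.write.format("noop").mode("overwrite").save()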

Second, in general, the PySpark DataFrame API and Spark SQL share the same Catalyst optimizer behind the scenes, so what you are doing (.agg(...) vs AVG(...)) is pretty much the same and won't show much difference.
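If you want to verify this, a quick check (reusing df and the MyData temp view from the question) is to compare the plans Catalyst produces; both should print essentially the same physical plan:

# Both the DataFrame API and the SQL query go through the Catalyst optimizer,
# so the printed physical plans should be essentially identical.
df.agg({"earnings": "avg"}).explain()
spark.sql("SELECT AVG(earnings) FROM MyData").explain()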

Third, benchmarking is usually only meaningful if your data is really big or your operation takes much longer than expected. Otherwise, if the runtime difference is only a couple of minutes, it doesn't really matter.

Anyway, to answer your question:

  1. Yes, there is nothing wrong with using time.time() to measure.
  2. You should use count() instead of show(). count() forces Spark to compute your entire dataset (see the sketch after this list).
  3. You don't have to worry about caching if you don't call it. Spark won't cache anything unless you ask for it. In fact, you shouldn't cache at all when benchmarking.
  4. You should also use static allocation instead of dynamic allocation. If you're using Databricks or EMR, use a fixed number of workers and don't auto-scale.
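Putting these points together, here is a minimal sketch of a benchmark loop under those assumptions. The file name benchmark_results.csv and the value of iterations are just placeholders, the read options are left elided as in the question, and createOrReplaceTempView is the non-deprecated equivalent of registerTempTable:

import csv
import time

from pyspark.sql import SparkSession

# Point 4: disable dynamic allocation so every iteration runs with
# the same, fixed amount of resources.
spark = (SparkSession.builder
         .appName("benchmark")
         .config("spark.dynamicAllocation.enabled", "false")
         .getOrCreate())

iterations = 5  # placeholder; use however many runs you need
results = []

for i in range(iterations):
    # Point 3: make sure nothing is left cached from a previous iteration.
    spark.catalog.clearCache()

    # Section 1: prepare data (read options elided as in the question).
    t1 = time.time()
    df = spark.read.option(...).csv(...)
    df.createOrReplaceTempView("MyData")

    # Section 2: DataFrame API. count() forces full evaluation (point 2).
    t2 = time.time()
    df.agg({"earnings": "avg"}).count()

    # Section 3: SQL.
    t3 = time.time()
    spark.sql("SELECT AVG(earnings) FROM MyData").count()
    t4 = time.time()

    results.append((i, t2 - t1, t3 - t2, t4 - t3))

# One row per iteration: section durations in seconds.
with open("benchmark_results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["iteration", "section1_s", "section2_s", "section3_s"])
    writer.writerows(results)

Timing with time.time() around each block is fine (point 1); what matters is that each timed section ends with an action that evaluates the whole query.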

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: pltc