What decides the number of jobs in a Spark application?
Previously, my understanding was that an action creates one job in a Spark application. But consider the following scenario, where I am just creating a DataFrame using the .range() method:
df = spark.range(10)
Since my spark.default.parallelism is 10, the resulting DataFrame has 10 partitions. Now I perform the .show() and .count() actions on the DataFrame:
df.show()
df.count()
When I check the Spark history server, I see 3 jobs for .show() and 1 job for .count().
Why are there 3 jobs for the .show() method?
I have read somewhere that .show() eventually calls .take() internally, which iterates through partitions, and that this iteration decides the number of jobs. But I did not understand that part. What exactly decides the number of jobs?
Solution 1:[1]
Similar questions have been asked many times on Stack Overflow. For example:
- Why is Spark creating multiple jobs for one action
- How Spark invokes number of JOBS when action is triggered?
The reason becomes clear after reading Spark's source code.
Background knowledge: RDD is the fundamental data structure of Spark, so Dataset (and DataFrame) also use the RDD API at runtime.
The call stack is: show() calls showString(), and showString() -> getRows() -> take(n) -> head(n). Ultimately this leads to the take(n) logic on the underlying RDD, whose core loop looks like this:
while (buf.size < num && partsScanned < totalParts) {
  ...
  // run one job over the next batch of partitions `p`
  val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p)
  res.foreach(buf ++= _.take(num - buf.size))
  partsScanned += p.size
}
The default number of partitions depends on the environment (12 on my machine; 10 in your case), and depending on the parameter n, take may launch several jobs: it first scans a single partition, and if that does not yield enough rows, it rescans progressively more partitions (scaling up by spark.rdd.limit.scaleUpFactor, default 4) until it has collected n rows or exhausted all partitions. With the default numRows of 20, show() fetches 21 rows (one extra to detect truncation), and spark.range(10) puts one row in each of the 10 partitions, so take needs 3 passes: 1 partition, then 4 more, then the remaining 5.
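The loop above can be sketched in plain Python. This is a deliberately simplified model of the partition-scanning logic (the real implementation also estimates how many partitions to try from the rows collected so far); names like scale_up_factor mirror spark.rdd.limit.scaleUpFactor but are assumptions of this sketch, not Spark's API:

```python
def simulated_take(partitions, num, scale_up_factor=4):
    """Simplified model of RDD.take(num): returns (rows, jobs launched)."""
    buf = []
    parts_scanned = 0
    jobs = 0
    total_parts = len(partitions)
    while len(buf) < num and parts_scanned < total_parts:
        # First pass scans 1 partition; each retry scans up to
        # scale_up_factor times as many partitions as already scanned.
        if parts_scanned == 0:
            num_parts_to_try = 1
        else:
            num_parts_to_try = parts_scanned * scale_up_factor
        end = min(parts_scanned + num_parts_to_try, total_parts)
        p = range(parts_scanned, end)
        jobs += 1  # each sc.runJob over the batch `p` is one Spark job
        for i in p:
            buf.extend(partitions[i][: num - len(buf)])
        parts_scanned += len(p)
    return buf, jobs

# spark.range(10) with 10 partitions: one row per partition.
parts = [[i] for i in range(10)]
# show() displays 20 rows but fetches 21 to detect truncation -> take(21).
rows, jobs = simulated_take(parts, 21)
print(jobs)  # 3 jobs: scan 1 partition, then 4 more, then the remaining 5
```

Running the model reproduces the asker's observation: job 1 scans partition 0 (1 row), job 2 scans partitions 1-4 (5 rows total), and job 3 scans partitions 5-9, after which every partition has been visited and the loop stops with 3 jobs.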
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow

