What decides the number of jobs in a Spark application?
Previously, my understanding was that an action creates one job in a Spark application. But consider the following scenario, where I am just creating a DataFrame using the .range() method:
df = spark.range(10)
Since my spark.default.parallelism is 10, the resulting DataFrame has 10 partitions. Now I perform the .show() and .count() actions on the DataFrame:
df.show()
df.count()
When I check the Spark history server, I see 3 jobs for .show() and 1 job for .count().
Why are there 3 jobs for the .show() method?
I have read somewhere that .show() eventually calls .take() internally, which iterates through partitions, and that this iteration decides the number of jobs. But I did not understand that part. What exactly decides the number of jobs?
Solution 1:[1]
Similar questions have been asked many times on Stack Overflow. For example:
- Why is Spark creating multiple jobs for one action
- How Spark invokes number of JOBS when action is triggered?
The reason becomes clear after reading Spark's source code.
Background knowledge: RDD is the fundamental data structure of Spark, so Dataset (and DataFrame) also use the RDD API at runtime.
The call stack is: show() calls showString(), and showString() -> getRows() -> take(n) -> head(n). Ultimately this leads to the take(n) logic on the underlying RDD, whose core loop looks like this:
while (buf.size < num && partsScanned < totalParts) {
  ...
  // run one job over the next batch of partitions `p`
  val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p)
  res.foreach(buf ++= _.take(num - buf.size))
  partsScanned += p.size
}
The default number of partitions depends on the environment (12 on my machine; 10 in your case), and depending on the parameter n, take may launch several jobs: it first scans a single partition, and if that does not yield enough rows, it rescans progressively more partitions (scaling up by spark.rdd.limit.scaleUpFactor, default 4) until it has collected n rows or exhausted all partitions. With the default numRows of 20, show() fetches 21 rows (one extra to detect truncation), and spark.range(10) puts one row in each of the 10 partitions, so take needs 3 passes: 1 partition, then 4 more, then the remaining 5.
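The loop above can be sketched in plain Python. This is a deliberately simplified model of the partition-scanning logic (the real implementation also estimates how many partitions to try from the rows collected so far); names like scale_up_factor mirror spark.rdd.limit.scaleUpFactor but are assumptions of this sketch, not Spark's API:

```python
def simulated_take(partitions, num, scale_up_factor=4):
    """Simplified model of RDD.take(num): returns (rows, jobs launched)."""
    buf = []
    parts_scanned = 0
    jobs = 0
    total_parts = len(partitions)
    while len(buf) < num and parts_scanned < total_parts:
        # First pass scans 1 partition; each retry scans up to
        # scale_up_factor times as many partitions as already scanned.
        if parts_scanned == 0:
            num_parts_to_try = 1
        else:
            num_parts_to_try = parts_scanned * scale_up_factor
        end = min(parts_scanned + num_parts_to_try, total_parts)
        p = range(parts_scanned, end)
        jobs += 1  # each sc.runJob over the batch `p` is one Spark job
        for i in p:
            buf.extend(partitions[i][: num - len(buf)])
        parts_scanned += len(p)
    return buf, jobs

# spark.range(10) with 10 partitions: one row per partition.
parts = [[i] for i in range(10)]
# show() displays 20 rows but fetches 21 to detect truncation -> take(21).
rows, jobs = simulated_take(parts, 21)
print(jobs)  # 3 jobs: scan 1 partition, then 4 more, then the remaining 5
```

Running the model reproduces the asker's observation: job 1 scans partition 0 (1 row), job 2 scans partitions 1-4 (5 rows total), and job 3 scans partitions 5-9, after which every partition has been visited and the loop stops with 3 jobs.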
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow

