Spark: Different results of different uses of map for the same purpose
I wanted to test the order in which the elements of an RDD are processed, so I wrote two versions of the code, shown below.
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
  val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("OperatorPara")
  val sc = new SparkContext(sparkConf)
  val rdd: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 4)

  // version 1
  val rdd2: RDD[Int] = rdd.map { x: Int =>
    println(s">>> $x")
    x
  }
  val rdd3: RDD[Int] = rdd2.map { x: Int =>
    println(s"### $x")
    x
  }
  rdd3.collect()

  // version 2
  rdd.map((x: Int) => { println(s"+++ $x"); x }).map((x: Int) => { println(s"--- $x"); x }).collect()

  sc.stop()
}
The results look like this:
(version 1)
>>> 7
>>> 5
### 7
>>> 1
>>> 3
### 1
>>> 8
### 5
### 8
>>> 2
### 2
### 3
>>> 9
>>> 6
### 9
>>> 4
### 6
### 4
(version 2)
+++ 7
+++ 3
--- 3
+++ 4
--- 4
--- 7
+++ 8
--- 8
+++ 9
--- 9
+++ 5
--- 5
+++ 1
--- 1
+++ 2
--- 2
+++ 6
--- 6
According to the results, version 1 tended to execute the two maps separately, while version 2 tended to execute them back to back for each element. I ran the code several times and the results did not differ much.
My environment is Scala 2.11.12, Spark 2.3.2, and Hadoop 3.2.2.
I wonder whether there is a real difference in how the two versions execute, or whether it is just a coincidence. Thanks a lot!
Solution 1:[1]
You have to consider that Spark distributes the work across your cores. When you create an RDD from a collection using makeRDD or parallelize, Spark splits it into partitions (four in your code) and builds a DAG to process it; depending on the number of available cores, several tasks are executed in parallel. So the output you obtain is not deterministic. If you set:
setMaster("local[1]")
you will see the data in order, because only one task runs at a time even though the four partitions are still created. Try it and compare the output.
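To make the parallelism visible, here is a minimal sketch (the object name PartitionOrderDemo is mine) that tags every println with the ID of the partition whose task produced it, using Spark's TaskContext. Run it once with local[*] and once with local[1] to compare:

import org.apache.spark.{SparkConf, SparkContext, TaskContext}

object PartitionOrderDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("PartitionOrderDemo")
    val sc = new SparkContext(conf)

    val rdd = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 4)

    // Tag every println with the partition (i.e. task) that produced it.
    rdd.map { x =>
      println(s">>> partition=${TaskContext.getPartitionId()} value=$x")
      x
    }.map { x =>
      println(s"### partition=${TaskContext.getPartitionId()} value=$x")
      x
    }.collect()

    sc.stop()
  }
}

With local[*] the lines from different partition IDs interleave arbitrarily, but within any single partition you should see >>> and ### alternate per element: the two maps are narrow transformations in the same stage, so Spark pipelines them and each element passes through both functions before the next one starts. That per-partition pattern is already visible in your version 1 output (e.g. >>> 7, ### 7, >>> 8, ### 8, >>> 9, ### 9), so the apparent difference between your two versions is just scheduling noise across tasks.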
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Emiliano Martinez |