Spark Scala program only prints first line in Google Dataproc
I have used SBT to build a jar file and have tried to expand beyond the initial "Hello, world!" example on Google Dataproc. When I submit the job, it only prints the first "Hello, world!" and does not print the reduce of the RDD (which should be 30) nor the "...after...". Of course, all three println calls work correctly if I run in spark-shell on my laptop. Any pointers would be great; I can't figure out what Google Dataproc is looking for.
```scala
import org.apache.spark.sql.SparkSession
import scala.collection.mutable.ListBuffer
import org.apache.log4j.{Level, Logger}

object HelloWorld {
  def main(args: Array[String]): Unit = {
    println("Hello, world!")
    val sc = SparkSession.builder().master("local").getOrCreate().sparkContext
    val rdd = sc.parallelize(Array(5, 10, 15))
    println(rdd.reduce(_ + _))
    println("...after...")
  }
}
```
I'm not sure if it's needed, but my build.sbt file is below:
```scala
name := "HelloWorld"
version := "0.1"
scalaVersion := "2.12.3"

// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.0.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-graphx
libraryDependencies += "org.apache.spark" %% "spark-graphx" % "3.0.1"

artifactName := { (sv: ScalaVersion, module: ModuleID, artifact: Artifact) => "HelloWorld.jar" }
```
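As an aside: a Dataproc cluster already ships with Spark on its classpath, so the Spark dependencies are commonly scoped `"provided"` to keep them out of the submitted jar and avoid version clashes with the cluster's own Spark. A sketch of what that would look like in this `build.sbt` (same versions as above, assuming the cluster runs a compatible Spark 3.x):

```scala
// Spark is supplied by the Dataproc cluster at runtime,
// so these jars do not need to be bundled into HelloWorld.jar.
libraryDependencies += "org.apache.spark" %% "spark-core"   % "3.0.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql"    % "3.0.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-graphx" % "3.0.1" % "provided"
```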
Solution 1:[1]
What is the default cluster manager in Google Dataproc? You can remove `.master("local")` from your code, so that when you don't supply a master during `spark-submit`, the job runs on the cluster's default manager (YARN on Dataproc) instead of a single local JVM on the driver node:

```scala
val sc = SparkSession.builder().getOrCreate().sparkContext
```
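Putting that change into the original program, a minimal sketch of the revised job (the `appName` is an assumed label, not something Dataproc requires):

```scala
import org.apache.spark.sql.SparkSession

object HelloWorld {
  def main(args: Array[String]): Unit = {
    println("Hello, world!")
    // No hard-coded master: spark-submit / Dataproc supplies it,
    // so the job runs under the cluster's YARN manager.
    val spark = SparkSession.builder()
      .appName("HelloWorld") // assumed name; choose your own
      .getOrCreate()
    val sc = spark.sparkContext
    val rdd = sc.parallelize(Array(5, 10, 15))
    println(rdd.reduce(_ + _)) // 30
    println("...after...")
    spark.stop()
  }
}
```

For local testing you can still run this without editing the code by passing `--master "local[*]"` to `spark-submit`, which keeps the master choice out of the program entirely.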
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Karthik |
