Run a PySpark application inside a Scala Spark application
Due to a specific requirement, I need to run a Python application from inside a Scala application.
Here is sample code for both applications.
Parent application (Scala):
import java.io.File
import java.nio.file.{Path => JavaPath}
import scala.sys.process._

import org.apache.commons.io.FileUtils
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

class SparkApplication(spark: SparkSession) {

  def run(): Unit = {
    val command = List(
      "spark-submit",
      s"--keytab $keytab",
      "--principal [email protected]",
      "--master yarn",
      "--deploy-mode cluster",
      s"$pyFile"
    ).mkString(" ")
    command.!!
  }

  // Working directory of the current YARN container.
  val containerPath: JavaPath =
    new File(SparkFiles.getRootDirectory()).toPath.toAbsolutePath

  val pyFile: File = moveToContainer("spark.py")
  val keytab: File = moveToContainer("my_username.keytab")

  // Copies a resource bundled with the application into the container
  // directory, so the child spark-submit can reference it by local path.
  def moveToContainer(fileName: String): File = {
    val fileSourceStream = getClass.getResourceAsStream(fileName)
    val file = containerPath.resolve(fileName).toFile
    FileUtils.copyInputStreamToFile(fileSourceStream, file)
    file
  }
}
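For readers who don't follow Scala: the command that run() assembles is equivalent to this Python sketch. The paths are placeholders standing in for the files that moveToContainer copies into the container, not real locations.

```python
def build_submit_command(py_file, keytab):
    """Mirror of the Scala run() method: assemble the child spark-submit
    command line. py_file and keytab are placeholder paths."""
    parts = [
        "spark-submit",
        f"--keytab {keytab}",
        "--principal [email protected]",
        "--master yarn",
        "--deploy-mode cluster",
        py_file,  # the application file comes after all options
    ]
    return " ".join(parts)

command = build_submit_command("/container/spark.py", "/container/my_username.keytab")
print(command)
```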
Child application (spark.py):
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = (
        SparkSession.builder
        .appName("PysparkSubProcess")
        .enableHiveSupport()
        .getOrCreate()
    )

    rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6])
    print("RDD count")
    print(rdd.count())

    df = spark.sql("select 1 as col1")
    df.write.format("parquet").saveAsTable("default.temp_pyspark_table")
So, as you can see above, I can successfully run the Scala application in cluster mode, and it launches the Python application (spark.py) as a subprocess. Kerberos is also configured on our cluster, which is why I use my keytab file for authentication.
But that's not enough. While the Scala application has access to Hive, Hive remains unavailable to the Python application: I can't save my test table from spark.py.
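One thing worth checking (an assumption on my part, not something confirmed in the question): whether the child job simply never receives the Hive client configuration. A sketch of extending the child spark-submit to ship hive-site.xml explicitly via --files; all paths are placeholders.

```python
def build_child_command(py_file, keytab, hive_site):
    """Hypothetical variant of the child spark-submit command that ships
    the parent's Hive configuration (placeholder path) so the child job
    can locate the metastore."""
    return " ".join([
        "spark-submit",
        f"--keytab {keytab}",
        "--principal [email protected]",
        "--master yarn",
        "--deploy-mode cluster",
        f"--files {hive_site}",  # options must precede the application file
        py_file,
    ])
```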
So the question, essentially, is: can the Python application reuse the Scala application's authentication, so that I don't have to manage the keytab file and the Hive configuration for the Python application separately?
I have heard that there are delegation tokens that are created and stored on the driver. Is it possible to reuse such tokens (for example via --conf spark.executorEnv.HADOOP_TOKEN_FILE_LOCATION)? Or are there other workarounds?
I still haven't been able to solve this problem.
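On the token idea: HADOOP_TOKEN_FILE_LOCATION is the environment variable through which Hadoop's UserGroupInformation picks up a delegation-token file. Whether forwarding it actually works here is exactly what the question asks; this sketch only shows how the extra option could be assembled (the helper name is mine):

```python
def token_forwarding_conf(parent_env):
    """Hypothetical helper: build extra spark-submit options pointing the
    child's executors at the parent's delegation-token file, if one exists
    in the parent driver's environment."""
    token_file = parent_env.get("HADOOP_TOKEN_FILE_LOCATION")
    if token_file is None:
        return []  # no token file in the parent environment; add nothing
    return [f"--conf spark.executorEnv.HADOOP_TOKEN_FILE_LOCATION={token_file}"]
```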
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow