Run a PySpark application inside a Scala Spark application
Due to a specific requirement, I need to run a Python application from inside a Scala application.
Here is sample code for both applications.
Parent application (Scala):
import java.io.File
import java.nio.file.{Path => JavaPath}
import scala.sys.process._

import org.apache.commons.io.FileUtils
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

class SparkApplication(spark: SparkSession) {

  def run(): Unit = {
    val command = List(
      "spark-submit",
      s"--keytab $keytab",
      "--principal [email protected]",
      "--master yarn",
      "--deploy-mode cluster",
      s"$pyFile"
    ).mkString(" ")
    command.!!
  }

  // Working directory of the current YARN container.
  val containerPath: JavaPath =
    new File(SparkFiles.getRootDirectory()).toPath.toAbsolutePath

  val pyFile: File = moveToContainer("spark.py")
  val keytab: File = moveToContainer("my_username.keytab")

  // Copies a resource bundled with the application into the container
  // directory, so the child spark-submit can reference it by local path.
  def moveToContainer(fileName: String): File = {
    val fileSourceStream = getClass.getResourceAsStream(fileName)
    val file = containerPath.resolve(fileName).toFile
    FileUtils.copyInputStreamToFile(fileSourceStream, file)
    file
  }
}
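For readers who don't follow Scala: the command that run() assembles is equivalent to this Python sketch. The paths are placeholders standing in for the files that moveToContainer copies into the container, not real locations.

```python
def build_submit_command(py_file, keytab):
    """Mirror of the Scala run() method: assemble the child spark-submit
    command line. py_file and keytab are placeholder paths."""
    parts = [
        "spark-submit",
        f"--keytab {keytab}",
        "--principal [email protected]",
        "--master yarn",
        "--deploy-mode cluster",
        py_file,  # the application file comes after all options
    ]
    return " ".join(parts)

command = build_submit_command("/container/spark.py", "/container/my_username.keytab")
print(command)
```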
Child application (spark.py):
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = (
        SparkSession.builder
        .appName("PysparkSubProcess")
        .enableHiveSupport()
        .getOrCreate()
    )

    rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6])
    print("RDD count")
    print(rdd.count())

    df = spark.sql("select 1 as col1")
    df.write.format("parquet").saveAsTable("default.temp_pyspark_table")
So, as you can see above, I can successfully run the Scala application in cluster mode, and it launches the Python application (spark.py) as a subprocess. Kerberos is also configured on our cluster, which is why I use my keytab file for authentication.
But that's not enough. While the Scala application has access to Hive, Hive remains unavailable to the Python application: I can't save my test table from spark.py.
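One thing worth checking (an assumption on my part, not something confirmed in the question): whether the child job simply never receives the Hive client configuration. A sketch of extending the child spark-submit to ship hive-site.xml explicitly via --files; all paths are placeholders.

```python
def build_child_command(py_file, keytab, hive_site):
    """Hypothetical variant of the child spark-submit command that ships
    the parent's Hive configuration (placeholder path) so the child job
    can locate the metastore."""
    return " ".join([
        "spark-submit",
        f"--keytab {keytab}",
        "--principal [email protected]",
        "--master yarn",
        "--deploy-mode cluster",
        f"--files {hive_site}",  # options must precede the application file
        py_file,
    ])
```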
So the question, essentially, is: can the Python application reuse the Scala application's authentication, so that I don't have to manage the keytab file and the Hive configuration for the Python application separately?
I have heard that there are delegation tokens that are created and stored on the driver. Is it possible to reuse such tokens (for example via --conf spark.executorEnv.HADOOP_TOKEN_FILE_LOCATION)? Or are there other workarounds?
I still haven't been able to solve this problem.
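On the token idea: HADOOP_TOKEN_FILE_LOCATION is the environment variable through which Hadoop's UserGroupInformation picks up a delegation-token file. Whether forwarding it actually works here is exactly what the question asks; this sketch only shows how the extra option could be assembled (the helper name is mine):

```python
def token_forwarding_conf(parent_env):
    """Hypothetical helper: build extra spark-submit options pointing the
    child's executors at the parent's delegation-token file, if one exists
    in the parent driver's environment."""
    token_file = parent_env.get("HADOOP_TOKEN_FILE_LOCATION")
    if token_file is None:
        return []  # no token file in the parent environment; add nothing
    return [f"--conf spark.executorEnv.HADOOP_TOKEN_FILE_LOCATION={token_file}"]
```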
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow