Triggering spark-submit in local mode from a remote server
I have a PySpark application that I am currently running in local mode. Going forward, this needs to be deployed to production and the PySpark job needs to be triggered from the client's end. What changes do I need to make on my end so that it can be triggered from the client's end? I am using Spark 3.2 and Python 3.6.
Currently I am executing this job by firing the spark-submit command from the same server where Spark is installed:
spark-submit --jars /app/some_path/lib/db.jar,/app/some_path/lib/thirdparthy.jar spark_job.py
1. First, I think I need to specify the jars in my SparkSession:
spark = SparkSession.builder.master("local[*]").appName('test1').config("spark.jars", "/app/some_path/lib/db.jar,/app/some_path/lib/thirdparthy.jar").getOrCreate()
This didn't work and errors out with "java.lang.ClassNotFoundException". I think it is not able to find the jars, though it works fine when I pass them via spark-submit. What is the right way of passing jars in a Spark session?
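I was also wondering whether the jars need to be on the driver classpath as well. Something along these lines is what I had in mind (untested sketch; the paths are the same ones I pass to spark-submit, and spark.driver.extraClassPath expects a ":"-separated classpath on Linux):
from pyspark.sql import SparkSession

jars = "/app/some_path/lib/db.jar,/app/some_path/lib/thirdparthy.jar"
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("test1")
    # register the jars with the Spark session
    .config("spark.jars", jars)
    # also put them on the driver's classpath, in case that is what is missing
    .config("spark.driver.extraClassPath", jars.replace(",", ":"))
    .getOrCreate()
)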
2. I don't have a Spark cluster. This is more of an R&D project that processes a small dataset, so I don't think I need a cluster yet. How shall I run spark-submit from a remote server at the client's end? Shall I change the master from local to the IP of my server where Spark is installed?
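To make the question concrete, my assumption is that if I started a standalone master on my server (with start-master.sh), the session would point at it roughly like this (sketch only; spark://server-ip:7077 is a placeholder using the default standalone port):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # placeholder standalone master URL instead of local[*]
    .master("spark://server-ip:7077")
    .appName("test1")
    .getOrCreate()
)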
3. How can the client trigger this job? Shall I write a Python script with a subprocess that triggers this spark-submit and give it to the client, so that they can execute this Python file at a specific time from their workflow manager tool?
import subprocess

# Build the spark-submit command ("server-ip" is a placeholder for the Spark server)
spark_submit_str = "spark-submit --master server-ip spark_job.py"

# Run the command in a shell and capture its output
process = subprocess.Popen(spark_submit_str, stdout=subprocess.PIPE,
                           stderr=subprocess.PIPE, universal_newlines=True, shell=True)
stdout, stderr = process.communicate()
if process.returncode != 0:
    print(stderr)
print(stdout)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow