Run a shell script stored on HDFS from a PySpark script
I have a requirement to call a shell script that is stored at an HDFS location and run it from a PySpark script.
My code is something like this,
bashcommand = "hadoop fs -cat {0} |exec sh -s {1}".format(shellscript, hqlfile)
subprocess.Popen(bashcommand.split(), stdout=subprocess.PIPE)
# hqlfile is my parameter for shellscript
The subprocess.Popen is not working here.
Any help is appreciated. Note: I am running the PySpark script with spark-submit.
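One thing worth checking in the `subprocess.Popen` call above: without `shell=True`, `Popen` does not interpret `|`, so splitting the string hands `|exec`, `sh`, `-s` and the HQL path to `hadoop` as extra arguments. A minimal sketch of chaining the two processes explicitly follows. The helper names `build_pipeline` and `run_pipeline` are hypothetical, and the paths in the usage comment are the ones from the update below. This only addresses the pipe handling and is not a confirmed fix for the whole problem.

```python
import subprocess

# What split() produces for the original command string: the pipe is just text.
tokens = "hadoop fs -cat a.sh |exec sh -s b.hql".split()
# tokens == ['hadoop', 'fs', '-cat', 'a.sh', '|exec', 'sh', '-s', 'b.hql']


def build_pipeline(script_path, script_arg):
    """Return argument lists equivalent to: hadoop fs -cat <script> | sh -s <arg>."""
    producer = ["hadoop", "fs", "-cat", script_path]
    consumer = ["sh", "-s", script_arg]
    return producer, consumer


def run_pipeline(producer, consumer):
    """Run `producer | consumer` without a shell; return (exit code, stdout, stderr)."""
    first = subprocess.Popen(producer, stdout=subprocess.PIPE)
    second = subprocess.Popen(
        consumer,
        stdin=first.stdout,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    first.stdout.close()  # lets `first` get SIGPIPE if `second` exits early
    out, err = second.communicate()
    first.wait()
    return second.returncode, out.decode(), err.decode()


# Usage (requires the hadoop CLI on PATH):
# producer, consumer = build_pipeline("/bin/test/ingest.sh", "/bin/test/hql/test.hql")
# code, out, err = run_pipeline(producer, consumer)
```

Passing the whole string with `shell=True` would also make the pipe work, at the cost of shell quoting concerns if the paths come from user input.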
##### Update
bashCommand='hadoop fs -cat /bin/test/ingest.sh|exec sh -s /bin/test/hql/test.hql'
This is my command which I am trying to execute using
os.system(bashCommand)
I have written the above code in a PySpark script, and I trigger that script through spark-submit.
My ingest.sh script contains
beeline -u "jdbc:hive2:************" -f $hql_file_path
The beeline command works perfectly fine when I run it on the edge node, and it also works when I run the shell script ingest.sh directly on the edge node. The issue only occurs when I trigger it through spark-submit.
code flow:
pyspark-->
bashCommand='hadoop fs -cat /bin/test/ingest.sh|exec sh -s /bin/test/hql/test.hql'
os.system(bashCommand)
shell script(ingest.sh)--->
beeline -u "jdbc:hive2:************" -f $hql_file_path
Error when the PySpark script is triggered:
22/04/23 17:37:26 WARN ipc.Client: Exception encountered while connecting to the server :
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby
sh: line 9: beeline: command not found
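Since `sh` reports the failure at line 9 of the script, the script contents do seem to reach the shell; the last line suggests that `beeline` is not on the `PATH` of the process that runs it. A small diagnostic sketch that could be called from the PySpark script just before `os.system` is shown here. The helper name `describe_command` is hypothetical, and this only reports what the process sees; it does not change anything.

```python
import os
import shutil


def describe_command(name, env=None):
    """Report where `name` resolves on PATH for the given environment mapping."""
    env = os.environ if env is None else env
    path = env.get("PATH", "")
    return {
        "command": name,
        "resolved": shutil.which(name, path=path),  # None if not found
        "PATH": path,
    }


# Usage inside the PySpark script:
# print(describe_command("beeline"))
```

If `resolved` comes back as `None` under spark-submit but not in an interactive session on the edge node, comparing the two `PATH` values should show which directory is missing.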
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
