How to force Spark workers to run a Python function
I want to run some data in Databricks (i.e. on a Spark cluster) through a C binary, and I'd like to do it as quickly as possible by having each worker node take on a part of the load. To do that, I have to distribute the binary to each of the nodes, which I'm attempting to do like so:
import os
import shutil

def copyFile(filepath):
    shutil.copyfile(f'/dbfs{filepath}', filepath)
    os.system(f'chmod u+x {filepath}')

# The cluster has 16 workers
sc.parallelize(range(16)).repartition(16).map(lambda s: copyFile("/tmp/fast_func"))
The idea is to create exactly 16 partitions in the hope that each of the 16 workers then runs copyFile() exactly once (the /dbfs directory is shared between them all, but I don't seem to be able to set execute permissions there). I then try to run the binary like so:
output = inputDf.rdd.repartition(16).map(lambda s: f"{s.SearchString},{s.DisplayName}").pipe("/tmp/fast_func").collect()
inputDf has two columns, SearchString and DisplayName, which I concatenate into a single string and pipe into the binary (which reads from stdin). Broadly speaking this works, but it sometimes fails with an error saying that the /tmp/fast_func file cannot be found, from which I assume that my attempt to force copyFile() to be run on each worker must not be correct.
How can I force this to work properly?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
