How to force Spark workers to run a Python function
I want to run some data in Databricks (i.e. on a Spark cluster) through a C binary, and I'd like to do it as quickly as possible by having each worker node take on a part of the load. To do that, I have to distribute the binary to each of the nodes, which I'm attempting to do like so:
import os
import shutil

def copyFile(filepath):
    shutil.copyfile(f'/dbfs{filepath}', filepath)
    os.system(f'chmod u+x {filepath}')

# The cluster has 16 workers
sc.parallelize(range(16)).repartition(16).map(lambda s: copyFile("/tmp/fast_func"))
The idea is to create exactly 16 partitions in the hope that each of the 16 workers then runs copyFile() exactly once (the /dbfs directory is shared between them all, but I don't seem to be able to set execute permissions there). I then try to run the binary like so:
output = inputDf.rdd.repartition(16).map(lambda s: f"{s.SearchString},{s.DisplayName}").pipe("/tmp/fast_func").collect()
inputDf has two columns, SearchString and DisplayName, which I concatenate into a single string and pipe into the binary (which reads from stdin). Broadly speaking this works, but it sometimes fails with an error saying that the /tmp/fast_func file cannot be found, from which I assume that my attempt to force copyFile() to be run on each worker must not be correct.
How can I force this to work properly?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
