ModuleNotFoundError seen after the first time a job is run on a Ray cluster
I'm running a script which imports a module from a file in the same directory. The first time I run the script after starting the cluster, it runs as expected. On every subsequent run I get the following error: ModuleNotFoundError: No module named 'ex_cls'
How do I get Ray to recognize modules I'm importing after the first run?
I am using Ray 1.11.0 on a Red Hat Linux cluster.
Here are my scripts. Both are located in the /home/ray_experiment directory:
--ex_main.py
import sys
sys.path.insert(0, '/home/ray_experiment')
from ex_cls import monitor_wrapper
import ray
ray.init(address='auto')
from ray.util.multiprocessing import Pool

def main():
    pdu_infos = range(10)
    with Pool() as pool:
        results = pool.map(monitor_wrapper, [pdu for pdu in pdu_infos])
    for pdu_info, result in zip(pdu_infos, results):
        print(pdu_info, result)

if __name__ == "__main__":
    main()
--ex_cls.py
import sys
from time import time, sleep
from random import randint
import collections
sys.path.insert(0, '/home/ray_experiment')

MonitorResult = collections.namedtuple('MonitorResult', 'key task_time')

def monitor_wrapper(args):
    start = time()
    rando = randint(0, 200)
    lst = []
    for i in range(10000 * rando):
        lst.append(i)
    pause = 1
    sleep(pause)
    return MonitorResult(args, time() - start)
-- Edit
I've found that by adding these two environment variables I no longer see the ModuleNotFoundError.
export PYTHONPATH="${PYTHONPATH}:/home/ray_experiment/"
export RAY_RUNTIME_ENV_WORKING_DIR_CACHE_SIZE_GB=0
Is there another solution that doesn't require disabling the working environment caching?
Solution 1:[1]
The issue here is that Ray's worker processes may run from a different working directory than your driver Python script. In fact, on a cluster, they may even run on different machines. This is compounded by the fact that Python looks for the module based on a relative path (to be precise, cloudpickle serializes definitions in other modules by reference), so each worker has to be able to import ex_cls on its own.
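To illustrate what "by reference" means here, the following minimal sketch uses the standard-library pickle module as a stand-in for cloudpickle. That substitution is an assumption made for illustration: for a function defined in an importable module, both store only the module name and the function name, not the function's code. json.dumps stands in for ex_cls.monitor_wrapper.

```python
import json
import pickle

# Pickling a function from an importable module stores only a reference
# (module name + attribute name), not the function's code.
payload = pickle.dumps(json.dumps)

# The payload contains the names needed to look the function up again.
print(b"json" in payload, b"dumps" in payload)

# Unpickling re-imports the module and fetches the attribute, so the
# receiving process must be able to import that module itself. If it
# cannot, loading fails with ModuleNotFoundError.
restored = pickle.loads(payload)
print(restored is json.dumps)
```

A worker that receives a reference to ex_cls.monitor_wrapper is in the same position: it has to find ex_cls.py on its own import path.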
The "intended" solution to this problem is to use runtime environments.
In particular, you should call ray.init(address='auto', runtime_env={"working_dir": "./"}) when starting Ray to ensure that the module is passed to other processes.
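As a rough sketch of how the driver script could set this up, assuming the directory that contains ex_cls.py is the one passed in: the helper names build_runtime_env and connect are hypothetical and exist only for this example, while runtime_env and its working_dir key are the Ray options mentioned above.

```python
import os


def build_runtime_env(project_dir):
    """Return a runtime_env dict that ships project_dir to the workers."""
    # working_dir is uploaded to the cluster and workers are started in
    # their copy of it, which is intended to let them find modules that
    # live in that directory (such as ex_cls.py).
    return {"working_dir": os.path.abspath(project_dir)}


def connect(project_dir="."):
    """Connect to an already running cluster (hypothetical helper)."""
    import ray  # imported here so the sketch loads without Ray installed

    ray.init(address="auto", runtime_env=build_runtime_env(project_dir))


# Usage in the driver, before creating the Pool:
#   connect("/home/ray_experiment")
```

Using an absolute path avoids depending on the directory the script happens to be launched from, which "./" would.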
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Alex |
