Worker process still alive after 0 seconds, killing
I submit two Dask worker containers through my job scheduler (PBS) like this:
#!/usr/bin/env bash
#PBS -N MyApp
#PBS -q my_queue
#PBS -l select=1:ncpus=1:mem=2GB
#PBS -l walltime=00:30:00
#PBS -m n
/.../bin/python -m distributed.cli.dask_worker tcp://scheduler:53815 --nanny --death-timeout 60
The first worker successfully connects to the scheduler:
distributed.nanny - INFO - Start Nanny at: 'tcp://...:48652'
distributed.worker - INFO - Start worker at: tcp://...:33401
distributed.worker - INFO - Listening to: tcp://...:33401
distributed.worker - INFO - dashboard at: ...:54725
distributed.worker - INFO - Waiting to connect to: tcp://...:48272
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 1.86 GiB
distributed.worker - INFO - Local Directory: /.../
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Registered to: tcp://...:48272
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
distributed.dask_worker - INFO - Exiting on signal 15
distributed.nanny - INFO - Closing Nanny at 'tcp://...:48652'
Terminated
(Signal 15 is expected. It is a plain SIGTERM, on Red Hat as on any other Linux, because I terminated the container myself before it finished.)
The problem is with the second worker.
Its container starts fine, but the worker never processes any Dask tasks.
The logs are as follows:
distributed.nanny - INFO - Start Nanny at: 'tcp://...:51682'
distributed.nanny - INFO - Closing Nanny at 'tcp://...:51682'
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
File "/.../site-packages/distributed/nanny.py", line 338, in start
response = await self.instantiate()
File "/.../site-packages/distributed/nanny.py", line 407, in instantiate
result = await asyncio.wait_for(
File "/.../asyncio/tasks.py", line 468, in wait_for
await waiter
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/.../asyncio/tasks.py", line 492, in wait_for
fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/.../site-packages/distributed/core.py", line 269, in _
await asyncio.wait_for(self.start(), timeout=timeout)
File "/.../asyncio/tasks.py", line 494, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/.../runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/.../runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/.../site-packages/distributed/cli/dask_worker.py", line 469, in <module>
go()
File "/.../site-packages/distributed/cli/dask_worker.py", line 465, in go
main()
File "/.../site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/.../site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/.../site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/.../site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/.../site-packages/distributed/cli/dask_worker.py", line 451, in main
loop.run_sync(run)
File "/.../site-packages/tornado/ioloop.py", line 530, in run_sync
return future_cell[0].result()
File "/.../site-packages/distributed/cli/dask_worker.py", line 445, in run
await asyncio.gather(*nannies)
File "/.../asyncio/tasks.py", line 691, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/.../site-packages/distributed/core.py", line 273, in _
raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 240 seconds
As you can see, the second worker never seems to start listening; only the nanny-related steps run.
Do you have any idea why the second worker never comes up?
Thank you
Edit: I get the same errors with HTCondor:
distributed.nanny - INFO - Start Nanny at: 'tcp://10.5.230.211:22967'
distributed.nanny - INFO - Closing Nanny at 'tcp://10.5.230.211:22967'
distributed.nanny - WARNING - Worker process still alive after 0 seconds, killing
distributed.dask_worker - INFO - End worker
Traceback (most recent call last):
File "/site-packages/distributed/nanny.py", line 338, in start
response = await self.instantiate()
File "/site-packages/distributed/nanny.py", line 407, in instantiate
result = await asyncio.wait_for(
File "/asyncio/tasks.py", line 466, in wait_for
await waiter
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/asyncio/tasks.py", line 490, in wait_for
return fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/site-packages/distributed/core.py", line 269, in _
await asyncio.wait_for(self.start(), timeout=timeout)
File "/asyncio/tasks.py", line 492, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/site-packages/distributed/cli/dask_worker.py", line 469, in <module>
go()
File "/site-packages/distributed/cli/dask_worker.py", line 465, in go
main()
File "/site-packages/click/core.py", line 1126, in __call__
return self.main(*args, **kwargs)
File "/site-packages/click/core.py", line 1051, in main
rv = self.invoke(ctx)
File "/site-packages/click/core.py", line 1393, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/site-packages/click/core.py", line 752, in invoke
return __callback(*args, **kwargs)
File "/site-packages/distributed/cli/dask_worker.py", line 451, in main
loop.run_sync(run)
File "/site-packages/tornado/ioloop.py", line 530, in run_sync
return future_cell[0].result()
File "/site-packages/distributed/cli/dask_worker.py", line 445, in run
await asyncio.gather(*nannies)
File "/asyncio/tasks.py", line 688, in _wrap_awaitable
return (yield from awaitable.__await__())
File "/site-packages/distributed/core.py", line 273, in _
raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 60 seconds
Solution 1:[1]
It works when the --no-dashboard option is passed to each dask-worker (see the linked comment):
https://github.com/dask/dask-jobqueue/issues/391#issuecomment-639257428
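Applied to the PBS job script from the question, the fix might look like the sketch below. It assumes the only change needed is appending --no-dashboard to the worker command; the scheduler address, the queue name and the elided interpreter path are copied unchanged from the question and must be adapted to your cluster.

```shell
#!/usr/bin/env bash
#PBS -N MyApp
#PBS -q my_queue
#PBS -l select=1:ncpus=1:mem=2GB
#PBS -l walltime=00:30:00
#PBS -m n

# Same worker command as in the question, with --no-dashboard appended so that
# the worker does not launch its own dashboard server.
# The interpreter path is elided in the original and is left as-is here.
/.../bin/python -m distributed.cli.dask_worker tcp://scheduler:53815 --nanny --death-timeout 60 --no-dashboard
```

The same flag should apply to the worker command submitted through HTCondor, since the error reported there is the same.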
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Klun |
