'Multi-node processing using DistributedDataParallel giving me a permission denied error
I'm trying to implement a multi-node job using Pytorch's DistributedDataParallel on a University supercomputer where I'm logging in via ssh using port 22. Following this tutorial, when I set MASTER_PORT=12340 or some other number on the SLURM script, I obviously get no response since there's nothing happening on it. If I set MASTER_PORT=22, I get permission denied when the code reaches the dist.init_process_group() method:
dist.init_process_group(backend=opt.dist_backend, init_method=opt.dist_url,
world_size=opt.world_size, rank=opt.rank)
GIVES ME
Traceback (most recent call last):
File "train_dist.py", line 262, in <module>
main()
File "train_dist.py", line 220, in main
world_size=opt.world_size, rank=opt.rank)
File "/home/miniconda3/envs/vit/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/miniconda3/envs/vit/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 232, in _env_rendezvous_handler
store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
File "/home/miniconda3/envs/vit/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 161, in _create_c10d_store
hostname, port, world_size, start_daemon, timeout, multi_tenant=True
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:22 (errno: 13 - Permission denied). The server socket has failed to bind to 0.0.0.0:22 (errno: 13 - Permission denied).
I have also tried to re-route the port 22 traffic to some other port (eg. 65000) but I also get permission denied for even attempting this rerouting. I'm not sure what else I can try to do at this point, anyone has any suggestions or is this something that I need to ask the administrator for?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
