Timeout in multi-machine training with PyTorch?

The following error occurs during multi-machine training with PyTorch:

RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:136] Timed out waiting 1800000ms for send operation to complete

I increased the timeout limit to 3 days (see the call below), but the same error still occurs.

How can I deal with this? Thanks~

import datetime

import torch.distributed as dist

# Initialize the process group with an extended timeout (the default is 30 minutes,
# i.e. the 1800000 ms seen in the error message).
dist.init_process_group(
    backend=args.dist_backend,
    init_method=args.dist_url,
    world_size=args.world_size, rank=args.rank,
    timeout=datetime.timedelta(days=3)
)
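For context, here is a minimal, self-contained sketch of how a call like this is typically wired up across machines. The env:// rendezvous, the environment-variable names, the gloo backend choice, and the example launch values are assumptions for illustration, not details from my actual setup.

# Minimal sketch (assumed setup: env:// rendezvous, gloo backend, two machines).
# Launch on each machine with the rendezvous variables set, e.g.:
#   MASTER_ADDR=10.0.0.1 MASTER_PORT=29500 WORLD_SIZE=2 RANK=0 python train.py
import datetime
import os

import torch.distributed as dist


def init_distributed():
    # Rank and world size are read from the environment set by the launcher.
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    dist.init_process_group(
        backend="gloo",
        init_method="env://",  # uses MASTER_ADDR and MASTER_PORT
        world_size=world_size,
        rank=rank,
        timeout=datetime.timedelta(days=3),  # extended timeout, as in the question
    )
    return rank, world_size


if __name__ == "__main__":
    rank, world_size = init_distributed()
    print(f"rank {rank}/{world_size} initialized")
    dist.barrier()  # simple collective to confirm the machines can reach each other
    dist.destroy_process_group()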

