In the AWS cloud, Spark cluster sometimes crashes with java.net.ConnectException on checkpoint()

The failure happens quite rarely and on different tasks, but it is always connected with the checkpoint() call.
HDFS is used for checkpointing, so perhaps the problem is its instability or an incorrect configuration.
At the same time, there is always enough disk and memory in the cluster; this can be seen from the Ganglia graphs. I'm already thinking of removing HDFS and, as before, using a regular file system for checkpointing.
Or maybe the problem is AWS instability, since we use AWS spot instances. The exception looks like:
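For reference, switching the checkpoint store away from HDFS is just a matter of the checkpoint directory URI. A minimal sketch (the paths, bucket name, and host are illustrative assumptions, not from the question; `sc` is an existing SparkContext):

```python
# Current setup (assumed): checkpoints go to HDFS via the NameNode at master:8020.
sc.setCheckpointDir("hdfs://master:8020/spark-checkpoints")

# HDFS-free alternatives:
# S3 via the s3a connector (hypothetical bucket name).
sc.setCheckpointDir("s3a://my-bucket/spark-checkpoints")

# A file:// path only works on a multi-node cluster if every executor sees
# the SAME filesystem (e.g. NFS/EFS mount), not each node's local disk --
# otherwise a task rescheduled on another node cannot read the checkpoint.
sc.setCheckpointDir("file:///mnt/shared/spark-checkpoints")
```

On spot instances, S3 has the advantage that checkpoints survive the loss of any node, whereas an HDFS NameNode on a reclaimed instance takes the whole checkpoint store with it.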

java.net.ConnectException: Call From slave-name-12224/ip to master:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

I still don't know how else to catch this error. Has anyone encountered this?
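Since the failure is rare and transient, one pragmatic way to "catch" it is a retry wrapper around whatever triggers the checkpoint. A minimal sketch (the function name, attempt count, and backoff are illustrative; note that in PySpark the JVM's java.net.ConnectException surfaces wrapped, e.g. as py4j.protocol.Py4JJavaError, so you would match on that instead of Python's ConnectionError):

```python
import socket
import time

def checkpoint_with_retry(do_checkpoint, attempts=3, backoff_s=5.0):
    """Call do_checkpoint(), retrying on connection errors.

    do_checkpoint is a placeholder for whatever forces the checkpoint
    to materialize (e.g. a lambda wrapping rdd.checkpoint() plus an
    action such as rdd.count()).
    """
    for attempt in range(1, attempts + 1):
        try:
            return do_checkpoint()
        except (ConnectionError, socket.error):
            if attempt == attempts:
                raise  # out of retries: surface the original error
            # Linear backoff before retrying, to let the NameNode recover.
            time.sleep(backoff_s * attempt)
```

This does not fix the underlying instability (a reclaimed spot instance hosting the NameNode will fail every retry), but it distinguishes a momentary refusal from a dead master.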



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
