'Flink TaskManager not reconnecting to the new Jobmanager

I have configured Flink in HA mode as mentioned here:

I wanted to test the fault tolerance, hence I did the following:

  1. Setup Flink cluster with 2 JobManagers and 1 TaskManager
  2. Start a streaming job on task manager
  3. Kill the active job manager(to simulate a crash)
  4. The leader election is happening as expected.
  5. But the task manager is noted reconnecting to the new job manager. It simply tries to reconnect to the previous leader every 10seconds.

Pasting the task manager log here:

2018-07-25 19:46:08,508 INFO  org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration  - Messages have a max timeout of 10000 ms
2018-07-25 19:46:08,515 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService              - Starting RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at akka://flink/user/taskmanager_0 .
2018-07-25 19:46:08,524 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
2018-07-25 19:46:08,525 INFO  org.apache.flink.runtime.taskexecutor.JobLeaderService        - Start job leader service.
2018-07-25 19:46:08,529 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Connecting to ResourceManager akka.tcp://[email protected]:46477/user/resourcemanager(b91b9aeb3565be973c9bb47259414e0a).
2018-07-25 19:46:08,574 WARN  akka.remote.transport.netty.NettyTransport                    - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /10.10.97.210:46477
2018-07-25 19:46:08,576 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://[email protected]:46477] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://[email protected]:46477]] Caused by: [Connection refused: /10.10.97.210:46477]
2018-07-25 19:46:08,579 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could not resolve ResourceManager address akka.tcp://[email protected]:46477/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://[email protected]:46477/user/resourcemanager..
2018-07-25 19:46:18,606 WARN  akka.remote.transport.netty.NettyTransport                    - Remote connection to [null] failed with java.net.ConnectException: Connection refused: /10.10.97.210:46477
2018-07-25 19:46:18,607 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://[email protected]:46477] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://[email protected]:46477]] Caused by: [Connection refused: /10.10.97.210:46477]
2018-07-25 19:46:18,607 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Could not resolve ResourceManager address akka.tcp://[email protected]:46477/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://[email protected]:46477/user/resourcemanager..
  1. Restarting task manager doesn't help
  2. Restarting cluster doesn't help

Please guide me if anything is missing.



Solution 1:[1]

Looking into the logs:

Connection refused: /10.10.97.210:46477

Was port 46477 was opened/excluded from firewall?

Just check if you have set the following in flink config:

jobmanager.rpc.port: 6123 
blob.server.port: 50100-50200 

And then unblock these ports.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1