ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
I have an HBase (0.96.1.1-cdh5.0.2) cluster on AWS, managed by Cloudera, with 4 region servers and 1 ZooKeeper server. The ZooKeeper server runs on the same host as the HBase master. The problem is that 3 of the 4 region servers are down because they can't connect to ZooKeeper. The only region server that stays up is the one running on the same host as the master and ZooKeeper. Below is the relevant section of the log from one of the failing region servers.
2014-11-14 15:46:59,871 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=ip-10-146-188-157.ec2.internal:2181 sessionTimeout=60000 watcher=regionserver:60020, quorum=ip-10-146-188-157.ec2.internal:2181, baseZNode=/hbase
2014-11-14 15:46:59,915 INFO org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Process identifier=regionserver:60020 connecting to ZooKeeper ensemble=ip-10-146-188-157.ec2.internal:2181
2014-11-14 15:46:59,920 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181. Will not attempt to authenticate using SASL (unknown error)
2014-11-14 15:47:00,649 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown hook thread: Shutdownhook:regionserver60020
2014-11-14 15:47:59,948 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 60041ms for sessionid 0x0, closing socket connection and attempting reconnect
2014-11-14 15:48:00,067 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-10-146-188-157.ec2.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
2014-11-14 15:48:00,072 INFO org.apache.hadoop.hbase.util.RetryCounter: Sleeping 1000ms before retry #0...
2014-11-14 15:48:01,067 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181. Will not attempt to authenticate using SASL (unknown error)
2014-11-14 15:49:00,123 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 60057ms for sessionid 0x0, closing socket connection and attempting reconnect
2014-11-14 15:49:00,224 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-10-146-188-157.ec2.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
2014-11-14 15:49:00,224 INFO org.apache.hadoop.hbase.util.RetryCounter: Sleeping 2000ms before retry #1...
2014-11-14 15:49:01,224 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181. Will not attempt to authenticate using SASL (unknown error)
2014-11-14 15:50:00,259 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 60035ms for sessionid 0x0, closing socket connection and attempting reconnect
2014-11-14 15:50:00,360 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-10-146-188-157.ec2.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
2014-11-14 15:50:00,360 INFO org.apache.hadoop.hbase.util.RetryCounter: Sleeping 4000ms before retry #2...
2014-11-14 15:50:01,360 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181. Will not attempt to authenticate using SASL (unknown error)
2014-11-14 15:51:00,408 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 60048ms for sessionid 0x0, closing socket connection and attempting reconnect
2014-11-14 15:51:00,509 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-10-146-188-157.ec2.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
2014-11-14 15:51:00,509 INFO org.apache.hadoop.hbase.util.RetryCounter: Sleeping 8000ms before retry #3...
2014-11-14 15:51:01,509 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server ip-10-146-188-157.ec2.internal/10.146.188.157:2181. Will not attempt to authenticate using SASL (unknown error)
2014-11-14 15:52:00,559 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 60051ms for sessionid 0x0, closing socket connection and attempting reconnect
2014-11-14 15:52:00,659 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=ip-10-146-188-157.ec2.internal:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
2014-11-14 15:52:00,660 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper exists failed after 4 attempts
2014-11-14 15:52:00,661 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: regionserver:60020, quorum=ip-10-146-188-157.ec2.internal:2181, baseZNode=/hbase Unable to set watcher on znode /hbase/master
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:425)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:77)
at org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:671)
at org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:644)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:772)
at java.lang.Thread.run(Thread.java:744)
2014-11-14 15:52:00,687 ERROR org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: regionserver:60020, quorum=ip-10-146-188-157.ec2.internal:2181, baseZNode=/hbase Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:425)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:77)
at org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:671)
at org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:644)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:772)
at java.lang.Thread.run(Thread.java:744)
2014-11-14 15:52:00,692 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server 0.0.0.0,60020,1415998019646: Unexpected exception during initialization, aborting
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1041)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:199)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:425)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:77)
at org.apache.hadoop.hbase.regionserver.HRegionServer.initializeZooKeeper(HRegionServer.java:671)
at org.apache.hadoop.hbase.regionserver.HRegionServer.preRegistrationInitialization(HRegionServer.java:644)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:772)
at java.lang.Thread.run(Thread.java:744)
I suspect this might be related to the /etc/hosts configuration, but I can't figure out what the problem is. The /etc/hosts file on each instance in the cluster is:
127.0.0.1 localhost.localdomain localhost
::1 localhost6.localdomain6 localhost6
The part of hbase-site.xml that deals with ZooKeeper is:
<property>
<name>zookeeper.znode.parent</name>
<value>/hbase</value>
</property>
<property>
<name>zookeeper.znode.rootserver</name>
<value>root-region-server</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>ip-10-146-188-157.ec2.internal</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
</property>
Any help will be greatly appreciated.
Solution 1:[1]
Have you assigned a fully qualified domain name (FQDN) to each of your host machines? If not, assign one, then replace the corresponding "localhost" entries or IP addresses in the configuration files with the FQDN.
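Before changing the configuration, it may help to confirm from one of the failing region server hosts that the quorum hostname resolves and that the client port accepts connections. The following is a minimal Python sketch, not part of HBase or ZooKeeper: `check_zookeeper` and `local_identity` are hypothetical helper names, and the checks cover only name resolution and TCP reachability, not the ZooKeeper protocol itself.

```python
import socket


def check_zookeeper(host, port=2181, timeout=5.0):
    """Resolve `host` and try a TCP connection to the ZooKeeper client port.

    Only name resolution and reachability are tested; nothing is sent
    over the connection.
    """
    result = {"host": host, "port": port, "addresses": [],
              "reachable": False, "error": None}
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        result["error"] = "DNS resolution failed: %s" % exc
        return result
    # Collect the distinct IP addresses the name resolves to.
    result["addresses"] = sorted({info[4][0] for info in infos})
    try:
        with socket.create_connection((host, port), timeout=timeout):
            result["reachable"] = True
    except OSError as exc:
        # Covers refused connections, timeouts and unreachable networks.
        result["error"] = "TCP connect failed: %s" % exc
    return result


def local_identity():
    """Report the name and address this machine resolves for itself."""
    name = socket.getfqdn()
    try:
        address = socket.gethostbyname(name)
    except OSError:
        address = None
    return {"fqdn": name, "address": address}
```

On a failing host, you could call `check_zookeeper("ip-10-146-188-157.ec2.internal")` with the hostname from `hbase.zookeeper.quorum`, and `local_identity()` to see what the machine calls itself. If the name resolves but the connection times out, a firewall or security group rule on port 2181 is one possibility worth ruling out. If `local_identity()` returns a loopback address or no address, that would point back at the hostname and /etc/hosts setup.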
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Vikas |
