ZooKeeper is a centralized service for maintaining configuration information and synchronizing distributed applications; clients use it, for example, to coordinate work through distributed locks. A ZooKeeper ensemble should contain an odd number of servers so that a majority (quorum) can always be formed, and it is best to run a minimum of three ZooKeeper servers.
About ZooKeeper
In distributed application engineering, the word node can refer to a generic host machine, a server, a member of an ensemble, a client process, and so on. In the ZooKeeper documentation, znodes refer to the data nodes; servers refer to the machines that make up the ZooKeeper service; quorum peers refer to the servers that make up an ensemble; and client refers to any host or process that uses the ZooKeeper service.
Znodes have three characteristics:
Watches
Clients can set watches on znodes. A change to that znode triggers the watch and then clears it. When a watch triggers, ZooKeeper sends the client a notification. More information can be found in the ZooKeeper Watches section of the documentation. For example, HDFS HA sets a watch on startup.
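As a quick illustration using the zookeeper-client shell described later, passing watch when reading a znode registers a one-time watch; /demo is a placeholder path, not a znode from this cluster.
stat /demo watch
get /demo watch
Any subsequent change to /demo fires a single notification to this client, after which the watch must be registered again.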
Data Access
The data stored at each znode in a namespace is read and written atomically. Reads get all the data bytes associated with a znode and a write replaces all the data. Each znode has an Access Control List (ACL) that restricts who can do what. The default znode type is persistent: persistent znodes remain until they are explicitly deleted and typically hold the important configuration details. When a new node joins the cluster, it reads its configuration information from these persistent znodes.
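For example, in the zookeeper-client shell a persistent znode can be created, read back, and have its ACL inspected as follows; /demo-config and its contents are placeholders.
create /demo-config "some configuration data"
get /demo-config
getAcl /demo-config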
Ephemeral Nodes
ZooKeeper also has the notion of ephemeral znodes. These znodes exist as long as the session that created them is active; when the session ends, the znode is deleted. Because of this behavior, ephemeral znodes are not allowed to have children. Ephemeral znodes are mainly useful for tracking client applications in case of failure: when the application fails and its session expires, the znode disappears.
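A sketch of this pattern in the zookeeper-client shell, where /demo-alive is a placeholder: the -e flag creates an ephemeral znode that disappears as soon as the creating session closes.
create -e /demo-alive "up"
ls /
quit
Reconnecting and running ls / again shows that /demo-alive is gone, because quitting ended the session that owned it.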
Configure ZooKeeper
Install ZooKeeper
Cloudera Manager distributes ZooKeeper as part of CDH. During installation you select the hosts that will run the ZooKeeper Server role. Generally it is best to install ZooKeeper servers in groups of three (or another odd number).
- ZooKeeper Server: due to the CPU and latency requirements of the ZooKeeper Server, only a few services should share the host with this role. Make sure a ZooKeeper Server is not co-located with an HBase RegionServer, HDFS DataNode, or a YARN ResourceManager; I’ve had luck co-locating the ZooKeeper Server with an HDFS JournalNode.
ZooKeeper Configuration
Configure ZooKeeper with the following settings:
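The exact values are managed through Cloudera Manager, but as a rough reference a ZooKeeper server configuration (zoo.cfg) looks something like the following; the hostnames, paths, and sizes here are placeholders rather than recommended values.
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
dataLogDir=/var/lib/zookeeper
clientPort=2181
maxClientCnxns=60
server.1=zookeeper1:2888:3888
server.2=zookeeper2:2888:3888
server.3=zookeeper3:2888:3888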
zookeeper-client
Connect to ZooKeeper using the zookeeper-client command
zookeeper-client -server <zookeeper-alias>:2181
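For example, a short interactive session might look like this (zookeeper1 is a placeholder alias, and the /hbase znode only exists if HBase uses this ensemble)
zookeeper-client -server zookeeper1:2181
ls /
stat /hbase
quit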
Create RAMDISK for ZooKeeper
Create the directory to be mounted
sudo mkdir -p /mnt/ramdisk1
Change permissions on the directory
sudo chmod -R 775 /mnt/ramdisk1
sudo chown -R zookeeper:zookeeper /mnt/ramdisk1
The tmpfs filesystem is a RAM disk. Mount the tmpfs filesystem
sudo mount -t tmpfs -o size=8192M tmpfs /mnt/ramdisk1
To make the ramdisk permanently available, add it to /etc/fstab.
sudo grep /mnt/ramdisk1 /etc/mtab | sudo tee -a /etc/fstab
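The resulting /etc/fstab entry should look roughly like the following (exact mount options vary by system), and df -h confirms the mount
tmpfs /mnt/ramdisk1 tmpfs rw,size=8192M 0 0
df -h /mnt/ramdisk1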
Troubleshooting
ZooKeeper Canary test failed to establish a connection or a client session to the ZooKeeper service – Too many connections
I once noticed that my Hive commands were failing. I looked at ZooKeeper and saw the error "Canary test failed to establish a connection or a client session to the ZooKeeper service". I looked in the ZooKeeper service log and saw the following:
1:35:11.059 PM WARN org.apache.zookeeper.server.NIOServerCnxnFactory
Too many connections from /192.168.210.253 – max is 60
1:35:11.059 PM WARN org.apache.zookeeper.server.NIOServerCnxnFactory
Too many connections from /192.168.210.253 – max is 60
1:35:11.063 PM WARN org.apache.zookeeper.server.NIOServerCnxnFactory
Too many connections from /192.168.210.253 – max is 60
Cause: I was running commands through Hive that were not disconnecting from ZooKeeper. There is a bug in Hue that causes this issue, especially with this command: ALTER TABLE ADD PARTITION.
Resolution: Restarting Hive might work; otherwise, restart the Hue service. You might also have to increase the maximum number of client connections.
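To see which hosts hold the connections, you can query the ZooKeeper four-letter-word commands directly; <zookeeper-host> is a placeholder for one of your servers. The "max is 60" in the log corresponds to the maxClientCnxns setting, which defaults to 60 and can be raised in Cloudera Manager or zoo.cfg.
echo cons | nc <zookeeper-host> 2181
echo stat | nc <zookeeper-host> 2181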
References: http://community.cloudera.com/t5/Batch-SQL-Apache-Hive/hiveserver2-fails-to-release-zookeeper-connection-in-CDH5-0-1/td-p/13598, https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/WSt7nB84Lm0
ZooKeeper: Cannot Start ZooKeeper: Last Transaction Was Partial
The disk was full, and after recovering you receive an error that you cannot start ZooKeeper; there is an exception: Last transaction was partial. You must delete the snapshot file from the /var/lib/zookeeper/version-2/ data folder.
Log file: /var/log/zookeeper/zookeeper-cmf-zookeeper1-SERVER-servername01.log
2014-06-11 16:02:59,874 INFO org.apache.zookeeper.server.persistence.FileSnap: Reading snapshot /var/lib/zookeeper/version-2/snapshot.a59a5
2014-06-11 16:02:59,982 ERROR org.apache.zookeeper.server.persistence.Util: Last transaction was partial.
2014-06-11 16:02:59,983 ERROR org.apache.zookeeper.server.ZooKeeperServerMain: Unexpected exception, exiting abnormally
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at org.apache.zookeeper.server.persistence.FileHeader.deserialize(FileHeader.java:64)
…
2014-06-11 16:08:14,298 INFO org.apache.zookeeper.server.quorum.QuorumPeerConfig: Reading configuration from: /run/cloudera-scm-agent/process/110-zookeeper-server/zoo.cfg
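A sketch of the recovery, assuming the path and snapshot name from the log above (the file name will differ on your cluster): stop the ZooKeeper role (for example through Cloudera Manager), move the offending file aside rather than deleting it outright, then restart the role so it recovers from the remaining files or resyncs from the rest of the ensemble.
ls -lt /var/lib/zookeeper/version-2/
sudo mv /var/lib/zookeeper/version-2/snapshot.a59a5 /tmp/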
ZooKeeper SessionExpired Events: HBase RegionServer service stops
Error:
WARN org.apache.hadoop.hbase.util.Sleeper
We slept 247488ms instead of 3000ms, this is likely due to a long garbage collecting pause and it’s usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
FATAL org.apache.hadoop.hbase.regionserver.HRegionServer
ABORTING region server servername32,60020,1378856731842: Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing servername32,60020,1378856731842 as dead server
at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:254)
at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:172)
at org.apache.hadoop.hbase.master.HMaster.regionServerReport(HMaster.java:1010)
…
Resolution:
The Master or RegionServers shut down with messages like these in their logs:
WARN org.apache.zookeeper.ClientCnxn: Exception
closing session 0x278bd16a96000f to sun.nio.ch.SelectionKeyImpl@355811ec
java.io.IOException: TIMED OUT
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:906)
WARN org.apache.hadoop.hbase.util.Sleeper: We slept 79410ms, ten times longer than scheduled: 5000
INFO org.apache.zookeeper.ClientCnxn: Attempting connection to server hostname/IP:PORT
INFO org.apache.zookeeper.ClientCnxn: Priming connection to java.nio.channels.SocketChannel[connected local=/IP:PORT remote=hostname/IP:PORT]
INFO org.apache.zookeeper.ClientCnxn: Server connection successful
WARN org.apache.zookeeper.ClientCnxn: Exception closing session 0x278bd16a96000d to sun.nio.ch.SelectionKeyImpl@3544d65e
java.io.IOException: Session Expired
at org.apache.zookeeper.ClientCnxn$SendThread.readConnectResult(ClientCnxn.java:589)
at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:709)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired
The JVM is doing a long-running garbage collection that pauses every thread (a "stop the world" pause). Since the RegionServer's local ZooKeeper client cannot send heartbeats, the session times out. By design, any node that cannot contact the ZooKeeper ensemble within the timeout shuts itself down so that it stops serving data that may already have been assigned elsewhere.
- Make sure you give the RegionServer plenty of RAM (in hbase-env.sh); the default of 1 GB won't be able to sustain long-running imports. A sample hbase-env.sh snippet follows this list.
- Make sure you don't swap; the JVM never behaves well under swapping.
- Make sure you are not CPU starving the RegionServer thread. For example, if you are running a MapReduce job using 6 CPU-intensive tasks on a machine with 4 cores, you are probably starving the RegionServer enough to create longer garbage collection pauses.
- Increase the ZooKeeper session timeout
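As a rough sketch of the first recommendation, hbase-env.sh might contain something like the following; the heap size (specified in MB on older HBase releases) and GC flags are illustrative, not tuned values for any particular cluster.
export HBASE_HEAPSIZE=8192
export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"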
If you wish to increase the session timeout, add the following to your hbase-site.xml to increase the timeout from the default of 60 seconds to 120 seconds. The tickTime is raised as well because ZooKeeper caps the session timeout at 20 times the tick, so a 120-second timeout is only honored when the tick is 6 seconds.
<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value>
</property>
<property>
  <name>hbase.zookeeper.property.tickTime</name>
  <value>6000</value>
</property>
Be aware that setting a higher timeout means that the regions served by a failed RegionServer will take at least that amount of time to be transferred to another RegionServer. For a production system serving live requests, we would instead recommend setting it lower than 1 minute and over-provisioning your cluster in order to lower the memory load on each machine (and hence have less garbage to collect per machine).
If this is happening during an upload which only happens once (like initially loading all your data into HBase), consider bulk loading.
See Section 13.11.2, "ZooKeeper, The Cluster Canary", in the HBase Reference Guide for other general information about ZooKeeper troubleshooting.