In the last chapter, we saw how to write a minimal Storm topology and run it on the local mode and a single-node Storm cluster. In this chapter, we will cover the following topics:
• How to run the sample topology in a distributed Storm cluster
• How to configure the parallelism of a topology
• How to partition a stream using different stream grouping
In the last chapter, we saw how to set up single-node ZooKeeper to use with Storm.
Even though we can proceed with the same ZooKeeper setup for a distributed Storm cluster setup, then it will be a single point of failure in the cluster. To avoid this, we are deploying a distributed ZooKeeper cluster.
It is advised to run an odd number of ZooKeeper nodes, as the ZooKeeper cluster keeps working as long as the majority (the number of live nodes is greater than n/2, where n is the number of deployed nodes) of the nodes are running. So, if we have a cluster of four ZooKeeper nodes (3 > 4/2, only one node can die), then we can handle only one node failure, while if we had five nodes (3 > 5/2, two nodes can die) in the cluster, we can handle two node failures.
We will be deploying a ZooKeeper ensemble of three nodes that will handle one node failure. The following is the deployment diagram of the three-node ZooKeeper ensemble:
Follower Leader Follower
Server1
(zoo1) Server2
(zoo2) Server3
(zoo3)
client client client client
Zoo eeper EnsembleK
A ZooKeeper ensemble
In the ZooKeeper ensemble, one node in the cluster acts as the leader, while the rest are followers. If the leader node of the ZooKeeper cluster dies, then an election for the new leader takes places among the remaining live nodes, and a new leader is elected. All write requests coming from clients are forwarded to the leader node, while the follower nodes only handle the read requests. Also, we can't increase the write performance of the ZooKeeper ensemble by increasing the number of nodes because all write operations go through the leader node.
The following steps need to be performed on each node to deploy the ZooKeeper ensemble:
1. Download the latest stable ZooKeeper release from the ZooKeeper site (http://zookeeper.apache.org/releases.html). At this moment, the latest version is ZooKeeper 3.4.5.
2. Once you have downloaded the latest version, unzip it. Now, we set up the ZK_HOME environment variable to make the setup easier.
3. Point the ZK_HOME environment variable to the unzipped directory. Create the configuration file, zoo.cfg, at $ZK_HOME/conf directory using the following commands:
cd $ZK_HOME/conf touch zoo.cfg
4. Add the following properties to the zoo.cfg file:
tickTime=2000
dataDir=/var/zookeeper clientPort=2181
initLimit=5 syncLimit=2
server.1=zoo1:2888:3888 server.2=zoo2:2888:3888 server.3=zoo3.2888.3888
Here, zoo1, zoo2, and zoo3 are the IP addresses of the ZooKeeper nodes.
The following are the definitions for each of the properties:
° tickTime: This is the basic unit of time in milliseconds used by ZooKeeper. It is used to send heartbeats, and the minimum session timeout will be twice the tickTime value.
° dataDir: This is the directory to store the in-memory database snapshots and transactional log.
° clientPort: This is the port used to listen to client connections.
° initLimit: This is the number of tickTime values to allow followers to connect and sync to a leader node.
° syncLimit: This is the number of tickTime values that a follower can take to sync with the leader node. If the sync does not happen within this time, the follower will be dropped from the ensemble.
The last three lines of the server.id=host:port:port format specifies that there are three nodes in the ensemble. In an ensemble, each ZooKeeper node must have a unique ID between 1 and 255. This ID is defined by creating a file named myid in the dataDir directory of each node. For example, the node with the ID 1 (server.1=zoo1:2888:3888) will have a myid file at / var/zookeeper with the text 1 inside it.
For this cluster, create the myid file at three locations, shown as follows:
At zoo1 /var/zookeeper/myid contains 1 At zoo2 /var/zookeeper/myid contains 2 At zoo3 /var/zookeeper/myid contains 3
5. Run the following command on each machine to start the ZooKeeper cluster:
bin/zkServer.sh start
6. Check the status of the ZooKeeper nodes by performing the following steps:
1. Run the following command on the zoo1 node to check the first node's status:
bin/zkServer.sh status
The following information is displayed:
JMX enabled by default
Using config: /home/root/zookeeper-3.4.5/bin/../conf/zoo.cfg Mode: follower
The first node is running in the follower mode.
2. Check the status of the second node by performing the following command:
bin/zkServer.sh status
The following information is displayed:
JMX enabled by default
Using config: /home/root/zookeeper-3.4.5/bin/../conf/zoo.cfg Mode: leader
The second node is running in the leader mode.
3. Check the status of the third node by performing the following command:
bin/zkServer.sh status
The following information is displayed:
JMX enabled by default
Using config: /home/root/zookeeper-3.4.5/bin/../conf/zoo.cfg Mode: follower
The third node is running in the follower mode.
7. Run the following command on the leader machine to stop the leader node:
bin/zkServer.sh stop
8. Now, check the status of the remaining two nodes by performing the following steps:
1. Check the status of the first node using the following command:
bin/zkServer.sh status
The following information is displayed:
JMX enabled by default
Using config: /home/root/zookeeper-3.4.5/bin/../conf/zoo.cfg Mode: follower
The first node is again running in the follower mode.
2. Check the status of the third node using the following command:
bin/zkServer.sh status
The following information is displayed:
JMX enabled by default
Using config: /home/root/zookeeper-3.4.5/bin/../conf/zoo.cfg Mode: leader
The third node is elected as the new leader.
3. Now, restart the third node with the following command:
bin/zkServer.sh status
This was a quick introduction to setting up ZooKeeper that can be used for development; however, it is not suitable for production. For a complete reference on ZooKeeper administration and maintenance, please refer to the online
documentation at the ZooKeeper site at http://zookeeper.apache.org/doc/
trunk/zookeeperAdmin.html.