Hadoop High Availability Tutorial
In this tutorial, we will have a look at the High Availability feature of the Apache Hadoop cluster. High Availability is one of the most important features, needed especially when the cluster is in production. We do not want a single failure to make the whole cluster unavailable, and this is where the High Availability feature of Hadoop comes into play.
1. Introduction
We will try to understand the High Availability feature of Hadoop and how to configure it in the HDFS cluster.
There are two ways to achieve High Availability in the cluster. They are:
- Using the Quorum Journal Manager (QJM)
- Using NFS for the shared storage
In this tutorial, we will learn about setting up an HA HDFS cluster using the Quorum Journal Manager (QJM). QJM shares the edit logs between the Active and Standby NameNodes to keep both of them in sync, so that the Standby NameNode is ready and up to date if and when it needs to take control of the cluster.
Note: This tutorial assumes that you have a general understanding of Hadoop, the Hadoop cluster and the HDFS architecture. You can go through the following articles if needed to build the basic understanding required for this tutorial:
- Apache Hadoop Cluster Setup Example (with Virtual Machines)
- Apache Hadoop Zookeeper Example
- Apache Hadoop Distributed File System Explained
- Big Data Hadoop Tutorial for Beginners
2. How It Works
In Hadoop v1, the NameNode was always the single point of failure for the whole cluster. Any other failure could be handled very well, except for the NameNode. If the NameNode failed or went down, the whole cluster was unavailable for any work until the NameNode was restored or a new NameNode was added to the cluster.
From Hadoop 2.0 onwards we have a solution for this: the High Availability feature, which lets us run redundant NameNodes in the same cluster, out of which one will be active and the others will be on standby. In v2.0 only two redundant NameNodes were supported, but from v3.0 we can add more than two redundant NameNodes. Only one NameNode can be active at any time. The active NameNode is responsible for all client operations in the cluster; the standby NameNodes maintain enough state information so that, in case of a failure of the active NameNode, they can provide fast failover.
To maintain this state and keep all the active and standby NameNodes in sync, QJM comes into action. All the NameNodes communicate with a group of separate daemons called JournalNodes (JNs). The active node logs every modification to a majority of the JournalNodes as soon as it is done, and the standby NameNodes constantly watch the JournalNodes for these modifications. As soon as a modification is logged in the JournalNodes, the standby NameNodes apply the change to their own namespace.
This way, the standby NameNodes are also up to date at all times. As an additional precautionary measure, in case of failure of the active NameNode, a standby NameNode will read all the pending logs and make sure its namespace is up to date before taking over the role of the active NameNode.
It is not sufficient to keep the standby NameNodes updated only with the namespace changes and edit logs. In order to take control, a standby NameNode also needs up-to-date information regarding the status of all the DataNodes and the location of all the data blocks in the cluster. This is solved by configuring the DataNodes so that they send block location information and heartbeats to all the NameNodes and not only to the active NameNode. This way, the standby NameNodes have all the required information about the DataNodes and the blocks of data stored on them.
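As a quick illustration of these roles, once the cluster is configured as described in the next section, the hdfs haadmin command can be used to ask each NameNode whether it is currently active or standby. This is only a sketch and assumes the NameNode IDs namenode1, namenode2 and namenode3 which we will configure below:
# Query the current role of each NameNode (the IDs are configured in section 3.1)
hdfs haadmin -getServiceState namenode1   # prints "active" or "standby"
hdfs haadmin -getServiceState namenode2
hdfs haadmin -getServiceState namenode3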
3. Configuration for HA
Following are the configuration settings needed to enable the Hadoop cluster for High Availability.
3.1 hdfs-site.xml
First of all, we have to set the appropriate configurations in the hdfs-site.xml file to assign IDs to the NameNodes. Following are the required configurations:
dfs.nameservices
Nameservices, as indicated by the name, is the logical name for the cluster which we will be setting up. This name will be used as the logical name for the cluster in other configuration settings, as well as the authority component of absolute HDFS paths.
<property>
  <name>dfs.nameservices</name>
  <value>testcluster</value>
</property>
dfs.ha.namenodes.[nameservice ID] (dfs.ha.namenodes.testcluster)
This configuration setting identifies each NameNode with a unique ID. In this setting, we list all the NameNodes as a comma-separated list of IDs. DataNodes check this setting to learn about all the NameNodes in the cluster and send their heartbeats to all of these NameNodes.
Let us assume we have 3 NameNodes set up with the IDs namenode1, namenode2 and namenode3. The configuration should be as below:
<property>
  <name>dfs.ha.namenodes.testcluster</name>
  <value>namenode1,namenode2,namenode3</value>
</property>
dfs.namenode.rpc-address.[nameservice ID].[namenode ID]
This configuration setting defines the fully qualified RPC address of each NameNode.
<property>
  <name>dfs.namenode.rpc-address.testcluster.namenode1</name>
  <value>machine1.example.com:9820</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.testcluster.namenode2</name>
  <value>machine2.example.com:9820</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.testcluster.namenode3</name>
  <value>machine3.example.com:9820</value>
</property>
dfs.namenode.http-address.[nameservice ID].[namenode ID]
This configuration setting defines the fully qualified HTTP address of each NameNode.
<property>
  <name>dfs.namenode.http-address.testcluster.namenode1</name>
  <value>machine1.example.com:9870</value>
</property>
<property>
  <name>dfs.namenode.http-address.testcluster.namenode2</name>
  <value>machine2.example.com:9870</value>
</property>
<property>
  <name>dfs.namenode.http-address.testcluster.namenode3</name>
  <value>machine3.example.com:9870</value>
</property>
dfs.namenode.shared.edits.dir
This configuration defines the URI of the JournalNode group to which the active NameNode writes the edit logs and from which the standby NameNodes read them.
Let us assume that the Journal Nodes are running on the following machines:
- node1.example.com
- node2.example.com
- node3.example.com
and our nameservice ID is the same as above, i.e. “testcluster”. The default port for the JournalNode is 8485.
The complete configuration will be as below:
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://node1.example.com:8485;node2.example.com:8485;node3.example.com:8485/testcluster</value>
</property>
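Note that the JournalNode daemon must be running on each of these machines before the NameNodes can use them. As a rough sketch, assuming a Hadoop 3.x installation, the daemon can be started on each JournalNode machine with:
# Run on node1.example.com, node2.example.com and node3.example.com
# (on Hadoop 2.x the equivalent is "hadoop-daemon.sh start journalnode")
$HADOOP_HOME/bin/hdfs --daemon start journalnode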
dfs.client.failover.proxy.provider.[nameservice ID]
The failover proxy provider is the Java class from the Hadoop package which HDFS clients use to determine which NameNode is the active node and should be used to serve client requests.
As of now, there are two implementations which come with the Hadoop package:
- ConfiguredFailoverProxyProvider
- RequestHedgingProxyProvider
The configuration setting will be as below:
<property>
  <name>dfs.client.failover.proxy.provider.testcluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
dfs.ha.fencing.methods
As we discussed above, it is very important that only one NameNode is active at a time. The Quorum Journal Manager makes sure that only one NameNode can write to the edit logs at a time, but the previously active NameNode could still serve stale read requests to clients, so a fencing method should be configured to make sure the failed NameNode is really cut off during a failover.
There are two fencing methods which can be used:
- sshfence: The sshfence method, as the name suggests, SSHes to the target node and uses fuser to kill the process listening on the service’s TCP port. This allows us to make sure that the failed active NameNode is no longer listening to any requests from clients. Note that this requires passwordless SSH to the target node using the private key configured below, and fuser must be available on the target machine.
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/exampleuser/.ssh/id_rsa</value>
</property>
- shell: The shell fencing method runs an arbitrary shell command. The configuration is as below:
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>shell(/path/to/the/script.sh args1 args2 args3 ...)</value>
</property>
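The script is expected to return exit code 0 when fencing succeeded. The shell fencing method exposes information about the NameNode to be fenced through environment variables such as target_host and target_port. A minimal, hypothetical fencing script could look like the sketch below; the script path, the hdfs user and the SSH-plus-fuser approach are assumptions for illustration only:
#!/bin/bash
# Hypothetical fencing script: cut off the old active NameNode by killing
# whatever is listening on its RPC port. $target_host and $target_port are
# provided to the script by the shell fencing method.
ssh hdfs@"$target_host" "fuser -k -n tcp $target_port" || exit 1
# Returning 0 tells the failover controller that fencing succeeded.
exit 0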
This brings us to the end of the configuration settings in the hdfs-site.xml file. Now we will configure the core-site.xml file.
3.2 core-site.xml
In this section, we will address the configuration settings which need to be configured in the core-site.xml file.
fs.defaultFS
This configuration setting provides the default path which will be used by the Hadoop FS client when none is provided. We can use the HA-enabled logical URI which we assigned to the cluster in the hdfs-site.xml file.
The configuration will be as below:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://testcluster</value>
</property>
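With this setting in place, clients can address the cluster by its logical name instead of a specific NameNode host, and the configured failover proxy provider will resolve it to whichever NameNode is currently active. For example:
# The logical nameservice "testcluster" resolves to the currently active NameNode
hdfs dfs -ls hdfs://testcluster/
# With fs.defaultFS set, the short form works as well
hdfs dfs -ls /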
dfs.journalnode.edits.dir
This configuration setting defines the absolute path where the JournalNode daemon will store its edits data and its local state. Only a single path is provided for this configuration; redundancy for this data is achieved by running multiple separate JournalNodes, or by configuring the directory on a locally attached RAID array. (Note that this property is normally placed in the hdfs-site.xml file along with the other JournalNode settings.)
The configuration will be as below:
<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/path/to/the/journal/node/data/directory</value>
</property>
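The directory has to exist and be writable by the user running the JournalNode daemon on each JournalNode machine. A small sketch, assuming the daemons run as an hdfs user in a hadoop group (both names are assumptions for illustration):
# Run on each JournalNode machine
mkdir -p /path/to/the/journal/node/data/directory
chown -R hdfs:hadoop /path/to/the/journal/node/data/directory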
3.3 Zookeeper Configuration
All the configuration settings above make the Hadoop cluster highly available, but the failover needs to be triggered manually. In this mode, the system will not automatically trigger a failover from the active to a standby NameNode.
But it is beneficial to make this failover automatic, so that we do not need to keep an eye on NameNode failures and trigger the failover manually. We can use Zookeeper to make the failover automatic.
The ZKFailoverController (ZKFC) is a new component, a Zookeeper client, which monitors and manages the state of the NameNode and helps with automatic failover. Every node which runs a NameNode also needs to run a ZKFC.
To configure automatic failover and the use of ZKFC, we will need to set two configuration settings, one in the hdfs-site.xml file and one in the core-site.xml file.
Note: This assumes that Zookeeper is already installed properly on the cluster. We will not cover installing and running Zookeeper on the cluster.
The configuration setting in hdfs-site.xml will be as below:
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
The configuration setting in core-site.xml will be as below:
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
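Before moving on, it is worth verifying that the Zookeeper ensemble listed above is reachable from the NameNode machines. One simple check, assuming the Zookeeper client scripts are available on the machine:
# Connect to one of the Zookeeper servers and list the root znodes
zkCli.sh -server zk1.example.com:2181 ls /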
3.4 Starting the cluster
With all the configurations in place, we are ready to start the cluster now. Following are the commands we need to run to start the cluster in HA mode.
$HADOOP_HOME/bin/hdfs zkfc -formatZK
The above command will create a znode in Zookeeper inside of which the automatic failover system will store its data.
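If this is a brand-new cluster that has never been formatted, the NameNodes themselves also need to be initialized before the first start. Very roughly, and assuming the JournalNodes are already running, the usual sequence is to format one NameNode, start it, and bootstrap the remaining NameNodes from it:
# On the first NameNode machine only (e.g. machine1.example.com)
$HADOOP_HOME/bin/hdfs namenode -format
$HADOOP_HOME/bin/hdfs --daemon start namenode
# On each of the other NameNode machines (copies the formatted namespace)
$HADOOP_HOME/bin/hdfs namenode -bootstrapStandby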
Now, since automatic failover is enabled in the configuration, the following command will automatically start a ZKFC daemon on every machine that runs a NameNode, along with the rest of the HDFS daemons.
start-dfs.sh
Once the cluster starts, it will automatically select one of the NameNodes to become active and the others will be kept on standby.
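A simple way to verify that automatic failover actually works is to check which NameNode is active, stop it, and confirm that one of the standby NameNodes takes over. A rough sketch, assuming the NameNode IDs used in this tutorial:
# See which NameNode is currently active (available in Hadoop 3.x)
hdfs haadmin -getAllServiceState
# On the machine running the active NameNode, stop the daemon
hdfs --daemon stop namenode
# After a few seconds, one of the remaining NameNodes should report "active"
hdfs haadmin -getServiceState namenode2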
4. Conclusion
In this tutorial, we learned how to configure an Apache Hadoop cluster for High Availability.
We discussed the single point of failure issue which was present in the Hadoop cluster before v2.0 and how it is fixed in the latest versions. We discussed how the active and the standby NameNodes interact and stay in sync so that, in case of failure, a standby node can take over at any time. We then went through all the configurations that need to be done in hdfs-site.xml and core-site.xml, along with the relevant Zookeeper configuration settings, so that the failover can be initiated automatically.
Please feel free to comment if there is any confusion or if you encounter any problem in setting up the High Availability Hadoop cluster.