Apache Hadoop

How to Install Apache Hadoop on Ubuntu

In this example, we will see the details of how to install Apache Hadoop on an Ubuntu system.

We will go through all the required steps, starting with the prerequisites of Apache Hadoop, followed by how to configure Hadoop, and we will finish by learning how to insert data into Hadoop and how to run an example job on that data.

1. Introduction

The example will describe all the required steps for installing a single-node Apache Hadoop cluster on Ubuntu 15.10. Hadoop is a framework for distributed processing of applications on large clusters of commodity hardware. It is written in Java and follows the MapReduce computing paradigm.

2. Prerequisites

Following are the prerequisites of running Apache Hadoop on Ubuntu. Follow the steps to get all the prerequisites in place.

2.1 Installing Java

As Apache Hadoop is written in Java, it needs a recent Java installation on the system. To install Java, first update the package source list:

#Update the source list
sudo apt-get update

It should update all the existing packages as shown in the screenshot below.

Update Source List

Now install the default jdk using the following command.

# The OpenJDK project is the default version of Java 
sudo apt-get install default-jdk

OpenJDK is the default version of Java for Ubuntu Linux. It should install successfully with the apt-get command.

Installing Java

On Ubuntu 15.10, default-jdk installs Java version 1.7. Version 1.7 is fine for running Hadoop, but if you would like, you can explicitly install version 1.8 instead, as shown in the sketch below.
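
If you would rather run Hadoop on Java 8, the corresponding OpenJDK package can be installed explicitly. This is a minimal sketch, assuming the openjdk-8-jdk package is available in your Ubuntu release:

#optionally install OpenJDK 8 instead of the default JDK
sudo apt-get install openjdk-8-jdk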

#Java Version
java -version

Java Version

This completes the first prerequisite for Apache Hadoop. Next, we will create a dedicated user which Hadoop can use to execute its tasks.

2.2 Creating a Dedicated User

Hadoop needs a separate dedicated user for execution, with complete control over the Hadoop executables and data folders. To create such a user, run the following commands in the terminal.

#create a user group for hadoop
sudo addgroup hadoop

#create user hduser and add it to the hadoop usergroup
sudo adduser --ingroup hadoop hduser

The first command creates a new group with the name “hadoop” and the second command creates a new user “hduser” and assigns it to the “hadoop” group. We have kept all the user details such as “First Name”, “Phone Number” etc. empty. You can leave them empty or fill them in as you prefer.
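
To confirm that the account and group were created as expected, here is a quick check using the names from the commands above:

#verify the new user and its group membership
id hduser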

Creating dedicated user for Hadoop

2.3 Disable ipv6

The next step is to disable IPv6 on all the machines. Hadoop is set up to use IPv4, which is why we need to disable IPv6 before creating a Hadoop cluster. Open /etc/sysctl.conf as root using nano (or any other editor of your choice)

sudo nano /etc/sysctl.conf

and add the following lines at the end of the file.

#commands to disable ipv6
net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.lo.disable_ipv6=1

Disabling ipv6

Save the file using Ctrl+X and then Yes when prompted. After this, to check whether IPv6 is properly disabled, we can use the following command:

cat /proc/sys/net/ipv6/conf/all/disable_ipv6

It should return 0 or 1 as output, and we want it to be 1, which indicates that IPv6 is disabled.
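
If the value is still 0, the new settings have probably not been applied yet; reloading sysctl (or rebooting the machine) should pick them up:

#apply the settings from /etc/sysctl.conf without a reboot
sudo sysctl -p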

2.4 Installing SSH and Setting up certificate

Hadoop requires SSH access to manage its remote nodes as well as the node on the local machine. For this example, we need to configure SSH access to localhost.

So, we will make sure SSH is up and running and set up public key access so that Hadoop can log in without a password. We will set up an SSH key for password-less authentication. Use the following commands to perform the required steps.

ssh has two main components:

  • ssh: The command we use to connect to remote machines – the client.
  • sshd: The daemon that is running on the server and allows clients to connect to the server.

SSH is pre-enabled on Ubuntu, but to make sure sshd is enabled we need to install ssh first using the following command.

#installing ssh
sudo apt-get install ssh

To make sure everything is setup properly, use the following commands and make sure the output is similar to the one displayed in the screenshot.

#Checking ssh
which ssh

#Checking sshd
which sshd

Both of the above commands should show the path of the folder where ssh and sshd are installed, as shown in the screenshot below. This is to make sure that both are present on the system.

Checking ssh and sshd

Now, in order to generate the SSH key we will switch to the hduser user. In the following command we keep the password empty while generating the key; you can give it a password if you would like to.

#change to user hduser
su hduser

#generate ssh key
ssh-keygen -t rsa -P "" 

The second command will create an RSA key pair for the machine. The password for this key will be empty, as specified in the command. It will ask for the path to store the key, with the default path being $HOME/.ssh/id_rsa; just press Enter when prompted to keep the default. If you change the path, remember it, as it will be needed in the next step.

Generating ssh key

Enable SSH access to the machine with the key created in the previous step. For this, we have to add the key to the authorized keys list of the machine.

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

We can check whether SSH works as follows: if the SSH connection to localhost succeeds without a password prompt, the key is properly set up.

ssh localhost
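
If SSH still asks for a password, overly permissive file permissions on the .ssh directory are a common cause. A sketch of the usual fix, assuming the default key location:

#ssh refuses keys whose files are readable by other users
chmod 700 $HOME/.ssh
chmod 600 $HOME/.ssh/authorized_keys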

By now, we are done with all the prerequisites for Apache Hadoop. We will see how to set up Hadoop in the next section.

3. Installing Apache Hadoop

After all the prerequisites, we are ready to install Apache Hadoop on our Ubuntu 15.10 machine.

3.1 Download Apache Hadoop

  1. Download Hadoop from the Apache mirrors at www.apache.org/dyn/closer.cgi/hadoop/core. It can be downloaded manually or using the wget command, as shown in the sketch below.
  2. After the download finishes, extract the Hadoop folder, move it to /usr/local/hadoop and finally change the owner of the folder to the hduser user and the hadoop group.
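
If you choose wget, the following is a minimal sketch of the download and extraction steps, assuming version 2.7.1 fetched from the Apache archive mirror; adjust the version and mirror to match your download.

#download and extract hadoop 2.7.1 (example mirror and version)
cd ~/Downloads
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
tar -xzf hadoop-2.7.1.tar.gz
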
#Change to the directory
cd /usr/local

#move hadoop files to the directory
sudo mv /home/hadoop1/Downloads/hadoop-2.7.1 hadoop

#change the ownership of the folder to the hduser user and hadoop group
sudo chown -R hduser:hadoop hadoop

We can now check the permissions of the hadoop folder using the command:

ls -lah

This command lists the contents of the /usr/local/ directory along with their metadata. The hadoop folder should have hduser as the owner and hadoop as the group, as shown in the screenshot below.

Placing hadoop in required folder and assigning dedicated user as owner of hadoop

3.2 Updating bash

  1. Update the .bashrc file for the user hduser.

     su - hduser
     nano $HOME/.bashrc

  2. At the end of the file, add the following lines.

     export HADOOP_HOME=/usr/local/hadoop
     export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386

     #Some convenient aliases
     unalias fs &> /dev/null
     alias fs="hadoop fs"
     unalias hls &> /dev/null
     alias hls="fs -ls"

     export PATH=$PATH:$HADOOP_HOME/bin

The block of convenient aliases is optional and can be omitted. JAVA_HOME, HADOOP_HOME and PATH are the only compulsory requirements.
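
To make the new variables available in the current shell without logging out and back in, the file can be reloaded and the setup verified. A quick sanity check, assuming the paths set above:

#reload .bashrc and verify that hadoop is on the PATH
source $HOME/.bashrc
hadoop version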

Updating .bashrc file

3.3 Configuring Hadoop

In this step, we will configure Hadoop.

  1. Open hadoop-env.sh in /usr/local/hadoop/etc/hadoop/ and set the JAVA_HOME variable as shown below:
    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386
    
  2. Save the file using Ctrl+X and then Yes when prompted.

    Note: The path to Java should be the path where Java is installed on the system. By default it should be under the /usr/lib folder, but make sure it is the correct path for your system. Also, make sure it points to the version of Java you want to use. The following screenshot shows where it needs to be modified in hadoop-env.sh.

    Updating hadoop-env.sh file

  3. Next, we will configure core-site.xml in the folder /usr/local/hadoop/etc/hadoop/ and add the following property:

    <configuration>
       <property>  
          <name>fs.defaultFS</name>
          <value>hdfs://localhost:54310</value>
       </property>
    </configuration>
    
  4. This tells the system where the default file system should run.

    Updating core-site.xml

  5. Next we need to update hdfs-site.xml. This file specifies the directories which will be used as storage by the namenode and the datanode. These directories need to exist and be writable by hduser; see the sketch after this list.

    <configuration>
       <property>
          <name>dfs.replication</name>
          <value>2</value>
       </property>
       <property>
          <name>dfs.namenode.name.dir</name>
          <value>/usr/local/hadoop/hdfs/namenode</value>
       </property>
       <property>
          <name>dfs.datanode.data.dir</name>
          <value>/usr/local/hadoop/hdfs/datanode</value>
       </property>
    </configuration>   
    
  6. Updating hdfs-site.xml

  7. Now we will update the mapred-site.xml file. The folder /usr/local/hadoop/etc/hadoop/ contains the file mapred-site.xml.template; copy or rename it to mapred-site.xml before modifying it (see the sketch after this list).

    <configuration>
       <property>
          <name>mapreduce.jobtracker.address</name>
          <value>localhost:54311</value>
       </property>
    </configuration>
    
  8. Updating mapred-site.xml
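
Two small housekeeping steps go with the configuration above: the namenode and datanode directories referenced in hdfs-site.xml must exist and be owned by hduser, and mapred-site.xml has to be created from its template before it can be edited. A sketch, assuming the paths used above:

#create the HDFS storage directories referenced in hdfs-site.xml
sudo mkdir -p /usr/local/hadoop/hdfs/namenode
sudo mkdir -p /usr/local/hadoop/hdfs/datanode
sudo chown -R hduser:hadoop /usr/local/hadoop/hdfs

#create mapred-site.xml from the bundled template before editing it
cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml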

3.4 Formatting the Hadoop Filesystem

We are now done with all the configuration, so before starting the cluster we need to format the namenode. To do so, use the following command on the terminal.

hdfs namenode -format

This command should be executed without any error on the console output. If it is executed without any errors, we are good to start the Apache Hadoop instance on our Ubuntu system.
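
The command above assumes that $HADOOP_HOME/bin is on the PATH of the account running it (as set in the .bashrc step). If it is not, the full path works just as well; a sketch, run as the hduser account:

#format the namenode using the full path to the hdfs binary
su - hduser
/usr/local/hadoop/bin/hdfs namenode -format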

3.5 Starting Apache Hadoop

Now it is time to start the Hadoop. Following is the command to do so:

/usr/local/hadoop/sbin/start-dfs.sh

Starting Hadoop

Once DFS starts without any errors, we can check that everything is working using the jps command:

cd /usr/local/hadoop/sbin

#Checking the status of the Hadoop components
jps

This command displays all the components of Hadoop which are running properly; we should see at least a Namenode and a Datanode, as shown in the screenshot below.

jps command

Another option is to check the status of Apache Hadoop using the web interface of the Namenode at http://localhost:50070.
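
On a headless machine the same page can be checked from the terminal. This is just a quick sanity check, not a required step:

#confirm that the Namenode web UI is answering on port 50070
wget -qO- http://localhost:50070 | head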

Apache Hadoop web interface

The following screenshot displays the details of the Namenode in the web interface,

Namenode in Hadoop Web Interface

and the following screenshot shows the Datanode details in the Hadoop web interface

Datanode in Hadoop Web Interface

3.6 Testing MapReduce Job

  1. First of all, let's make the required HDFS directories and copy some input data for testing purposes.

    #Make the required directories in HDFS
    bin/hdfs dfs -mkdir /user
    bin/hdfs dfs -mkdir /user/hduser
    

    These directories can also be accessed from the web interface. To do so, go to the web interface, select ‘Utilities’ from the menu and then select ‘Browse the file system’ from the dropdown.

  2. Browse HDFS File System

  3. Now we can add some dummy files to the directory which we will use for testing. Let us copy all the files from the etc/hadoop folder.

    #Copy the input files into the distributed file system
    /usr/local/hadoop/bin/hdfs dfs -put /usr/local/hadoop/etc/hadoop input
    

    The following screenshot shows the files added to the directory /user/hduser/input in the web interface.

  4. Browse HDFS File System

  5. Run the MapReduce example job included in the Hadoop package using the following command:

    /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar grep input output 'dfs[a-z.]+'
    

    Note: For details on how MapReduce example works, refer to the article “Hadoop Hello World Example”

    Following screenshot shows the output log of the test example:

  6. Wordcount example console output

  7. We can now view the output file using the command

    /usr/local/hadoop/bin/hdfs dfs -cat output/*
    

    or using the web interface also as displayed in the screenshot below:

  8. Output folder in hdfs
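
If you prefer to keep the results, the output directory can also be copied from HDFS to the local filesystem. A sketch, using the output directory produced by the job above; the local target path is just an example:

#copy the job output out of HDFS and inspect it locally
/usr/local/hadoop/bin/hdfs dfs -get output /tmp/hadoop-output
cat /tmp/hadoop-output/*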

3.7 Stopping Apache Hadoop

We can now stop the DFS (distributed file system) using the following command:

/usr/local/hadoop/sbin/stop-dfs.sh

Stopping Apache Hadoop

4. Conclusion

This brings us to the end of the example. By now, we have Apache Hadoop installed on our Ubuntu system and we know how to add data to Hadoop and how to execute a job on the added data. After this, you can play around with Hadoop. You may also like to follow up with an example covering some of the common Hadoop File System commands.

Raman Jhajj

Ramaninder graduated from the Department of Computer Science and Mathematics of Georg-August University, Germany and currently works with a Big Data Research Center in Austria. He holds an M.Sc. in Applied Computer Science with a specialization in Applied Systems Engineering and a minor in Business Informatics. He is also a Microsoft Certified Professional with more than 5 years of experience in Java, C#, web development and related technologies. Currently, his main interests are the Big Data ecosystem, including batch and stream processing systems, machine learning and web applications.