Apache Hadoop

Prerequisites for Learning Hadoop

In this article, we look closely at the prerequisites for learning and working with Hadoop. We will see which skills are required and which ones industry practice recommends you have before you start learning Hadoop.

1. Introduction

Apache Hadoop is the entry point, or we can say the base, for the whole Big Data ecosystem. It serves as the foundation for most of the advanced tools, applications, and frameworks in that ecosystem, but there are a few things you should know before you start learning Apache Hadoop itself.

There are no strict prerequisites for starting to learn Apache Hadoop. However, the skills below make things easier, and they are good to know if you want to become an expert in Apache Hadoop.

So the three basic prerequisites for Apache Hadoop are:

  1. Java
  2. Linux
  3. SQL

We will cover these in the next sections.

2. Java

Knowing Java is not a strict prerequisite for working with Hadoop, but it is a clear advantage when you want to dig deep and understand how Apache Hadoop works.

It might sound strange that the first prerequisite I have mentioned is Java, and that I am also saying it is not a strict prerequisite but an addition. Let us see why.

There are tools and applications, such as Pig and Hive, which are built on top of Hadoop. These tools offer their own high-level languages for working with the data stored and processed on an Apache Hadoop cluster, for example Pig Latin for Pig and HiveQL for Hive. So people who do not want to write complex MapReduce applications, but only want to interact with the data in the cluster using Hive or Pig, can skip Java.

Java is also not the only option for writing Hadoop MapReduce applications. Through a component called Hadoop Streaming, Hadoop lets us write MapReduce programs in any language that can read from standard input and write to standard output, for example Python, Ruby, or C. But because Apache Hadoop is written in Java, Java is the language to choose if you want to work as closely as possible with its components. Pig Latin and HiveQL commands are also converted internally to Java MapReduce programs and then executed.
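To make the standard input and standard output idea concrete, here is a minimal word count sketch in Python written in the style that Hadoop Streaming expects. The file name wordcount.py, the function names, and the "map"/"reduce" command-line argument are assumptions made for this illustration and are not part of Hadoop. The reducer relies on its input being sorted by key, which is what the framework provides between the map and reduce steps.

```python
#!/usr/bin/env python3
"""Word count as a Hadoop Streaming-style mapper and reducer (illustrative sketch)."""
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts per word. The input must already be sorted by word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical usage: "wordcount.py map" or "wordcount.py reduce".
# Both steps read records from stdin and write records to stdout.
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

The same logic can be tried locally without a cluster by piping a text file through the map step, then through sort, then through the reduce step. On a cluster, the two steps would be passed to the Hadoop Streaming jar through its -mapper and -reducer options.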

So, if you want to know the nuts and bolts of Apache Hadoop, or if your requirements become more and more complex, Java is a prerequisite for Apache Hadoop.

Note: To learn more about the basics of Hadoop MapReduce, Hadoop Streaming, and Hive, follow the articles below:

But still, why use Java when we have Hadoop Streaming?
Hadoop Streaming certainly lets us write MapReduce programs in many languages, but some advantages and advanced features are, as of now, available only through the Java API in Apache Hadoop.

So, Java is not a strict prerequisite for learning Hadoop, but industry use cases make it highly recommended.

3. Linux

Although Apache Hadoop can run on Windows, it was originally built on and for Linux. Linux is the preferred platform for installing and managing a Hadoop cluster, so understanding how to work with Linux also helps a lot.

When it comes to managing the Hadoop Distributed File System (HDFS) from the command line, many of the commands resemble, or are exactly the same as, Linux shell commands. To learn about HDFS and the HDFS shell commands, refer to the articles:

Besides that, we also need to know Linux if we want to deploy and configure a Hadoop cluster or even a single-node machine.
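As a rough illustration of how closely the HDFS shell follows Linux, the Python sketch below pairs a few familiar Linux commands with their "hdfs dfs" counterparts. The helper name to_hdfs_command and the example path are hypothetical. Only the command name and the path arguments are translated here, because options do not always carry over unchanged between the two shells.

```python
# A few familiar Linux commands and the matching "hdfs dfs" subcommands.
LINUX_TO_HDFS = {
    "ls": "-ls",
    "mkdir": "-mkdir",
    "cat": "-cat",
    "cp": "-cp",
    "mv": "-mv",
    "rm": "-rm",
    "du": "-du",
    "chmod": "-chmod",
}


def to_hdfs_command(linux_command):
    """Translate a simple Linux command such as 'ls /user/data' to its HDFS form."""
    name, *args = linux_command.split()
    if name not in LINUX_TO_HDFS:
        raise ValueError(f"no HDFS counterpart listed for {name!r}")
    return " ".join(["hdfs", "dfs", LINUX_TO_HDFS[name], *args])


# Hypothetical path, used only for illustration.
print(to_hdfs_command("ls /user/data"))
```

Someone who already knows what ls, mkdir, or cat do on Linux can therefore guess what the corresponding HDFS commands do.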

4. SQL

People who are already familiar with SQL can make use of their existing knowledge, because Hive offers an SQL-like syntax. The Apache Hive query language is very similar to ANSI SQL. Besides Hive, Apache Pig also has many operations that are similar to SQL commands, for example joins, group by, and order by. Other tools in the Big Data ecosystem also provide SQL-like interfaces, which makes them easier to learn for users who already know SQL. Cassandra, with its own query language CQL, and HBase, through SQL layers such as Apache Phoenix, are examples of tools that offer an SQL-like query interface for interacting with data.
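The short Python sketch below shows the kind of query this refers to. It runs on SQLite from the Python standard library rather than on Hive, and the table names, column names, and rows are invented for the illustration. The SELECT statement uses only a join, group by, and order by, so a very similar statement should be usable in HiveQL with little or no change.

```python
import sqlite3

# In-memory SQLite database with two small, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, dept_id INTEGER);
CREATE TABLE departments (dept_id INTEGER, dept_name TEXT);
INSERT INTO employees VALUES ('Asha', 1), ('Ben', 1), ('Chen', 2);
INSERT INTO departments VALUES (1, 'Engineering'), (2, 'Sales');
""")

# Join, group by, and order by: the same constructs mentioned in the text.
QUERY = """
SELECT d.dept_name, COUNT(*) AS headcount
FROM employees e
JOIN departments d ON e.dept_id = d.dept_id
GROUP BY d.dept_name
ORDER BY headcount DESC
"""

rows = conn.execute(QUERY).fetchall()
print(rows)  # [('Engineering', 2), ('Sales', 1)]
```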

5. Conclusions

As discussed, there are no strict prerequisites for starting to learn Apache Hadoop, but there are certainly things we should be familiar with before digging deep into it. We went through these prerequisites one at a time to see where and how each is used and when we will need it. It is good to know some or all of them before diving into Apache Hadoop.

Raman Jhajj

Ramaninder graduated from the Department of Computer Science and Mathematics of Georg-August University, Germany, and currently works with a Big Data Research Center in Austria. He holds an M.Sc. in Applied Computer Science with a specialization in Applied Systems Engineering and a minor in Business Informatics. He is also a Microsoft Certified Professional with more than 5 years of experience in Java, C#, web development, and related technologies. Currently, his main interests are in the Big Data ecosystem, including batch and stream processing systems, machine learning, and web applications.