Prerequisites for Learning Hadoop
In this article, we will look at the prerequisites for learning and working with Hadoop: what is actually required, and what the industry commonly suggests you know before you start learning Hadoop.
1. Introduction
Apache Hadoop is the entry point, or we can say the base, for entering the whole Big Data ecosystem. It serves as the foundation for most of the advanced tools, applications, and frameworks in that ecosystem, but there are some things you should know before learning Apache Hadoop itself.
There are no strict prerequisites for starting to learn Apache Hadoop. However, the following are good to know: they make things easier, especially if you want to become an expert in Apache Hadoop.
So the three basic prerequisites for Apache Hadoop are:
- Java
- Linux
- SQL
We will cover these in the next sections.
2. Java
Knowing Java is not a strict prerequisite for working with Hadoop, but it is an added advantage when you want to dig deep and understand how Apache Hadoop works.
It might sound strange that the first prerequisite I have mentioned is Java, and yet I am saying that it is not a strict prerequisite but an addition. Let us see why.
There are tools and applications such as Pig and Hive which are built on top of Hadoop. These tools offer their own high-level languages for working with the data stored and processed on an Apache Hadoop cluster: Pig Latin for Pig and HiveQL for Hive. So people who do not want to write complex MapReduce applications, but only want to interact with the data in the cluster using Hive or Pig, can skip Java.
Java is also not the only option for writing Hadoop MapReduce applications. Through a component called Hadoop Streaming, Hadoop lets you write MapReduce programs in any language that can read from standard input and write to standard output, for example Python, Ruby, or C. But as Apache Hadoop itself is written in Java, Java is the language to go with if you want to work as closely as possible with its components. Pig Latin and HiveQL statements are also converted internally into Java MapReduce programs and executed.
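To make the standard input / standard output contract of Hadoop Streaming more concrete, here is a minimal word-count sketch in Python. It is only an illustration, not an official Hadoop example: the function names (mapper, reducer, run_locally) and the "map" / "reduce" command-line argument are assumptions made for this sketch. The run_locally helper imitates, in a single process, the sort that Hadoop performs between the map and reduce steps.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Map step: emit one 'word<TAB>1' record for every word read."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Reduce step: sum the counts of consecutive records with the same key.

    Hadoop Streaming sorts the mapper output by key before the reducer
    reads it, so all records for one word arrive next to each other.
    """
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Imitate map -> sort -> reduce in one process (no cluster needed)."""
    return list(reducer(sorted(mapper(lines))))


# Hypothetical wiring for use as a streaming script:
#   python wordcount.py map      (reads text lines from stdin)
#   python wordcount.py reduce   (reads sorted 'word<TAB>count' from stdin)
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = mapper if sys.argv[1] == "map" else reducer
    for record in step(sys.stdin):
        print(record)
```

On a real cluster, the same script would be passed as the mapper and reducer commands to the Hadoop Streaming jar (its -mapper and -reducer options), and Hadoop would take care of splitting the input, sorting, and distributing the records.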
So, if you want to know the nuts and bolts of Apache Hadoop, or if your requirements become more and more complex, Java is a prerequisite for Apache Hadoop.
Note: To learn more about the basics of Hadoop MapReduce, Hadoop Streaming, and Hive, follow the articles below:
But still, why use Java when we have Hadoop Streaming?
Hadoop Streaming certainly lets you write MapReduce programs in many languages, but some advantages and advanced features are, as of now, available only through the Java API of Apache Hadoop.
So, Java is not a strict prerequisite for learning Hadoop, but industry use cases make it highly recommended.
3. Linux
Although Apache Hadoop can run on Windows, it was initially built on and for Linux. Linux is the preferred platform for installing and managing a Hadoop cluster, so an understanding of how to work with Linux also helps a lot.
When it comes to managing the Hadoop Distributed File System (HDFS) from the command line, many of the commands resemble, or are exactly the same as, Linux shell commands. To learn about HDFS and HDFS shell commands, refer to the articles:
Besides that, we also need to know Linux if we want to deploy and configure a Hadoop cluster or even a single-node machine.
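The resemblance between Linux shell commands and HDFS shell commands can be sketched with a small lookup table. The following Python snippet is only an illustration: the table covers a handful of common commands, and the helper name to_hdfs_command is made up for this sketch. It does not run anything on a cluster; it only builds the command string. Arguments are passed through unchanged, which is an assumption that does not always hold, because some options differ between the two shells.

```python
import shlex

# Linux command name -> matching `hdfs dfs` option (same name, leading dash).
HDFS_EQUIVALENTS = {
    "ls": "-ls",
    "mkdir": "-mkdir",
    "cat": "-cat",
    "cp": "-cp",
    "mv": "-mv",
    "rm": "-rm",
    "du": "-du",
    "chmod": "-chmod",
}


def to_hdfs_command(linux_command):
    """Translate e.g. 'ls /data' into 'hdfs dfs -ls /data'.

    Only the command name is translated; arguments are copied as they are.
    """
    name, *args = shlex.split(linux_command)
    if name not in HDFS_EQUIVALENTS:
        raise ValueError(f"no HDFS equivalent listed for {name!r}")
    return shlex.join(["hdfs", "dfs", HDFS_EQUIVALENTS[name], *args])


if __name__ == "__main__":
    for command in ("ls /user/data", "mkdir -p /tmp/demo", "cat /tmp/demo/a.txt"):
        print(f"{command:<22} ->  {to_hdfs_command(command)}")
```

Someone who already knows ls, mkdir, cat, and rm therefore has very little new syntax to learn for everyday HDFS file management.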
4. SQL
People who are already familiar with SQL can make use of their existing knowledge by using the SQL-like syntax of Hive. The Apache Hive query language (HiveQL) is very similar to ANSI SQL. Besides Hive, Apache Pig also has many operations which are similar to SQL commands, for example joins, group by, and order by. Other Big Data ecosystem tools besides Apache Hadoop also provide SQL-like interfaces, which makes them easier to learn for users who already know SQL. Cassandra (with its query language CQL) and HBase (through Apache Phoenix) are examples of tools that can be queried in an SQL-like way.
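As a small illustration of the kind of SQL knowledge that carries over, the sketch below runs a join with group by and order by. It uses Python's built-in sqlite3 module as a stand-in, so it is not Hive itself; the table names, column names, and data are invented for the example. The query uses only clauses that HiveQL also supports, so it should need little or no change to run on Hive tables with the same columns, although that has not been verified here.

```python
import sqlite3

# A join + group by + order by query written in plain SQL.
QUERY = """
SELECT c.name, COUNT(*) AS order_count, SUM(o.amount) AS total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.id
GROUP BY c.name
ORDER BY total_amount DESC
"""


def run_demo():
    """Create two tiny in-memory tables and run the query against them."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER, name TEXT);
        CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
        INSERT INTO customers VALUES (1, 'alice'), (2, 'bob');
        INSERT INTO orders VALUES (1, 30), (1, 20), (2, 40);
        """
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for name, order_count, total_amount in run_demo():
        print(name, order_count, total_amount)
```

The point is not the database engine but the query shape: anyone comfortable writing this kind of statement already knows most of what is needed for day-to-day Hive queries.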
5. Conclusions
As discussed, there are no strict prerequisites for starting to learn Apache Hadoop, but there are things we should be familiar with before digging deep into it. We went through these prerequisites one at a time to see where and how they are used and where we will need them. It is good to know some or all of them before diving into Apache Hadoop.