In this article we address one of the questions most often asked by beginners in the Apache Hadoop and Big Data ecosystem.
That question is: is Hadoop a database? Or, more specifically, is Hadoop a relational database?
1. Is Hadoop a Database?
No, Hadoop is not a database. To understand the difference, we need to understand what exactly a database is and what exactly Apache Hadoop is.
1.1 Database and Relational Database
A database is a collection of data organized in a defined pattern so that it is easy to access, manage and update, and so that people or software can use it in a meaningful way.
Databases are mostly classified by their organizational approach, the most common being the relational database. A relational database stores data in tables and defines the relations between different types of data, so that the data can be reorganized and accessed in different ways.
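To make the idea of tables and relations concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `customers` and `orders` tables, their columns and the sample rows are hypothetical and exist only for illustration; any relational database would follow the same pattern.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory relational database with two related tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE orders (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id),
            amount      REAL NOT NULL
        );
        INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
        INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 5.0), (3, 2, 12.5);
        """
    )
    return conn


def total_per_customer(conn):
    """Follow the customer/order relation to compute a total per customer."""
    return conn.execute(
        """
        SELECT c.name, SUM(o.amount)
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.name
        ORDER BY c.name
        """
    ).fetchall()


if __name__ == "__main__":
    print(total_per_customer(build_demo_db()))
```

The same two tables could just as easily be queried per order or per amount range, which is the "reorganized and accessed in different ways" property described above.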
1.2 Apache Hadoop
Hadoop is an open source framework for storing and processing big data in a distributed fashion on large clusters of commodity hardware. It is an open source implementation of the MapReduce paradigm that Google introduced in its 2004 paper "MapReduce: Simplified Data Processing on Large Clusters".
Apache Hadoop is a massively scalable storage and batch processing system. It provides integrated storage and processing capabilities, scales horizontally on commodity hardware, and provides fault tolerance.
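The storage side of this can be pictured with a small sketch. HDFS cuts each file into fixed-size blocks and keeps several copies of every block on different machines, which is where the fault tolerance comes from. The code below is a simplified illustration only: the function names are hypothetical, the round-robin placement is an assumption made for brevity (real HDFS placement is rack-aware), and the sizes are toy numbers rather than realistic block sizes.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


if __name__ == "__main__":
    blocks = split_into_blocks(file_size=300, block_size=128)
    placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], replication=3)
    print(blocks, placement, readable_after_failure(placement, "n2"))
```

Adding more nodes to the `nodes` list is the toy equivalent of scaling horizontally: capacity grows by adding ordinary machines rather than by upgrading one large server.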
2. Can Hadoop Replace a Database?
People often ask whether Hadoop can replace a database. There is no straightforward answer to this question. Hadoop, or to be specific HDFS (Hadoop Distributed File System), can store data, and there are components that can present this data to the outside world in a relational-database-like structure for querying, but that is not the main competency of Apache Hadoop.
The main competency of Apache Hadoop is processing data and offloading heavy-duty analytic work from databases and similar systems, so that they can concentrate on what they are designed for. For example, consider an RDBMS used for serving data and ensuring the transactional consistency of all the data entered into it. Using the same RDBMS to process this data and generate complex analytics reports from the large volume of data stored in it is not the best strategy, because that work needs a significant amount of processing capacity that could otherwise be used for the main work of the system. Hadoop, as we know, is designed to store large amounts of data in a distributed fashion and then process this data in whatever way is necessary. So in this example scenario we can keep the RDBMS to serve the data and ensure transactional consistency, copy the data out of the RDBMS from time to time, and perform the required analytics on an Apache Hadoop cluster, completely separately from the RDBMS.
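The offloading pattern described above can be sketched in a few lines. This is only an illustration under stated assumptions: sqlite3 stands in for the transactional RDBMS, a CSV string stands in for the file that would be copied to the Hadoop cluster, and the `orders` table and all function names are hypothetical. The point is the separation of concerns, not the specific tools.

```python
import csv
import io
import sqlite3
from collections import defaultdict


def make_oltp_db():
    """Stand-in for the transactional database that serves live traffic."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (region, amount) VALUES (?, ?)",
        [("east", 10.0), ("west", 4.0), ("east", 2.5)],
    )
    return conn


def export_snapshot(conn):
    """Dump the orders table to CSV text; stands in for a periodic export job."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["region", "amount"])
    writer.writerows(conn.execute("SELECT region, amount FROM orders ORDER BY id"))
    return buf.getvalue()


def analyze_snapshot(csv_text):
    """Batch analytics over the exported file, without touching the database."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["region"]] += float(row["amount"])
    return dict(totals)


if __name__ == "__main__":
    snapshot = export_snapshot(make_oltp_db())
    print(analyze_snapshot(snapshot))
```

Once the snapshot has been exported, `analyze_snapshot` never opens a database connection, so the analytic work puts no load on the system that serves transactions.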
3. Differences Between Hadoop and Relational Database Management Systems
There are a few specific differences between Apache Hadoop and a Relational Database Management System (RDBMS), which we discuss below:
- The storage mechanisms in Apache Hadoop and an RDBMS are completely different. Relational databases store information in tables defined by a specific schema, whereas Apache Hadoop stores data as files in HDFS and its MapReduce model works with key-value pairs as the fundamental unit of data. Some NoSQL databases also use key-value storage, but relational databases do not.
- In relational databases, SQL is used to query the data. A query specifies only what data is required, not how that data is obtained. Apache Hadoop, on the other hand, uses MapReduce programs, which specify both the what and the how.
- There is also a difference in how relational databases scale and how Hadoop scales. To scale a relational database, a lot of horsepower has to be added to the system, usually on specific database-class servers. To scale Hadoop, many commodity hardware machines with ordinary horsepower can be added to the cluster.
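The "what and how" point in the list above is easiest to see in code. The following is a minimal single-process sketch of the MapReduce pattern, written in plain Python rather than against the Hadoop API; the function names are illustrative assumptions. In SQL the same word count would be a single declarative `GROUP BY` query, whereas here the programmer spells out how key-value pairs are produced, grouped and combined.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) key-value pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine the grouped values for one key into a result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data", "Hadoop processes data"]))
```

On a real cluster the map and reduce steps run in parallel on many machines and the framework performs the shuffle over the network, but the key-value contract between the phases is the same as in this sketch.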
To summarise, Apache Hadoop is neither a database nor relational storage. Its main competency is processing data in a distributed fashion. It does have a storage component called HDFS (Hadoop Distributed File System), which stores the files used for processing, but HDFS does not qualify as a relational database. It is just a storage layer.
There are components like Hive that work on top of HDFS and allow users to query the data stored in HDFS with an SQL-like language called HiveQL. Being able to run SQL-like queries, however, does not make HDFS or Apache Hadoop a database or a relational database.