Apache Hadoop as a Service Options
In this article, we will look at the available options for using Hadoop as a Service, also known as HDaaS. Implementing a Hadoop cluster on in-house infrastructure is a complex task in itself and needs a dedicated, expert team. To address this complexity, many vendors provide cloud implementations of Hadoop clusters, and we will look at some of these options.
1. Introduction
Apache Hadoop is a popular big data processing infrastructure and is often claimed to be used in around 80% of big data processing use cases. It is popular because it provides a highly scalable environment for big data processing on commodity hardware. Expanding a Hadoop cluster as requirements grow is fairly straightforward and does not negatively affect applications that are already running. But there is a downside too: Hadoop is complex, and running the Hadoop ecosystem requires significant knowledge, training and expert IT staff.
Fortunately, this downside has solutions, with plenty of options to choose from. Cloud infrastructure comes to the rescue in such scenarios: many vendors provide Hadoop-as-a-Service on top of their cloud offerings.
The Hadoop-as-a-Service market is dominated by large and medium-sized service vendors. It keeps growing as big data analytics companies enter it as well, offering data analytics on top of their cloud HDaaS services in addition to bare-bones HDaaS.
The availability of Hadoop as a Service makes things a lot easier than implementing a Hadoop cluster on premises, and it makes implementing big data applications easier and quicker. Using Hadoop technology and clusters is difficult without proper training and technology, and Hadoop as a Service eases this transition. Many providers offer Hadoop as a Service in the cloud, and in this article we will look at some of them.
Running Hadoop in the cloud as HDaaS is not cheap, but it usually costs a lot less than setting up in-house Hadoop clusters. It also removes much of the cluster management burden and the need for a dedicated IT team to handle and maintain an on-premises cluster.
2. Things to consider before choosing a vendor
There are a few basic things to consider before deciding on a Hadoop-as-a-Service vendor. These are the most basic features to check, and they matter most for running applications on the cluster without problems.
- Performance level and Quality of Service: Running an application requires transferring a lot of data in and out of the cloud, which naturally adds some latency. Before deciding on a vendor, evaluate the performance and quality of service it provides with due diligence, so that issues like high latency and slow processing do not become common.
- Highly elastic compute environment: Hadoop can maintain highly elastic clusters for varying workloads. With a cloud service, it is even more important to check whether the vendor has a highly elastic compute environment, because we are already dealing with network delays and it would not be good to add computation delays on top of that latency. The vendor must maintain highly dynamic and elastic environments.
- Persistent data storage in HDFS: Hadoop does not make it compulsory to use HDFS as the persistent data store; any other compatible data store can be used, but HDFS is the most preferred one. As HDFS is the native implementation, it works seamlessly with YARN and MapReduce, and with the introduction of in-memory caching it is on par with any third-party implementation.
- Availability of non-stop operations: Recovering from processing failures is quite important in Hadoop clusters. If this capability is missing and the whole job needs to be restarted after a processing failure, money, time and resources are wasted. Make sure the vendor provides non-stop operations, i.e. the capability to restart an operation from the beginning of the failed sub-service and not from the beginning of the entire job.
These are not the only considerations to compare before choosing a vendor, but they are important, basic features that should be available for problem-free management.
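The idea behind non-stop operations, restarting from the failed step instead of rerunning the whole job, can be illustrated with a small Python sketch. This is not how Hadoop or any particular vendor implements recovery; the function name run_with_checkpoint, the stage names and the JSON checkpoint file are all assumptions made up for this illustration.

```python
import json
import os
import tempfile


def run_with_checkpoint(stages, checkpoint_path):
    """Run (name, callable) stages in order, skipping stages already completed.

    Completed stage names are saved to checkpoint_path after every stage, so a
    rerun after a failure resumes at the stage that failed.
    """
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            done = json.load(fh)

    executed = []
    for name, stage in stages:
        if name in done:
            continue  # finished in an earlier attempt, do not repeat it
        stage()  # if this raises, earlier stages stay recorded on disk
        done.append(name)
        executed.append(name)
        with open(checkpoint_path, "w") as fh:
            json.dump(done, fh)
    return executed


if __name__ == "__main__":
    # Hypothetical two-stage job; real stages would do actual processing.
    path = os.path.join(tempfile.mkdtemp(), "job.checkpoint.json")
    stages = [("extract", lambda: None), ("transform", lambda: None)]
    print(run_with_checkpoint(stages, path))  # first run executes both stages
    print(run_with_checkpoint(stages, path))  # second run has nothing left to do
```

When a stage fails, calling the function again with the same checkpoint file skips the stages that already succeeded, which is the behaviour worth asking a vendor about.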
3. Hadoop as a Service Options
In this section, we will look at some of the available vendors that provide Hadoop as a Service on their own managed cloud infrastructure or are compatible with other cloud infrastructure providers.
3.1 Amazon EMR
Amazon Elastic MapReduce (Amazon EMR) is one of the best-known and most widely used services for quick and cost-effective processing of large amounts of data. It provides a managed Hadoop framework implementation that can process vast amounts of data across dynamically scalable Amazon Elastic Compute Cloud (EC2) instances. Amazon makes use of its existing cloud services to provide Hadoop as a Service. Besides Hadoop MapReduce, Amazon EMR also provides other distributed frameworks such as Apache Spark and Presto by default.
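As a rough illustration of what requesting such a managed cluster can look like, the Python sketch below builds the keyword arguments for the run_job_flow call of the boto3 EMR client. The helper name build_emr_request, the instance type, the release label and the default role names are assumptions chosen for this example; check them against your own AWS account before use. The sketch only builds the request and does not contact AWS.

```python
def build_emr_request(name, worker_count, instance_type="m5.xlarge"):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    All concrete values (release label, roles, instance type) are
    illustrative assumptions, not recommendations.
    """
    if worker_count < 1:
        raise ValueError("need at least one worker node")
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release, pick a current one
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": worker_count + 1,  # workers plus one master node
            # Shut the cluster down once there are no steps left to run.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default EC2 role
        "ServiceRole": "EMR_DefaultRole",  # assumed default service role
    }
```

With AWS credentials configured, the resulting dictionary could be submitted with boto3.client("emr").run_job_flow(**build_emr_request("demo", 2)). That call creates billable resources, so it is deliberately left out of the sketch.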
3.2 IBM InfoSphere BigInsights
IBM InfoSphere BigInsights provides Hadoop as a Service using the open source Apache Hadoop implementation on IBM’s own SoftLayer global cloud infrastructure. BigInsights also provides analytics services with which users can analyse and model large amounts of data with ease. It offers good flexibility, with support for processing structured, semi-structured and unstructured data.
3.3 EMC2
EMC2 is also a large player, with multiple offerings under the name Greenplum. It provides Hadoop as a Service called Greenplum Apache Hadoop Distribution, along with other products such as Greenplum Data Computing Appliance, Greenplum Database and Greenplum Chorus.
3.4 Microsoft’s HDInsight
Microsoft’s HDInsight is a Hadoop cloud service option that can scale to petabytes of data if required. It can process unstructured and semi-structured data. HDInsight is also based on open source Apache Hadoop and thus provides a good amount of flexibility in the types of data that can be processed. It can be deployed on Windows as well as Linux instances and supports multiple development languages, including Java and Microsoft’s own .NET.
3.5 Google-Qubole Service
Google and Qubole have partnered to provide a fully elastic Hadoop-as-a-Service offering. It combines Google Compute Engine’s high-performance, reliable and scalable infrastructure with Qubole’s auto-scaling, self-managing and integrated implementation to offer Hadoop-as-a-Service directly on Google Cloud Platform. Using this service, users can run MapReduce jobs directly on data stored in Google Cloud Storage and BigQuery without copying the data to local disk and running a standalone HDFS (Hadoop Distributed File System).
3.6 HP Cloud
HP Cloud provides an elastic cloud computing and cloud storage platform to analyse and index large data volumes, which can range up to hundreds of petabytes. HP Helion Public Cloud provides the underlying infrastructure required for the analysis and indexing.
3.7 Altiscale
Altiscale is another vendor providing Hadoop as a cloud service as its main offering, using Apache Hadoop. It also provides operational support for the Hadoop services that users run on its cloud. Altiscale says its implementation of Apache Hadoop is purpose-built and optimised, and more reliable and easier to use than those of other service providers.
3.8 Infochimps
Cloud::Hadoop is a cloud service provided by Infochimps Cloud. Infochimps provides advanced elastic spin-up/spin-down capabilities, scalability, and customization on the fly. Besides Hadoop, it also provides other tools such as Hive, Pig and Wukong.
3.9 Teradata Analytics in the Cloud
Teradata provides a purpose-built, managed environment that can be deployed in its own managed cloud, on other cloud providers such as Amazon Web Services, or on in-house infrastructure.
3.10 Pentaho Cloud Business Analytics
Pentaho provides a platform that can run both on cloud infrastructure such as Amazon Web Services and Google Cloud and on in-house Hadoop cluster infrastructure. It provides a highly flexible platform for blending, orchestrating, and analyzing data from many sources. Pentaho can integrate and analyze leading big data sources in the cloud, and access and transform data from web services and enterprise SaaS applications.
4. Conclusion
The Hadoop architecture requires a highly scalable and dynamic computing infrastructure and Hadoop experts to handle the setup. If a business decides to use a Hadoop-as-a-Service offering, it does not have to hire those experts and can get these services from the vendor. The more expertise, customized configuration and capacity the customer needs, the more expensive the service is, but these expenses are usually less than running large Hadoop clusters on site. So if you are looking to set up a Hadoop cluster, make sure to compare the costs of in-house infrastructure with these service providers and choose wisely.
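One simple way to structure such a cost comparison is sketched below in Python. The cost model is an assumption made for illustration: it amortizes hardware evenly over its lifetime for the in-house case and bills per node-hour for the hosted case, and it ignores factors such as data transfer fees or migration effort. All function and parameter names are hypothetical, and real figures must come from your own quotes and budgets.

```python
def monthly_cost_on_prem(hardware_cost, amortization_months, staff_cost, facility_cost):
    """Monthly in-house cost: amortized hardware plus staff and facilities."""
    if amortization_months <= 0:
        raise ValueError("amortization_months must be positive")
    return hardware_cost / amortization_months + staff_cost + facility_cost


def monthly_cost_hosted(node_hour_rate, nodes, hours_per_month, storage_cost, support_cost=0.0):
    """Monthly hosted cost: billed node-hours plus storage and optional support."""
    return node_hour_rate * nodes * hours_per_month + storage_cost + support_cost


def cheaper_option(on_prem, hosted):
    """Return which monthly figure is lower, or 'equal' if they match."""
    if on_prem < hosted:
        return "on-prem"
    if hosted < on_prem:
        return "hosted"
    return "equal"
```

Because a hosted cluster is billed only for the hours it runs, the hours_per_month input often drives the result: a cluster that sits idle most of the month favours the hosted option, while one that is busy around the clock narrows the gap.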