Apache Hadoop Mahout Tutorial
1. Introduction
This is an in-depth article on Apache Mahout, which is used to build Machine Learning solutions on top of Hadoop. Both Hadoop and Mahout are Apache open-source projects. Apache Mahout started in 2008 as a subproject of the Apache Lucene project and became an independent top-level project in 2010.
2. Apache Hadoop Mahout
2.1 Prerequisites
Java 7 or 8 is required on Linux, Windows, or macOS. Maven 3.6.1 is required. Apache Hadoop 2.9.1 and Mahout 0.9 are used in this example.
2.2 Download
Java 8 can be downloaded from the Oracle website. Apache Maven 3.6.1 can be downloaded from the Apache site. Apache Hadoop 2.9.1 can be downloaded from the Hadoop website. Apache Mahout 0.9 can be downloaded from the Apache Mahout website.
2.3 Setup
You can set the environment variables for JAVA_HOME and PATH. They can be set as shown below:
Setup
JAVA_HOME="/desktop/jdk1.8.0_73"
export JAVA_HOME
PATH=$JAVA_HOME/bin:$PATH
export PATH
The environment variables for Maven are set as below:
Maven Environment
JAVA_HOME="/jboss/jdk1.8.0_73"
export M2_HOME=/users/bhagvan.kommadi/Desktop/apache-maven-3.6.1
export M2=$M2_HOME/bin
export PATH=$M2:$PATH
2.4 How to download and install Hadoop and Mahout
After downloading the Hadoop and Mahout zip files, they can be extracted to separate folders. The jar files in their lib folders are then added to the CLASSPATH variable.
2.5 Apache Mahout
The name "Mahout" comes from the Hindi word for an elephant rider, a nod to the Hadoop elephant. Apache Mahout is used for developing solutions involving ML algorithms such as recommendation, classification, and clustering. Mahout provides a data-mining framework, data-analysis tools, clustering and classification implementations, evolutionary programming techniques, and matrix and vector libraries. Social media companies such as Facebook, Yahoo, LinkedIn, and Foursquare use Mahout. Foursquare uses Mahout's recommender engine to help identify places, Twitter has chosen Mahout for user-interest modeling, and Yahoo uses it for pattern matching.
2.6 Apache Hadoop Configuration
You need to configure HADOOP_HOME as below:
Setup
export HADOOP_HOME=/users/bhagvan.kommadi/desktop/hadoop-2.9.1/
You need to configure $HADOOP_HOME/etc/hadoop/core-site.xml as below:
Core Site XML file
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://apples-MacBook-Air.local:8020</value>
  </property>
</configuration>
You need to start running Hadoop by using the command below:
Hadoop Execution
cd hadoop-2.9.1/
cd sbin
./start-dfs.sh
The output of the commands is shown below:
Hadoop Execution
apples-MacBook-Air:sbin bhagvan.kommadi$ ./start-dfs.sh
20/09/14 20:26:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
Starting namenodes on [apples-MacBook-Air.local]
apples-MacBook-Air.local: Warning: Permanently added the ECDSA host key for IP address 'fe80::4e9:963f:5cc3:a000%en0' to the list of known hosts.
Password:
apples-MacBook-Air.local: starting namenode, logging to /Users/bhagvan.kommadi/desktop/hadoop-2.9.1/logs/hadoop-bhagvan.kommadi-namenode-apples-MacBook-Air.local.out
Password:
localhost: starting datanode, logging to /Users/bhagvan.kommadi/desktop/hadoop-2.9.1/logs/hadoop-bhagvan.kommadi-datanode-apples-MacBook-Air.local.out
Starting secondary namenodes [0.0.0.0]
Password:
0.0.0.0: starting secondarynamenode, logging to /Users/bhagvan.kommadi/desktop/hadoop-2.9.1/logs/hadoop-bhagvan.kommadi-secondarynamenode-apples-MacBook-Air.local.out
20/09/14 20:27:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
2.7 Apache Hadoop Mahout
Let us start with a basic use case for Mahout: a recommendation engine. The first step is to create a data model. Below is the data model.
Data Model
1,11,2.0
1,12,5.0
1,13,5.0
1,14,5.0
1,15,4.0
1,16,5.0
1,17,1.0
1,18,5.0
2,10,1.0
2,11,2.0
2,15,5.0
2,16,4.5
2,17,1.0
2,18,5.0
3,11,2.5
3,12,4.5
3,13,4.0
3,14,3.0
3,15,3.5
3,16,4.5
3,17,4.0
3,18,5.0
4,10,5.0
4,11,5.0
4,12,5.0
4,13,0.0
4,14,2.0
4,15,3.0
4,16,1.0
4,17,4.0
4,18,1.0
The PearsonCorrelationSimilarity class is used to create the user similarity. It takes the user data model in its constructor. The data model file has User, Item, and Preference columns for the product. The data model file is passed to the FileDataModel constructor.
Data Model
DataModel usermodel = new FileDataModel(new File("userdata.txt"));
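To make the User,Item,Preference file format concrete, here is a minimal, library-free sketch of parsing such triples. The TripleParser class is a hypothetical helper for illustration only; FileDataModel does this parsing (and much more) internally.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the parsing FileDataModel performs:
// each line of userdata.txt is one "user,item,preference" triple.
public class TripleParser {
    static double[] parseLine(String line) {
        String[] parts = line.split(",");
        return new double[] {
            Double.parseDouble(parts[0]),  // user id
            Double.parseDouble(parts[1]),  // item id
            Double.parseDouble(parts[2])   // preference value
        };
    }

    public static void main(String[] args) {
        List<double[]> triples = new ArrayList<>();
        for (String line : new String[] {"1,11,2.0", "1,12,5.0"}) {
            triples.add(parseLine(line));
        }
        System.out.println(triples.size());     // 2
        System.out.println(triples.get(0)[2]);  // 2.0
    }
}
```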
The next step is to create a UserSimilarity by using PearsonCorrelationSimilarity, as shown below in the RecommenderBuilder class.
User Similarity
RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
    public Recommender buildRecommender(DataModel model) throws TasteException {
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        return null; // placeholder; the recommender is returned in the final version
    }
};
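To illustrate the statistic that PearsonCorrelationSimilarity computes over two users' co-rated items, here is a self-contained sketch of the Pearson correlation between two preference vectors. This is a simplified illustration, not Mahout's actual implementation:

```java
// Library-free sketch of the Pearson correlation coefficient between
// two equal-length preference vectors (co-rated items of two users).
public class PearsonSketch {
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumX2 += x[i] * x[i];
            sumY2 += y[i] * y[i];
        }
        double num = sumXY - sumX * sumY / n;
        double den = Math.sqrt((sumX2 - sumX * sumX / n) * (sumY2 - sumY * sumY / n));
        return den == 0 ? 0 : num / den;  // result lies in [-1.0, 1.0]
    }

    public static void main(String[] args) {
        // Two users with proportional rating patterns correlate at 1.0.
        System.out.println(pearson(new double[]{1, 2, 3}, new double[]{2, 4, 6}));  // 1.0
    }
}
```

A similarity of 1.0 means two users' preferences move together perfectly, -1.0 means they are exactly opposed, and values near 0 mean no linear relationship.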
A UserNeighborhood defines the set of similar users from which recommendations are drawn. ThresholdUserNeighborhood is a neighborhood of all the users whose similarity to the given user meets or exceeds a certain threshold, while NearestNUserNeighborhood selects the N most similar users. In the code below, NearestNUserNeighborhood is used with the 2 nearest neighbors.
User Similarity
RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
    public Recommender buildRecommender(DataModel model) throws TasteException {
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
        return null; // placeholder; the recommender is returned in the final version
    }
};
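For intuition, a threshold-based neighborhood simply keeps every user whose similarity to the target meets the threshold. The sketch below is a library-free illustration of that selection rule, not Mahout's ThresholdUserNeighborhood class itself; since Pearson similarity lies between -1.0 and 1.0, a realistic threshold is a value such as 0.5.

```java
import java.util.ArrayList;
import java.util.List;

// Library-free sketch of threshold-based neighborhood selection:
// keep every user whose similarity to the target meets the threshold.
public class ThresholdNeighborhoodSketch {
    static List<Integer> neighborhood(double[] similarities, double threshold) {
        List<Integer> neighbors = new ArrayList<>();
        for (int userId = 0; userId < similarities.length; userId++) {
            if (similarities[userId] >= threshold) {
                neighbors.add(userId);
            }
        }
        return neighbors;
    }

    public static void main(String[] args) {
        // Similarity of users 0..3 to the target user.
        double[] sims = {0.9, 0.2, 0.7, -0.4};
        System.out.println(neighborhood(sims, 0.5));  // [0, 2]
    }
}
```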
The next step is to create a GenericUserBasedRecommender as shown below:
User Recommender
RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
    public Recommender buildRecommender(DataModel model) throws TasteException {
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
        return new GenericUserBasedRecommender(model, neighborhood, similarity);
    }
};
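For intuition on what the recommender does with the neighborhood, the sketch below estimates a user's preference for an unseen item as a similarity-weighted average of the neighbors' preferences for it. This is a simplified illustration of the general idea behind user-based recommendation; Mahout's GenericUserBasedRecommender differs in its details.

```java
// Library-free sketch of user-based preference estimation: weight each
// neighbor's preference for the item by that neighbor's similarity to
// the target user, then normalize by the total weight.
public class WeightedPredictionSketch {
    static double predict(double[] neighborSims, double[] neighborPrefs) {
        double weighted = 0, totalWeight = 0;
        for (int i = 0; i < neighborSims.length; i++) {
            weighted += neighborSims[i] * neighborPrefs[i];
            totalWeight += Math.abs(neighborSims[i]);
        }
        return totalWeight == 0 ? 0 : weighted / totalWeight;
    }

    public static void main(String[] args) {
        // Two neighbors rated the item 4.0 and 2.0; the more similar
        // neighbor (0.9) pulls the estimate toward its rating.
        System.out.println(predict(new double[]{0.9, 0.3}, new double[]{4.0, 2.0}));
    }
}
```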
The next step is to invoke the recommend() method of the Recommender interface. Its parameters are the user id and the number of recommendations. The code below shows the invocation:
User Recommender
try {
    Recommender recommender = recommenderBuilder.buildRecommender(usermodel);
    List<RecommendedItem> recommendations = recommender.recommend(3, 1);
    System.out.println("Recommendations " + recommendations);
    for (RecommendedItem recommendationItem : recommendations) {
        System.out.println(recommendationItem);
    }
} catch (Exception exception) {
    exception.printStackTrace();
}
Below is the complete class, which shows the implementation of RecommendationEngine.
Recommendation Engine
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommendationEngine {
    public static void main(String args[]) {
        try {
            DataModel usermodel = new FileDataModel(new File("userdata.txt"));
            System.out.println(usermodel);
            RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
                public Recommender buildRecommender(DataModel model) throws TasteException {
                    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
                    UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
                    return new GenericUserBasedRecommender(model, neighborhood, similarity);
                }
            };
            Recommender recommender = recommenderBuilder.buildRecommender(usermodel);
            List<RecommendedItem> recommendations = recommender.recommend(3, 1);
            System.out.println("Recommendations " + recommendations);
            for (RecommendedItem recommendationItem : recommendations) {
                System.out.println(recommendationItem);
            }
        } catch (Exception exception) {
            exception.printStackTrace();
        }
    }
}
To compile the code above, you can use the command below:
Recommendation Engine
javac -cp "/Users/bhagvan.kommadi/desktop/mahout-distribution-0.9/*" RecommendationEngine.java
To execute the code, the command below is used.
Recommendation Engine
java -cp "/Users/bhagvan.kommadi/desktop/mahout-distribution-0.9/*:.:/Users/bhagvan.kommadi/desktop/mahout-distribution-0.9/lib/*" RecommendationEngine
The output when the above command is executed is shown below.
Recommendation Engine
apples-MacBook-Air:apache_mahout bhagvan.kommadi$ java -cp "/Users/bhagvan.kommadi/desktop/mahout-distribution-0.9/*:.:/Users/bhagvan.kommadi/desktop/mahout-distribution-0.9/lib/*" RecommendationEngine
log4j:WARN No appenders could be found for logger (org.apache.mahout.cf.taste.impl.model.file.FileDataModel).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
FileDataModel[dataFile:/Users/bhagvan.kommadi/Desktop/JavacodeGeeks/Code/apache_mahout/userdata.txt]
Recommendations [RecommendedItem[item:10, value:1.0]]
RecommendedItem[item:10, value:1.0]
3. Download the Source Code
You can download the full source code of this example here: Apache Hadoop Mahout Tutorial