Apache Hadoop

Hadoop Getmerge Example

In this example, we will look at merging different files into one file in HDFS (Hadoop Distributed File System) using Apache Hadoop, specifically the getmerge command.

1. Introduction

Merging is a task that is required quite often in Hadoop, and in most cases the number of files is too large, or the files themselves are too big, to be merged using text editors or other such tools. That is why Hadoop and HDFS provide a way to merge files from the command line. Before we go further, if you are new to Apache Hadoop and HDFS, it is recommended to go through some introductory articles on both to get a clear picture of what is happening.

Once the basics are clear, we can look into the getmerge command explained in this example.

2. Merging Files

Merging multiple files comes in handy frequently. Sometimes the input files in HDFS are segregated and we want them to be a single file, or a couple of files, instead of many small files. Occasionally the output of a MapReduce job consists of multiple files (when multiple reducers are used) and we want to merge them all, so that we have a single output file from the MapReduce task.

In those cases, the HDFS getmerge command is very useful: it copies the files present at a given path in HDFS, concatenates them, and writes a single file to the provided path on the local filesystem.

Command Syntax:
hadoop fs -getmerge [-nl] <source-path> <local-system-destination-path>

Example:

hadoop fs -getmerge -nl /user/example-task/files /Desktop/merged-file.txt

Command Parameters:
getmerge command can take 3 parameters:

  1. Source Path: the HDFS path to the directory containing all the files that need to be merged into one.
  2. Destination Path: the local filesystem path and name of the file where the merged output needs to be stored.
  3. New Line (-nl): [-nl] is an optional flag which, if included, appends a newline character at the end of each file in the merged result.
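The effect of -nl can be sketched without a cluster: getmerge is essentially an ordered concatenation of the source files, and -nl emits a newline after each one. The following is a minimal local sketch of that behavior (the file names, paths, and contents are made up for illustration):

```shell
# Local sketch of what getmerge does conceptually; no Hadoop required.
# Two small files WITHOUT trailing newlines, to show why -nl matters.
mkdir -p /tmp/getmerge-demo
printf 'File 1' > /tmp/getmerge-demo/file1.txt
printf 'File 2' > /tmp/getmerge-demo/file2.txt

# Without -nl: plain concatenation, so the records run together.
cat /tmp/getmerge-demo/file1.txt /tmp/getmerge-demo/file2.txt \
    > /tmp/getmerge-demo/merged-plain.txt

# With -nl: a newline is emitted after each file, keeping records separate.
for f in /tmp/getmerge-demo/file1.txt /tmp/getmerge-demo/file2.txt; do
  cat "$f"
  printf '\n'
done > /tmp/getmerge-demo/merged-nl.txt

cat /tmp/getmerge-demo/merged-plain.txt   # File 1File 2
cat /tmp/getmerge-demo/merged-nl.txt      # File 1 and File 2 on separate lines
```

This is why -nl matters when the source files do not end with a newline of their own: without it, the last record of one file and the first record of the next get glued into a single line.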

3. HDFS Example

In this section, we will walk step by step through merging files with the getmerge command and present the output.

  1. We will create a folder called TestFiles on the local filesystem and create multiple test files in this folder. Here is the screenshot of the 6 test files.

      TestFiles folder with 6 test files.

    Each file in the test folder is just a simple text file with a single line containing the number of the file, as shown in the screenshot below for the first file:

    Test File 1
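    The test data from step 1 can also be created from the shell instead of an editor. Here is a quick sketch; it writes under /tmp so it is runnable anywhere, whereas the article uses Desktop/TestFiles, and the exact one-line contents shown in the screenshots may differ:

```shell
# Create the TestFiles folder with 6 one-line test files, as in step 1.
mkdir -p /tmp/TestFiles
for i in 1 2 3 4 5 6; do
  echo "File $i" > "/tmp/TestFiles/file$i.txt"
done
ls /tmp/TestFiles    # file1.txt ... file6.txt
```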
  2. The next step is to upload the files to the Hadoop Distributed File System (HDFS). We will use the copyFromLocal HDFS command to copy the files from the local filesystem to HDFS.
    hadoop fs -copyFromLocal Desktop/TestFiles /user/cloudera
    

    This will copy the whole TestFiles folder, with all 6 files, from Desktop/TestFiles to /user/cloudera.

    Next, use ls as shown below to check whether the folder was created and the files were copied to the destination successfully:

    hadoop fs -ls
    hadoop fs -ls /user/cloudera/TestFiles
    

    Copying files from local filesystem to HDFS.
  3. We can also check the content of the files copied into HDFS using the cat command. This is an optional step to make sure the content is fine; you can skip it if you wish.
    hadoop fs -cat /user/cloudera/TestFiles/file1.txt
    

    In the screenshot below, the content of all 6 files is shown:

    Content of all 6 test files.
  4. Now it is time to see how the getmerge command works. Execute the following command:
    hadoop fs -getmerge -nl /user/cloudera/TestFiles Desktop/MergedFile.txt
    

    This command will merge all 6 files present in the /user/cloudera/TestFiles folder into one file, which will be saved to Desktop/MergedFile.txt on the local filesystem. Because we used the -nl flag, the merged file will have a newline character appended after the content of each file.

    The following screenshot shows the command in action; the command prints nothing when it succeeds.

    getmerge command in action.
  5. The following screenshot shows the final merged output file at Desktop/MergedFile.txt:

    Output of getmerge command named MergedFile.
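    A couple of quick sanity checks on the merged result can confirm that the merge worked. The sketch below builds a stand-in MergedFile.txt locally so it is runnable anywhere; on a real setup you would point the same commands at Desktop/MergedFile.txt instead:

```shell
# Stand-in for the merged output (6 one-line records, as in the example).
printf 'File %s\n' 1 2 3 4 5 6 > /tmp/MergedFile.txt

# After a getmerge -nl of 6 one-line files, one line per source file is expected.
wc -l < /tmp/MergedFile.txt          # 6
# Every record from the source files should be present.
grep -c '^File ' /tmp/MergedFile.txt   # 6
```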

4. Summary

In this example article, we looked into the HDFS command getmerge in detail. We started by understanding the syntax of the command and all its parameters. We then had a step-by-step look at how files are merged in HDFS using the command.

To summarize, the getmerge command takes 3 parameters, i.e. a source path, a destination path, and an optional new-line flag. We also used a couple of other commands during the example, such as ls, cat and copyFromLocal.

Let me know in the comments if you need any more details or if anything is not clear.

Raman Jhajj

Ramaninder graduated from the Department of Computer Science and Mathematics of Georg-August University, Germany and currently works with a Big Data Research Center in Austria. He holds an M.Sc in Applied Computer Science with a specialization in Applied Systems Engineering and a minor in Business Informatics. He is also a Microsoft Certified Professional with more than 5 years of experience in Java, C#, Web development and related technologies. Currently, his main interests are in the Big Data ecosystem, including batch and stream processing systems, Machine Learning and Web Applications.