Tutorial: How to use AWS Lambda with S3 for real-time data processing

A company recently asked me to develop a solution to receive batch data from a third party data vendor, run business logic against the data, and store the resulting values in a database. This demo will show how to build such a process using AWS and Java. We will receive a sample grade book via a batch file in AWS S3, calculate an overall grade for each student in the grade book using an AWS Lambda function and store the results in a DynamoDB table.

1. Introduction

Why use AWS Lambda? Lambda (not to be confused with the lambda expressions introduced in Java 8) is a serverless compute platform (Function as a Service) where our Java code will run. As you might expect, there is still a server, but the responsibility for it rests with AWS: provisioning, monitoring, patching, scaling, and other infrastructure activities are managed for you. Once code is deployed, it can be configured to run automatically when an external event occurs, such as a RESTful call or the arrival of a new S3 object. For services that only need to run on occasion and in response to a trigger, Lambda is an excellent solution.

Compared to cloud compute platforms like AWS EC2, Lambda is a good fit when you don’t need control of server instances, when code only runs in response to another event, or when code runs infrequently. For example, Netflix uses Lambda to help automate the encoding process of media files.

S3 (Simple Storage Service) is a storage solution for bulk data. Amazon describes it as “secure, durable, highly-scalable cloud storage.” Common uses include hosting static websites, storing big data objects, and holding objects for processing by other AWS services.

DynamoDB is a NoSQL database which is extremely fast, with single-digit millisecond latency, and delivers consistent read and write performance. Like Lambda, it’s fully managed and scalable. It is a good database choice for simple read and write processes that do not need complex joins.

2. Setup

First you will need to get an AWS account and set up the CLI. Amazon has instructions to do this at http://docs.aws.amazon.com/lambda/latest/dg/setup.html

Additionally, we will use the Eclipse IDE with the AWS Toolkit for Eclipse plugin. Amazon has setup instructions here: http://docs.aws.amazon.com/toolkit-for-eclipse/v1/user-guide//getting-started.html

Once you have set up the CLI and your adminuser account, log in at https://signin.aws.amazon.com/console

2.1 S3

We will create two buckets, one to store the raw data to be processed and a second to store our Java code.

After logging in to the AWS console, select from the top menu:

  • AWS > Storage and Content Delivery > S3
  • Select the ‘Create Bucket’ button

Amazon policy allows for names with lowercase letters, numbers, periods (.), and hyphens (-). Your bucket must have a unique name across all of AWS. Amazon recommends DNS compliant names.
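If you want to sanity-check a candidate name before typing it into the console, the naming rules can be sketched as a small local check. This is only an illustration: BucketNames and looksValid are hypothetical names, not part of the AWS SDK. The length limit (3 to 63 characters), the requirement to start and end with a letter or number, and the ban on adjacent periods come from Amazon’s general bucket naming guidelines rather than from this tutorial. A local check also cannot tell you whether the name is already taken.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of the AWS SDK) that pre-checks a bucket name locally. */
final class BucketNames {

    // 3-63 characters; lowercase letters, digits, periods and hyphens only;
    // the first and last character must be a letter or a digit.
    private static final Pattern ALLOWED =
            Pattern.compile("^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$");

    private BucketNames() {
    }

    /** Returns true when the name passes the basic syntax rules. Uniqueness is not checked. */
    static boolean looksValid(String name) {
        if (name == null || !ALLOWED.matcher(name).matches()) {
            return false;
        }
        // Two periods in a row are not allowed.
        return !name.contains("..");
    }

    public static void main(String[] args) {
        System.out.println(looksValid("yournames3gradebookexample")); // prints true
        System.out.println(looksValid("YourName_Gradebook"));         // prints false
    }
}
```

The second name fails because it contains uppercase letters and an underscore.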

For this tutorial I suggest creating your first bucket with the name yournames3gradebookexample. My examples use US West (Oregon), also referred to as us-west-2. Choosing a region involves a number of considerations that we will not explore in this tutorial, including cost, SLA, and latency; for this demo, US West (Oregon) is a good default.
There is no need to copy any settings; choose the ‘Create’ button on the bottom left.

Create S3 bucket

Repeat the same steps and create a second bucket to store code. Call this yournamelambdacode. Create this bucket in the same region.

2.2 DynamoDB

Similar to the first steps above, select from the top menu:

  • AWS > Database > DynamoDB
  • Select ‘Create Table’
  • For Table Name, enter ‘Students’.
  • For Partition Key, enter ‘StudentID’ and set the type to Number.
    • The sort key option is only needed if the partition key can have duplicate entries. We won’t need one for our example.
  • Leave the ‘default settings’ option checked.
  • Select the ‘Create’ button at the bottom of the page.

Create DynamoDB table

2.3 Lambda

  • From Eclipse, choose File > New > Project… > AWS > AWS Lambda Java Project
  • Add a project name. I called mine Gradebook. This is only used for your local file system.
  • Group ID and Artifact ID are Maven specific. It is ok to use the default if you’re unsure what these are for. I used the names com.zackroppel.lambda and gradebook for these.
  • Package name defaults to a combination of Group ID and Artifact ID; you don’t need to change this.
  • Call your class LambdaGradebook.
  • Set Input Type to S3 Event.
  • Click Finish.

Create Lambda function

What you now have is a simple Lambda example which can be deployed to AWS. In the src/main/java directory of your project, you’ll see your LambdaGradebook class. This class implements the AWS RequestHandler interface and overrides handleRequest(), the single method in that interface. As generated, this method looks up an object in a given S3 bucket and returns a String containing its content type (we will see the input for this shortly).

We will first test to make sure the skeleton setup is complete and allows us to read from S3 in our Lambda function, and further in the tutorial we will make modifications to this class to calculate grades and store results in DynamoDB.

2.4 IAM Role

You will need to create an IAM role for your Lambda function. The role sets the access levels that your Lambda function has within AWS. To do this:

  • Go to the AWS console > IAM
  • Select Roles > Create New Role
  • In the dialog for Select role type, choose AWS Lambda in the AWS Service Role category
  • Attach Policy > Select boxes for both AWSLambdaExecute and AmazonDynamoDBFullAccess
  • Call your role lambda-s3-execution-role
  • Create the role.

You have now created a role with full access on DynamoDB resources and the ability to read and write S3 resources.

Attach IAM Policies

2.5 Deploy Lambda code to AWS

In Eclipse, right click your code and select Amazon Web Services > Upload function to Lambda…

Upload Function to Lambda

In the dialog:

  • Make sure the region is consistent with what you chose for S3.
  • Create a new Lambda function: Gradebook
  • Click to the next page
  • Description is optional
  • Handler should be preselected, leave as is
  • Check that your IAM role matches the lambda-s3-execution-role set in step 2.4
  • Match the S3 bucket with the previously created yournamelambdacode.

Lambda Function Configuration

Once your function uploads successfully, you’ll see the name of your Lambda function in brackets, appended to your project name in the Eclipse project explorer. If you return to your AWS Console for Lambda, you’ll see the function you just created.

Lambda Function Uploaded

2.6 Input file

We need to create a sample CSV file to work with. Each line holds a student ID followed by that student’s scores. In a text editor, create a file with the following two lines:

100,91,88,79,99
101,88,75,90,83

Save the file in your local filesystem as grades.csv and upload it via the AWS Console at S3 > yournames3gradebookexample > Upload > Add Files > Upload.
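Before uploading a larger file, it can help to confirm that every row has the shape the function will expect: a numeric student ID followed by at least one numeric score. The following is a minimal sketch of such a check, using hypothetical names (GradebookRows, isWellFormed) that are not part of the tutorial’s project.

```java
/** Hypothetical pre-upload check for gradebook rows such as "100,91,88,79,99". */
final class GradebookRows {

    private GradebookRows() {
    }

    /** Returns true when the row is an integer ID followed by one or more integer scores. */
    static boolean isWellFormed(String line) {
        if (line == null) {
            return false;
        }
        // The -1 limit keeps trailing empty fields, so "100,91," is rejected.
        String[] fields = line.split(",", -1);
        if (fields.length < 2) {
            return false;
        }
        for (String field : fields) {
            try {
                Integer.parseInt(field.trim());
            } catch (NumberFormatException e) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("100,91,88,79,99")); // prints true
        System.out.println(isWellFormed("100,A,88"));        // prints false
    }
}
```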

2.7 Manual job processing

Now we’ll invoke the function manually to confirm that the setup is working. In the AWS console, go to Lambda > Gradebook > Actions > Configure Test Event

In the dialog that appears, choose a sample event template of S3 Put from the dropdown. Then make the following edits as shown in the sample below:

Test Configuration

  • In the lines for bucket ARN and bucket name, update the sample bucket with your first S3 bucket yournames3gradebookexample.
  • Update your S3 key to grades.csv.
  • Make sure your AWS region matches as previously used.
  • Click ‘Save and test’.

The application will run and you will see a successful result which logs the content type as “text/csv” on the second line of the log output.

Successful Test

2.8 Gradebook logic

We’re going to modify our code to calculate grades and store the results in a DynamoDB table. Update the class as follows, keeping the package declaration that matches your own project:

package com.yourname.lambda.gradebook;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import com.amazonaws.regions.Regions;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.Table;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;

public class LambdaGradebook implements RequestHandler<S3Event, String> {

	private AmazonS3 s3 = AmazonS3ClientBuilder.standard().build();
	private AmazonDynamoDB client = AmazonDynamoDBClientBuilder.standard().withRegion(Regions.US_WEST_2).build();
	private DynamoDB dynamoDB = new DynamoDB(client);

	private static String tableName = "Students";

	public LambdaGradebook() {
	}

	// Test purpose only.
	LambdaGradebook(AmazonS3 s3) {
		this.s3 = s3;
	}

	@Override
	public String handleRequest(S3Event event, Context context) {
		context.getLogger().log("Received event: " + event);

		// Get the object from the event and show its content type
		String bucket = event.getRecords().get(0).getS3().getBucket().getName();
		String key = event.getRecords().get(0).getS3().getObject().getKey();
		try {
			S3Object response = s3.getObject(new GetObjectRequest(bucket, key));
			String contentType = response.getObjectMetadata().getContentType();

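			// Read the CSV line by line; try-with-resources closes the S3 stream when done
			try (BufferedReader br = new BufferedReader(new InputStreamReader(response.getObjectContent()))) {
				String csvOutput;
				while ((csvOutput = br.readLine()) != null) {
					String[] str = csvOutput.split(",");
					// Skip blank or ID-only lines, which would otherwise divide by zero
					if (str.length < 2) {
						continue;
					}
					// Column 0 is the student ID; the remaining columns are scores
					int total = 0;
					for (int i = 1; i < str.length; i++) {
						total += Integer.parseInt(str[i].trim());
					}
					// Integer division: any fractional part of the average is dropped
					int average = total / (str.length - 1);
					createDynamoItem(Integer.parseInt(str[0].trim()), average);
				}
			}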
			return contentType;
		} catch (IOException e) {
			e.printStackTrace();
			context.getLogger().log(String.format("Error getting object %s from bucket %s. Make sure they exist and"
					+ " your bucket is in the same region as this function.", key, bucket));
			return e.toString();
		}

	}

	private void createDynamoItem(int studentId, int grade) {

		Table table = dynamoDB.getTable(tableName);
		try {

			Item item = new Item().withPrimaryKey("StudentID", studentId).withInt("Grade", grade);
			table.putItem(item);

		} catch (Exception e) {
			System.err.println("Create item failed.");
			System.err.println(e.getMessage());

		}
	}
}

The logic now computes an average grade for each student and calls the method createDynamoItem() to add the student ID and grade to the Students table. The average uses integer division, so any fractional part is dropped.
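To check the arithmetic without deploying anything, the per-line calculation can be pulled out into a plain Java method and run locally. This is a sketch under that assumption: GradeCalculator, studentId, and average are hypothetical names, and the class is not part of the generated project.

```java
/** Hypothetical local version of the per-line grade calculation, with no AWS dependencies. */
final class GradeCalculator {

    private GradeCalculator() {
    }

    /** Reads the student ID from the first column of a row such as "100,91,88,79,99". */
    static int studentId(String csvLine) {
        return Integer.parseInt(csvLine.split(",")[0].trim());
    }

    /** Averages every column after the first, using integer division as the handler does. */
    static int average(String csvLine) {
        String[] fields = csvLine.split(",");
        if (fields.length < 2) {
            throw new IllegalArgumentException("Expected an ID and at least one score: " + csvLine);
        }
        int total = 0;
        for (int i = 1; i < fields.length; i++) {
            total += Integer.parseInt(fields[i].trim());
        }
        return total / (fields.length - 1);
    }

    public static void main(String[] args) {
        // The two sample rows from grades.csv
        System.out.println(studentId("100,91,88,79,99") + " -> " + average("100,91,88,79,99")); // prints 100 -> 89
        System.out.println(studentId("101,88,75,90,83") + " -> " + average("101,88,75,90,83")); // prints 101 -> 84
    }
}
```

For student 100 the scores sum to 357, and 357 / 4 is 89.25, which integer division reduces to 89. These are the values you should expect to see in the Students table.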

Repeat the steps in 2.5 using the existing Lambda function Gradebook. Return to the AWS console in your browser. Go to Lambda > Functions > Gradebook > Test. The test configuration previously used will be run again.

In the AWS console, go to DynamoDB > Tables > Students > Items. You should see student records as shown below:

DynamoDB Students

2.9 Automation

Now that we have a working pipeline, we want to automate processing of data when new CSV files are received. Run the following CLI command in your terminal, replacing the values in parentheses:

aws lambda add-permission \
--function-name Gradebook \
--region (us-west-2) \
--statement-id (gradebook-unique-id) \
--action lambda:InvokeFunction \
--principal s3.amazonaws.com \
--source-arn arn:aws:s3:::(yournames3gradebookexample) \
--source-account (bucket-owner-account-id) \
--profile adminuser

  • Check that your region matches the one used for your buckets and function.
  • Statement ID can be any identifier that is unique within the function’s policy; the example value works as is.
  • Append your source bucket name (the bucket that receives the CSV files) to the end of source-arn.
  • Set source-account to your account ID.
    • Your account ID is a 12-digit number which appears on the top right side when you log in to the browser console. Enter it without dashes.
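The two values most likely to be mistyped are the source ARN and the account ID. As a rough illustration of the format each one takes, here is a small sketch that builds them from the raw inputs. PermissionArgs and its methods are hypothetical helpers for this tutorial, not part of the AWS CLI or SDK.

```java
/** Hypothetical helper that formats the add-permission values described above. */
final class PermissionArgs {

    private PermissionArgs() {
    }

    /** S3 bucket ARNs have no region or account segment: arn:aws:s3:::bucket-name. */
    static String bucketArn(String bucketName) {
        return "arn:aws:s3:::" + bucketName;
    }

    /** Strips dashes and checks that exactly 12 digits remain. */
    static String normalizeAccountId(String raw) {
        String digits = raw.trim().replace("-", "");
        if (!digits.matches("\\d{12}")) {
            throw new IllegalArgumentException("Account ID must contain exactly 12 digits: " + raw);
        }
        return digits;
    }

    public static void main(String[] args) {
        // Placeholder inputs; substitute your own bucket name and account ID.
        System.out.println(bucketArn("yournames3gradebookexample")); // prints arn:aws:s3:::yournames3gradebookexample
        System.out.println(normalizeAccountId("1234-5678-9012"));    // prints 123456789012
    }
}
```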

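Repeat step 2.6 with a new .csv file that has different student IDs (I added 103 and 104 to mine). Uploading the file should now trigger the function on its own. Note that add-permission only authorizes S3 to invoke the function; if nothing happens on upload, check that the yournames3gradebookexample bucket has an event notification (for example, on object-created events) that targets the Gradebook function.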
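One detail to keep in mind once real uploads drive the function: S3 URL-encodes object keys in event notifications, so a file named "spring grades.csv" arrives as "spring+grades.csv". The handler above passes the key straight to getObject(), which works for grades.csv but would miss a file whose name contains spaces or special characters. A minimal sketch of decoding the key first, using a hypothetical S3Keys helper that is not part of the tutorial’s project:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

/** Hypothetical helper that decodes the URL-encoded object key from an S3 event. */
final class S3Keys {

    private S3Keys() {
    }

    static String decode(String rawKey) {
        try {
            // '+' becomes a space and %XX sequences become the characters they encode.
            return URLDecoder.decode(rawKey, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java runtime is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("grades.csv"));        // prints grades.csv
        System.out.println(decode("spring+grades.csv")); // prints spring grades.csv
    }
}
```

In the handler, this would mean wrapping the value returned by getKey() before building the GetObjectRequest.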

Verify in your DynamoDB console that your new records are in the table:

Students DynamoDB Table with 4 Values

3. Summary

We have set up an AWS environment from scratch that takes batch files, applies business logic to them, and saves the results in a NoSQL database. We did this without taking on any responsibility for the infrastructure, and we deployed our code directly from the IDE using the AWS Toolkit for Eclipse. The result is a highly available and scalable solution that runs automatically as needed and allows us to focus on our code.

4. Download The Source Code

You can download the full source code of this example here: Gradebook

Zack Roppel

Zack Roppel is a senior software engineer based out of Seattle, Washington. He holds a degree in Computer Science from California State University, Sacramento. He is fascinated by big data, cloud, machine learning, server-side JavaScript and all things Java.