Spring Batch Introduction Example
In this post, we feature a comprehensive introduction to Spring Batch. Many enterprise applications need bulk processing to perform their business operations. These operations typically include time-based events or complex business rules applied across very large data sets. Batch processing is used to handle these workloads efficiently. In this post, we will look at Spring Batch as a solution for these batch processing needs.
1. Spring Batch Introduction
Spring Batch is a lightweight, comprehensive batch framework which builds upon the POJO-based development approach. Spring Batch provides reusable functions that are essential in processing large volumes of records, including logging/tracing, transaction management, job processing statistics, job and resource management. Spring Batch is designed to work in conjunction with various commercial and open-source schedulers such as Quartz, Tivoli, Control-M, etc.
Spring Batch follows a layered architecture with three major components – Application, Batch Core and Batch Infrastructure. Application is the client code written by developers to achieve the intended functionality. The Batch Core contains the core runtime classes necessary to launch and control a batch job while the infrastructure contains common services needed for the Batch core and Application.
Let’s start with a simple batch processing use case in the next section. Before that, we will look at the stack involved in creating the example. We will use Maven for managing the build and dependencies, with Java 8 as the programming language. All the dependencies required for the example are listed in Maven’s pom.xml given below.
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.1.7.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>
    <groupId>com.jcg</groupId>
    <artifactId>springBatch</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>springBatch</name>
    <description>Demo project for Spring Batch</description>
    <properties>
        <java.version>1.8</java.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-batch</artifactId>
        </dependency>
        <dependency>
            <groupId>org.hsqldb</groupId>
            <artifactId>hsqldb</artifactId>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>
</project>
- This Maven configuration declares Spring Boot Starter Parent as the parent project, with the version set to 2.1.7.RELEASE. All the other Spring dependencies inherit their versions from the parent.
- The Java version for the project is set to 1.8.
- Spring Batch, the topic of our example, is added through the spring-boot-starter-batch dependency.
- Spring Batch requires job metadata, such as start and end times, to be saved into a persistent store. For this purpose, HSQLDB is specified as a dependency. It runs here as an embedded database, so the saved information is destroyed when the application exits. The tables required for maintaining the job information are created automatically at startup.
2. Batch Example
A typical Spring Batch Job involves a Reader, a Writer and, optionally, a Processor. A Processor is used when we need to apply business rules to the data that was read. Alternatively, a step can be built around a Tasklet, which we will cover in a later section.
In this section, we will consume a movie JSON dataset and write it to a CSV file. We first look at the Movie entity, which helps in understanding the JSON structure.
Movie.java
package com.jcg.springBatch.entity;

import java.util.List;

public class Movie {
    private String title;
    private long year;
    // Element types restored as String: cast and genres are lists of names in the dataset.
    private List<String> cast;
    private List<String> genres;

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public void setYear(long year) {
        this.year = year;
    }

    public void setCast(List<String> cast) {
        this.cast = cast;
    }

    public List<String> getGenres() {
        return genres;
    }

    public void setGenres(List<String> genres) {
        this.genres = genres;
    }
}
- The Movie class has four fields:
  - title: the movie name
  - year: the year in which the movie was released
  - cast: the actors in the movie
  - genres: the genres of the movie, such as Action, Comedy and Thriller
- movies.json is a public dataset obtained from GitHub.
We will create a Spring Boot application capable of running the Spring Batch Job. Our job is going to read all the movies and output a CSV file containing each movie and its corresponding genres.
Application.java
package com.jcg.springBatch;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class Application {

    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }
}
- This is a typical Spring Boot application setup, where we annotate the class with @SpringBootApplication to enable Spring Boot.
- Spring Boot takes an opinionated view of the Spring platform and third-party libraries. Most Spring Boot applications need very little Spring configuration, which reduces development time.
In the sections below, we will see the various steps involved in configuring the batch job. We are going to break the Java class BatchConfiguration into several snippets to make it easier to follow.
BatchConfiguration.java
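import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class BatchConfiguration {

    @Autowired
    JobBuilderFactory jobBuilderFactory;

    @Autowired
    StepBuilderFactory stepBuilderFactory;
}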
- The class is annotated with @Configuration so that Spring Boot processes it as a configuration class. Previously such configuration lived in XML files, but Spring Boot favors Java configuration.
- The other annotation, @EnableBatchProcessing, enables the Spring Batch infrastructure, including the two builder factories injected here.
- We have two builders specified:
  - JobBuilderFactory: used to build the movie Job. In Spring Batch, Job is the top-level abstraction. A Job represents the business functionality which needs to be achieved.
  - StepBuilderFactory: used to build the steps involved in the Job. A job can contain multiple steps, with each step fulfilling a particular task. For our simple Job, we have only one step.
A step is where all the action begins. As indicated at the top of the section, a step contains the three components ItemReader, ItemProcessor and ItemWriter. Spring provides out-of-the-box readers and writers for various file formats. Considering our JSON dataset, we will look at the JsonItemReader below.
ItemReader
@Bean
public JsonItemReader<Movie> jsonItemReader() throws MalformedURLException {
    // The type parameter tells the reader which entity each JSON object maps to.
    return new JsonItemReaderBuilder<Movie>()
            .jsonObjectReader(new JacksonJsonObjectReader<>(Movie.class))
            .resource(new UrlResource(
                    "https://raw.githubusercontent.com/prust/wikipedia-movie-data/master/movies.json"))
            .name("movieJsonItemReader")
            .build();
}
- Spring follows the builder pattern, where we provide the various pieces of input required to build the entire object.
- We load the JSON data from the URL by specifying a UrlResource as input.
- We also specify the Movie entity as the type to which the data has to be transformed.
- The remaining configuration just provides a suitable name for the reader.
Once the reader reads the data, it is available to be consumed by the further components in the step. In our step, we have a custom processor which processes the data from the reader.
ItemProcessor
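@Bean
public ItemProcessor<Movie, MovieGenre> movieListItemProcessor() {
    // Input type Movie, output type MovieGenre; the genre list is stored in its string form.
    return movie -> new MovieGenre(movie.getTitle(), movie.getGenres().toString());
}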
- The processor is written as an inline lambda.
- It takes in each movie and converts it to another entity, MovieGenre, which has two fields:
  - title: the movie name
  - genre: the genres as a single string (the list's string form, such as [Short, Documentary]) instead of a List
The MovieGenre class is listed below and is self-explanatory.
MovieGenre.java
package com.jcg.springBatch.entity;

public class MovieGenre {
    private String genre;
    private String title;

    public MovieGenre(String title, String genre) {
        this.genre = genre;
        this.title = title;
    }

    public String getGenre() {
        return genre;
    }

    public String getTitle() {
        return title;
    }
}
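The processor relies on List.toString(), which is why the genre value keeps its square brackets. The following plain-Java sketch contrasts that with an explicit join. The class GenreFormatSketch and the "|" separator are illustrative assumptions and are not part of the example project.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not part of the example project.
final class GenreFormatSketch {

    // Same conversion the processor performs: the list's default string form.
    static String bracketed(List<String> genres) {
        return genres.toString();
    }

    // Alternative: join the values with a separator that differs from the CSV delimiter.
    static String joined(List<String> genres, String separator) {
        return String.join(separator, genres);
    }

    public static void main(String[] args) {
        List<String> genres = Arrays.asList("Short", "Documentary");
        System.out.println(bracketed(genres));                        // [Short, Documentary]
        System.out.println(joined(genres, "|"));                      // Short|Documentary
        System.out.println(bracketed(Collections.<String>emptyList())); // []
    }
}
```

Joining with a separator other than a comma would keep each record at exactly two comma-separated columns, at the cost of changing the output format shown in this article.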
Now we come to the final component in the step – ItemWriter.
ItemWriter
@Bean
public FlatFileItemWriter<MovieGenre> movieGenreWriter() {
    return new FlatFileItemWriterBuilder<MovieGenre>()
            .name("movieGenreWriter")
            .resource(new FileSystemResource("out/movies.csv"))
            .delimited()
            .delimiter(",")
            // Bean properties of MovieGenre to write, in column order.
            .names(new String[]{"title", "genre"})
            .build();
}
- We use FlatFileItemWriter to write the output to a CSV file, which is specified as the resource.
- We specify the delimiter to be used within a line; it can be a space or any other character. Since this is a CSV file, a comma is specified as the delimiter.
- The names of the entity properties to write as columns are passed to the names argument.
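Conceptually, a delimited writer takes the named properties of each item and joins them with the delimiter. The plain-Java sketch below imitates that idea. DelimitedLineSketch and toLine are hypothetical names, and the code does not reflect how FlatFileItemWriter is implemented internally.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical sketch of delimited-line building; not Spring Batch code.
final class DelimitedLineSketch {

    // Joins the string form of each field with the given delimiter.
    static String toLine(String delimiter, Object... fields) {
        return Arrays.stream(fields)
                .map(field -> String.valueOf(field))
                .collect(Collectors.joining(delimiter));
    }

    public static void main(String[] args) {
        // Field order mirrors names("title", "genre"): title first, then genre.
        System.out.println(toLine(",", "Caught", "[]"));
        System.out.println(toLine(",", "Capture of Boer Battery by British", "[Short, Documentary]"));
    }
}
```

Note that the values are joined as they are, without quoting, so a comma inside a value (as in [Short, Documentary]) appears as an extra column to a strict CSV parser.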
All of these components are Bean definitions specified in the configuration class. Now, a Step definition is the one which glues together all of these components.
MovieStep
@Bean
public Step movieStep() throws MalformedURLException {
    return stepBuilderFactory
            .get("movieStep")
            .<Movie, MovieGenre>chunk(10)
            .reader(jsonItemReader())
            .processor(movieListItemProcessor())
            .writer(movieGenreWriter())
            .build();
}
- Spring Batch processes the records (items) in chunks. We specify a chunk size of 10, which is the number of items collected from the ItemReader before they are written out together.
- The input type (what the reader produces) and the output type (what the writer consumes) are specified explicitly on the step.
- The items are fed to the processor one by one, but the output from the processor is aggregated and sent to the writer in chunks of the specified size.
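The chunk mechanics described above can be sketched in plain Java. This is a simplified model under stated assumptions: ChunkLoopSketch is a hypothetical class, the reader is modeled as an Iterator, and transactions, skipping and restart handling are left out entirely.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical model of chunk-oriented processing; not Spring Batch's implementation.
final class ChunkLoopSketch {

    // Reads up to chunkSize items, processes them one by one, then writes them together.
    // Returns the number of write calls that were made.
    static <I, O> int run(Iterator<I> reader, Function<I, O> processor,
                          Consumer<List<O>> writer, int chunkSize) {
        int writes = 0;
        while (reader.hasNext()) {
            List<I> items = new ArrayList<>();
            while (items.size() < chunkSize && reader.hasNext()) {
                items.add(reader.next()); // the reader hands over one item per call
            }
            List<O> processed = new ArrayList<>();
            for (I item : items) {
                processed.add(processor.apply(item)); // the processor sees one item at a time
            }
            writer.accept(processed); // the writer receives the whole chunk
            writes++;
        }
        return writes;
    }

    public static void main(String[] args) {
        List<Integer> input = new ArrayList<>();
        for (int i = 1; i <= 25; i++) {
            input.add(i);
        }
        List<Integer> chunkSizes = new ArrayList<>();
        Function<Integer, String> processor = i -> "item-" + i;
        Consumer<List<String>> writer = chunk -> chunkSizes.add(chunk.size());
        int writes = run(input.iterator(), processor, writer, 10);
        System.out.println(writes + " writes with sizes " + chunkSizes); // 3 writes with sizes [10, 10, 5]
    }
}
```

With 25 items and a chunk size of 10, the writer is called three times, and the last chunk carries the remaining 5 items.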
The final component is the MovieJob which is explained below
MovieJob
@Bean
public Job movieJob(Step movieStep) {
    return jobBuilderFactory.get("movieJob")
            .incrementer(new RunIdIncrementer())
            .flow(movieStep)
            .end()
            .build();
}
- A Spring Batch Job can run multiple times. To differentiate each run of the job, Spring Batch provides a RunIdIncrementer, which increments the run.id job parameter every time the job is run.
- flow(movieStep) starts the job's flow with movieStep as its first step. Other execution flows can also be provided here.
Now, to execute the job, run the Application class. A CSV file similar to the one below is generated.
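The idea behind a run id incrementer can be sketched in plain Java: take the parameters of the previous run and return a copy whose run.id is one higher, starting at 1. RunIdSketch is a hypothetical class, and parameters are modeled as a simple map, which is an assumption made only for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a run id incrementer; not Spring Batch's RunIdIncrementer.
final class RunIdSketch {

    static final String KEY = "run.id";

    // Returns a copy of the previous parameters with run.id increased by one (1 when absent).
    static Map<String, Long> next(Map<String, Long> previous) {
        Map<String, Long> next = new HashMap<>(previous);
        next.put(KEY, previous.getOrDefault(KEY, 0L) + 1);
        return next;
    }

    public static void main(String[] args) {
        Map<String, Long> first = next(new HashMap<>());
        Map<String, Long> second = next(first);
        System.out.println(first);  // {run.id=1}
        System.out.println(second); // {run.id=2}
    }
}
```

Because every run then carries a different parameter value, each launch is distinguishable from the previous ones.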
movies.csv
After Dark in Central Park,[]
Boarding School Girls' Pajama Parade,[]
Buffalo Bill's Wild West Parad,[]
Caught,[]
Clowns Spinning Hats,[]
Capture of Boer Battery by British,[Short, Documentary]
The Enchanted Drawing,[]
Feeding Sea Lions,[]
....
However, the file carries no information about what each column holds. To add column headings, FlatFileItemWriter supports a header callback, which can be set on the builder as .headerCallback(writer -> writer.write("Movie Title,Movie Genres")). The header is written before any of the records.
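A header callback is simply a function that receives the output writer once, before any record is written. The plain-Java sketch below models that ordering. HeaderCallbackSketch and its nested HeaderCallback interface are hypothetical stand-ins defined only for this illustration, not the Spring Batch types.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of a header callback; not the Spring Batch interface.
final class HeaderCallbackSketch {

    @FunctionalInterface
    interface HeaderCallback {
        void writeHeader(Writer writer) throws IOException;
    }

    // Invokes the callback once, then writes each record on its own line.
    static String write(HeaderCallback header, List<String> lines) {
        StringWriter out = new StringWriter();
        try {
            header.writeHeader(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        out.write("\n");
        for (String line : lines) {
            out.write(line);
            out.write("\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String csv = write(writer -> writer.write("Movie Title,Movie Genres"),
                Arrays.asList("Caught,[]", "Feeding Sea Lions,[]"));
        System.out.print(csv);
    }
}
```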
2.1 Listener
In the previous section, we saw the batch processing capability of Spring. But after the job completed, we did not get any statistics about the job or the step. Spring Batch provides listener interfaces with which we can hook into the lifecycle of the job. We will see the example of a StepExecutionListener, which is executed before and after the step.
Listener
@Bean
public StepExecutionListener movieStepListener() {
    return new StepExecutionListener() {

        @Override
        public void beforeStep(StepExecution stepExecution) {
            stepExecution.getExecutionContext().put("start", new Date().getTime());
            System.out.println("Step name:" + stepExecution.getStepName() + " Started");
        }

        @Override
        public ExitStatus afterStep(StepExecution stepExecution) {
            long elapsed = new Date().getTime()
                    - stepExecution.getExecutionContext().getLong("start");
            System.out.println("Step name:" + stepExecution.getStepName()
                    + " Ended. Running time is " + elapsed + " milliseconds.");
            System.out.println("Read Count:" + stepExecution.getReadCount()
                    + " Write Count:" + stepExecution.getWriteCount());
            return ExitStatus.COMPLETED;
        }
    };
}
- In the beforeStep method, we obtain the step name and log it to the console.
- We store the start time in the step's ExecutionContext, which is similar to a map with string keys and can take any object as the value.
- In the afterStep method, we log the running time using the start time stored in the ExecutionContext.
- We log the read record count and write record count for the step, which was the original intention of adding the listener.
We have only defined the listener so far; it is not yet associated with the step we created. Let us see how to associate the listener with movieStep.
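The before/after pattern used by the listener can be modeled in a few lines of plain Java: one callback runs before the work, one after, and both share a context map. StepListenerSketch and its Listener interface are hypothetical and only mimic the shape of the real listener.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a step listener; not the Spring Batch interface.
final class StepListenerSketch {

    interface Listener {
        void before(Map<String, Object> context);
        void after(Map<String, Object> context);
    }

    // Runs the work between the two callbacks and returns the shared context.
    static Map<String, Object> runStep(Runnable work, Listener listener) {
        Map<String, Object> context = new HashMap<>();
        listener.before(context);
        work.run();
        listener.after(context);
        return context;
    }

    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        runStep(() -> events.add("work"), new Listener() {
            @Override
            public void before(Map<String, Object> context) {
                context.put("start", System.nanoTime()); // stored for the after callback
                events.add("before");
            }

            @Override
            public void after(Map<String, Object> context) {
                long elapsedNanos = System.nanoTime() - (Long) context.get("start");
                events.add("after");
                System.out.println(events + " took " + elapsedNanos + " ns");
            }
        });
    }
}
```

The value stored under "start" in the first callback is what lets the second callback compute the elapsed time without any shared field.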
Listener to Step
@Bean
public Step movieStep() throws MalformedURLException {
    return stepBuilderFactory
            .get("movieStep")
            .listener(movieStepListener())
            // Input and output types are required so the reader, processor and writer type-check.
            .<Movie, MovieGenre>chunk(10)
            .reader(jsonItemReader())
            .processor(movieListItemProcessor())
            .writer(movieGenreWriter())
            .build();
}
This is just one listener; others work in a similar way. For example, JobExecutionListener executes before and after the job, and the job execution has its own ExecutionContext for storing job-related information. Running the job produces the following output.
Logs
2019-08-31 15:11:06.163 INFO 24381 --- [ main] o.s.b.a.b.JobLauncherCommandLineRunner : Running default command line with: []
2019-08-31 15:11:06.214 INFO 24381 --- [ main] o.s.b.c.l.support.SimpleJobLauncher : Job: [FlowJob: [name=movieJob]] launched with the following parameters: [{run.id=1}]
2019-08-31 15:11:06.226 INFO 24381 --- [ main] o.s.batch.core.job.SimpleStepHandler : Executing step: [movieStep]
Step name:movieStep Started
Step name:movieStep Ended. Running time is 3340 milliseconds.
Read Count:28795 Write Count:28795
2019-08-31 15:11:09.572 INFO 24381 --- [ main] o.s.b.c.l.support.SimpleJobLauncher : Job: [FlowJob: [name=movieJob]] completed with the following parameters: [{run.id=1}] and the following status: [COMPLETED]
2019-08-31 15:11:09.575 INFO 24381 --- [ Thread-5] com.zaxxer.hikari.HikariDataSource : HikariPool-1 - Shutdown initiated...
2019-08-31 15:11:09.577 INFO 24381 --- [ Thread-5] com.zaxxer.hikari.HikariDataSource : HikariPool-1 - Shutdown completed.
3. Tasklet
In this section, we will see another form of Spring Batch step: the Tasklet step. It comes in handy when the flow does not fit the reader, processor and writer pattern. A tasklet is a single step that executes with the same safety guarantees of restartability and fault tolerance.
ListStep
@Bean
public Step listStep() {
    return stepBuilderFactory.get("listStep")
            .tasklet((stepContribution, chunkContext) -> {
                Resource directory = new FileSystemResource("out");
                System.out.println(directory.getFile() + " directory is available");
                for (File file : directory.getFile().listFiles()) {
                    System.out.println(file.getName() + " is available");
                }
                return RepeatStatus.FINISHED;
            })
            .build();
}
- A simple TaskletStep named listStep is created.
- The tasklet takes two parameters, StepContribution and ChunkContext:
  - StepContribution carries what this tasklet contributes to the current step execution, such as updated counts and the exit status.
  - ChunkContext provides context around the chunk being processed and gives access to the step context.
- The current step looks at the output directory and lists all the files inside the directory.
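The tasklet in the listing returns RepeatStatus.FINISHED, which tells the step that the work is done. A tasklet that returns RepeatStatus.CONTINUABLE instead is invoked again. The plain-Java sketch below models that contract. TaskletLoopSketch, Status and Task are hypothetical stand-ins, not the Spring Batch types.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the tasklet repeat contract; not Spring Batch code.
final class TaskletLoopSketch {

    enum Status { CONTINUABLE, FINISHED }

    interface Task {
        Status execute();
    }

    // Calls the task until it reports FINISHED and returns the number of calls.
    static int runUntilFinished(Task task) {
        int calls = 0;
        Status status;
        do {
            status = task.execute();
            calls++;
        } while (status == Status.CONTINUABLE);
        return calls;
    }

    public static void main(String[] args) {
        // A one-shot task, like the directory listing in this article.
        System.out.println(runUntilFinished(() -> Status.FINISHED)); // 1

        // A task that asks to be called again twice before finishing.
        AtomicInteger remaining = new AtomicInteger(2);
        System.out.println(runUntilFinished(
                () -> remaining.getAndDecrement() > 0 ? Status.CONTINUABLE : Status.FINISHED)); // 3
    }
}
```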
Job Definition
@Bean
public Job movieJob(Step movieStep, Step listStep) {
    return jobBuilderFactory.get("movieJob")
            .incrementer(new RunIdIncrementer())
            .flow(movieStep)
            .next(listStep)
            .end()
            .build();
}
We wire listStep into movieJob in the above code snippet, chaining it after movieStep. Listing the out directory verifies that the output CSV file was created.
Logs
...
2019-08-31 15:12:07.472 INFO 24390 --- [ main] o.s.batch.core.job.SimpleStepHandler : Executing step: [listStep]
out directory is available
movies.csv is available
2019-08-31 15:12:07.473 INFO 24390 --- [ main] o.s.b.c.l.support.SimpleJobLauncher : Job: [FlowJob: [name=movieJob]] completed with the following parameters: [{run.id=1}] and the following status: [COMPLETED]
2019-08-31 15:12:07.476 INFO 24390 --- [ Thread-5] com.zaxxer.hikari.HikariDataSource : HikariPool-1 - Shutdown initiated...
2019-08-31 15:12:07.478 INFO 24390 --- [ Thread-5] com.zaxxer.hikari.HikariDataSource : HikariPool-1 - Shutdown completed.
4. Download the Source Code
You can download the full source code of this example here: Spring Batch Introduction Example