Batch Processing with Spring Batch

I have blathered on about Spring Batch a few times in the past.

The June 2009 edition of GroovyMag carried my article on how to use Spring Batch with Groovy; it’s republished (with permission) here. This was my second article for GroovyMag; I have also republished the first one, on Spring Integration with Groovy.

As I lurk on the Groovy and Grails mailing lists, I see a real need “out there” for this sort of infrastructure. Hopefully, this article will contribute to a small improvement in awareness. Writing code should, after all, be the last resort and not the first…

The source code is available, of course!

Batch Processing with Spring Batch
Dealing with Large Volumes of Data using Spring Batch and Groovy

Even though a major focus of modern ideas such as Service Oriented Architectures and Software As A Service is to facilitate and enhance live interactions between systems, batch processing remains as important and widely-used as ever: practically every significant project contains a batch processing component. Until the arrival of Spring Batch (version 1.0 was released in March, 2008), no widely available, open source, reusable architecture framework for batch processing had existed; batch processing had always been approached on an ad-hoc basis. This article will examine how Groovy joins with Spring Batch to ease the pain of dealing with large data sets.

A Brief Overview of Spring Batch

Spring Batch (SB) is a relatively new member of the Spring family of technologies. There is no better introduction than this excerpt from the documentation (see the “Learn More” section for the URL):

Spring Batch is a lightweight, comprehensive batch framework designed to enable the development of robust batch applications . . . Spring Batch builds upon the productivity, POJO-based development approach, and general ease of use capabilities people have come to know from the Spring Framework, while making it easy for developers to access and leverage more advance [sic] enterprise services when necessary . . . Spring Batch provides reusable functions that are essential in processing large volumes of records, including logging/tracing, transaction management, job processing statistics, job restart, skip, and resource management. It also provides more advance technical services and features that will enable extremely high-volume and high performance batch jobs though optimization and partitioning techniques. Simple as well as complex, high-volume batch jobs can leverage the framework in a highly scalable manner to process significant volumes of information.

SB provides a great deal of out-of-the-box functionality: very flexible adapters for reading from flat files, facilities for dealing with JMS queues, JDBC-based database adapters (of course), along with a simple workflow ability allowing conditional, repetitive and parallel processing. Sophisticated error handling/recovery capabilities and simple job control round out the package. Following the standard Spring Framework convention of “don’t reinvent the wheel,” SB does not include a complete set of scheduling/job control tools, but works instead in conjunction with existing Spring-friendly schedulers such as Quartz and Tivoli. Over time, SB will make more use of other members of the Spring family, Spring Integration in particular, and this pairing should make for a formidable partnership.

SB was seeded and driven by technology and developers from Accenture–as well as SpringSource and the general Open Source community–and so claims to represent the distillation of a fair bit of experience with “real world” needs and situations.

A Small Example Application

Because SB is rooted in the standard Spring Framework, it is quite compatible with Groovy (and is also simple to integrate into Grails). As a demonstration, I’ll implement a simple batch processing job, as follows: read and parse an input data file (where each record is formatted according to a custom multiline format); validate each record (rejecting invalid records and writing the relevant record to a dump file); apply a specific transformation to each remaining (valid) record; and finally, write the valid, transformed data into an XML file.

Figure 1 provides a simplified, high-level view of the application’s overall workflow.

Figure 1: Simplified, high-level view of the workflow in the example application

Figure 1 shows how an application initiates a SB job. The job is composed of a number of steps, which are themselves composed of a number of substeps or tasklets. Tasklets are the building blocks of the application and may read/write/validate/transform or otherwise ‘munge’ the data.

Even though this is a very simple batch-processing task, there are a number of tricky areas. Consider the input data file format (an example is shown in Listing 1): this format contains data split into logical sections (customer ID, contact information, credit card information, customer balance), with one section to a line. Each line is a CSV-formatted record but some fields may themselves contain fixed-width records. The whole record is “book-ended” by BEGIN/END markers, each of which must have the same record number. The first two lines of the file are free-form commentary.

; this is a nasty file format
; it's going to be a challenge to process!
CARD,visa:1234 1234 4321 4321:000

Listing 1: Example input data record

Processing this data file is going to be quite a challenge and it is probably worth taking some time to consider how you would tackle this task in plain Java or Groovy.

It is safe to assume that the input data will contain numerous errors, which makes validation and error handling a priority for this application. Validation is a common, necessary chore that is not difficult but is tedious and error-prone; by relying on standard Spring technologies, SB helps simplify this task.

Another minor challenge is producing the output XML document. Listing 2 shows how the record given in Listing 1 should be written.

<?xml version="1.0" encoding="UTF-8"?>
  <customer sequence="0000000001">
    <number>1234 1234 4321 4321</number>

Listing 2: The resultant XML-formatted data record

As is always the case with infrastructures and frameworks, one often gets the feeling of overkill when working with a simple example such as the one in this article. Keep in mind however that as a problem gets bigger, a framework becomes more and more necessary. Remember also that SB was written with these large problem spaces in mind and so you may have difficulty seeing SB’s full potential using this one simple example application.

The Driver Application

The example application uses a small command-line application to kick off the real batch job. SB actually provides a simple command-line launcher to do this, but it is instructive to see how to deal with a SB job by hand. As Listing 3 shows, the launcher is a very simple, typical Spring Framework-aware application.


import o.s.batch.core.JobParametersBuilder
import o.s.context.support.ClassPathXmlApplicationContext

public class SpringBatch {

  public static void main(String[] args) {
    // load the SB infrastructure and the job definition from the classpath
    def context =
      new ClassPathXmlApplicationContext(['applicationContext.xml', 'job.xml'] as String[], true)

    def jobLauncher = context.getBean('jobLauncher')

    def job = context.getBean('job')

    def adjustmentPercent = 1.01D

    // run the job, passing the adjustment percentage as a job parameter
    def jobExecution = jobLauncher.run(job,
        new JobParametersBuilder().
          addDouble("adjustment.percent", adjustmentPercent).toJobParameters())
    jobExecution.with {
      println """
Job: $jobId
StartTime: $startTime; EndTime: $endTime
Duration: ${endTime.time - startTime.time} ms
"""
      stepExecutions.each { println "STEP: $it" }
    }
  }
}

Listing 3: The Groovy driver application

Note: throughout this article, the package name prefix ‘org.springframework’ is abbreviated to ‘o.s’ to reduce line length and aid formatting and readability.

Of interest here is the creation of the Spring application context. The application is looking for two files on its classpath: applicationContext.xml defines the boilerplate SB infrastructure and job.xml defines the SB job itself. This is a standard Spring development technique. I’ll look at these files in more detail later.

The job instance is created by the Spring application context and is looked up by the application itself; this is the real meat of the batch definition, as we shall see.

The jobLauncher instance obtained from the application context is part of SB. As the name suggests, it is concerned with mediating access to a SB job instance. In this case, it will control execution of the job instance defined in job.xml and retrieved by the application. The jobLauncher's run method returns a SB JobExecution instance that allows the application to determine the state of the associated job and its composite steps.

JobParametersBuilder provides a way of defining a map of parameters that can be passed into a job and subsequently accessed by the various steps and tasklets.

Listing 4 shows (slightly edited) the application in action.

Job: 0
StartTime: Fri May 01 16:25:50 EST 2009; EndTime: Fri May 01 16:25:54 EST 2009
Duration: 3117 ms
STEP: StepExecution: id=0, name=startupStep, status=COMPLETED,. . .
STEP: StepExecution: id=1, name=processStep, status=COMPLETED,. . .
STEP: StepExecution: id=2, name=endingStep, status=COMPLETED, exitStatus=COMP. . .

Listing 4: Output from executing the application

Boilerplate Configuration

SB requires a certain amount of standard configuration to be put in place in preparation for job execution.

Listing 5 excerpts the relevant portion of the applicationContext.xml file that contains this configuration.

<bean id="transactionManager"

<bean id="jobRepository"

<bean id="jobLauncher"
      p:jobRepository-ref="jobRepository" />

Listing 5: Minimal-functionality SB configuration

This is the least sophisticated SB configuration possible. This configuration establishes a no-op transaction manager and a pure in-memory jobRepository. The latter configuration option means that no distribution, persistence or job restart capabilities are available. For the purposes of this application, this configuration is sufficient.

In this example application, all processing is sequential and synchronous with the application; however, it is possible to configure the jobLauncher instance to execute a job asynchronously to the application. An asynchronous configuration would be appropriate if using SB in conjunction with a Grails/AJAX application, which could initiate a job and then poll for status and update a visual progress indicator until the job completes.
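The launch-then-poll idea can be sketched in plain Java without any SB classes. Everything named here (AsyncLaunchSketch, runJob, launchAndPoll, the record counts) is hypothetical and exists only for illustration; in a real application the asynchronous behaviour would come from the way jobLauncher is configured, not from a hand-rolled executor.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an asynchronously launched job; not part of Spring Batch.
class AsyncLaunchSketch {

  // Pretends to process `total` records, publishing progress as it goes.
  static int runJob(int total, AtomicInteger progress) {
    for (int i = 0; i < total; i++) {
      progress.incrementAndGet();
    }
    return total;
  }

  // Starts the job on a background thread, then polls until it completes.
  static int launchAndPoll(int total) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    AtomicInteger progress = new AtomicInteger();
    try {
      Future<Integer> execution = executor.submit(() -> runJob(total, progress));
      while (!execution.isDone()) {
        // A web front end would report progress.get() to the browser here.
        Thread.sleep(10);
      }
      return execution.get();
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException("Job did not complete", e);
    } finally {
      executor.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println("Records processed: " + launchAndPoll(42));
  }
}
```

The caller never blocks on the job itself; it only asks whether the job is done, which is the property a progress indicator needs.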

The Job Definition

The keystone of this application is actually the job.xml application context file. Because this is quite long, I will go through it in sections.

Note: The listings shown here have been edited and excerpted to save space (while hopefully remaining clear).

The full source code for this example is supplied with this edition of GroovyMag, of course.

Listing 6 shows the place where it all begins: the job definition.

<batch:job id="job">
  <batch:step id="startupStep" next="processStep">
    <batch:tasklet ref="logStartupMessage"/>
  <batch:step id="processStep" next="endingStep">
      <batch:chunk skip-limit="100000"
          <batch:stream ref="errorItemWriter"/>
        <batch:listener ref="skipListener"/>
  <batch:step id="endingStep">
    <batch:tasklet ref="logEndingMessage"/>

Listing 6: The job definition

This definition constructs a three-stage processing pipeline. While the first and last steps merely print a status message, the middle step (with id=processStep) is the most important and what I focus on here. The processStep step identifies the various input and output processors and also defines the intermediate transformations/processes that will be executed on each record.

An important SB concept introduced here is that of a chunk. A chunk defines the processing that is to be done within a transaction boundary: a batch of records that may be written/rolled-back as a whole, for instance. For this application, each record is treated as constituting a separate chunk and so error handling, etc., is done on a per-record basis.
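To make the chunk idea concrete, here is a minimal plain-Java sketch of a read/process/write loop with a commit interval and a skip handler. It is my own illustration of the concept, not SB's implementation: the ChunkSketch class, its run method and all of its parameters are hypothetical, and a list hand-off stands in for a real transaction.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical illustration of chunk-oriented processing; these are not SB classes.
class ChunkSketch {

  // Reads items, processes each one, and hands them to the writer in chunks
  // of at most commitInterval items. An item whose processing throws is passed
  // to the skip handler instead of being written. Returns the chunk count.
  static <I, O> int run(Iterator<I> reader,
                        Function<I, O> processor,
                        Consumer<List<O>> writer,
                        Consumer<I> onSkip,
                        int commitInterval) {
    int chunksWritten = 0;
    List<O> chunk = new ArrayList<>();
    while (reader.hasNext()) {
      I item = reader.next();
      try {
        chunk.add(processor.apply(item));
      } catch (RuntimeException e) {
        onSkip.accept(item); // the role a skip listener plays
        continue;
      }
      if (chunk.size() == commitInterval) {
        writer.accept(new ArrayList<>(chunk)); // one "transaction"
        chunk.clear();
        chunksWritten++;
      }
    }
    if (!chunk.isEmpty()) { // flush the final, partial chunk
      writer.accept(new ArrayList<>(chunk));
      chunksWritten++;
    }
    return chunksWritten;
  }

  public static void main(String[] args) {
    List<String> skipped = new ArrayList<>();
    List<List<Integer>> written = new ArrayList<>();
    Function<String, Integer> parse = Integer::valueOf;
    Consumer<List<Integer>> writer = written::add;
    Consumer<String> onSkip = skipped::add;
    // A commit interval of 1 mirrors this article's one-record-per-chunk setup.
    int chunks = run(List.of("1", "2", "x", "4").iterator(), parse, writer, onSkip, 1);
    System.out.println(chunks + " chunks written, skipped " + skipped);
  }
}
```

With a commit interval of 1, a bad record costs exactly one record; with a larger interval, throughput usually improves but more work is at stake in each transaction.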

The batch:streams, batch:listeners and batch:skippable-exception-classes configuration elements are all related to the way that erroneous input records are handled. This will be looked at later.

Processing Step One: Input

Listing 7 defines the itemReader bean (and some of the necessary associated configuration), which deals with reading and parsing the multiline record from the input data file.

<bean id="rawDataResource" class="">
  <constructor-arg value="resource/data/inputdata.dat"/>

<bean id="itemReader" class="">
  <property name="flatFileItemReaderDelegate">
    <bean class="o.s.batch.item.file.FlatFileItemReader"
      <property name="lineMapper">
        <bean class="o.s.batch.item.file.mapping.DefaultLineMapper"
          <property name="fieldSetMapper">
            <bean class="o.s.batch.item.file.mapping.PassThroughFieldSetMapper"/>

<bean id="multilineFileTokenizer"
  <property name="tokenizers">
      <entry key="BEGIN*" value-ref="beginLineTokenizer"/>
      <entry key="CUST*" value-ref="customerLineTokenizer"/>
      <entry key="CONT*" value-ref="contactLineTokenizer"/>
      <entry key="CARD*" value-ref="cardLineTokenizer"/>
      <entry key="BAL*" value-ref="balanceLineTokenizer"/>
      <entry key="END*" value-ref="endLineTokenizer"/>

<bean id="csvLineTokenizer"

<bean id="fixedLineTokenizer"

<bean id="beginLineTokenizer"

<bean id="endLineTokenizer"

<bean id="customerLineTokenizer"

Listing 7: Input record handling

Take your time to read through this code; the itemReader is quite a sophisticated piece of infrastructure. As you piece things together, you will see how almost everything is delegated to standard SB classes: the actual reading of the file and skipping the comment lines is deferred to a FlatFileItemReader, and the recognition and handling of the various logical sections in a record are handled by the PatternMatchingCompositeLineTokenizer class (which itself defers to a number of line tokenizers). In fact, the only custom activity here is the mapping of the parsed data to a simple application-specific class via the MultilineRecordReader class. A tremendous amount of processing is being performed here with very little development effort.
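The dispatch that the composite tokenizer performs can be pictured with a small plain-Java sketch: each line's prefix selects the tokenizer that splits it. This is only an illustration of the idea, not SB's code. PrefixTokenizerSketch and its methods are hypothetical, it matches simple prefixes rather than SB's wildcard patterns, and the colon-separated layout assumed for the CARD line is my reading of Listing 1.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of prefix-based tokenizer dispatch; not SB's own code.
class PrefixTokenizerSketch {

  // Insertion-ordered so that the first registered matching prefix wins.
  private final Map<String, Function<String, List<String>>> tokenizers =
      new LinkedHashMap<>();

  // Registers the tokenizer to use for lines starting with the given prefix.
  PrefixTokenizerSketch register(String prefix,
                                 Function<String, List<String>> tokenizer) {
    tokenizers.put(prefix, tokenizer);
    return this;
  }

  // Delegates to the tokenizer of the first prefix that matches the line.
  List<String> tokenize(String line) {
    for (Map.Entry<String, Function<String, List<String>>> entry
        : tokenizers.entrySet()) {
      if (line.startsWith(entry.getKey())) {
        return entry.getValue().apply(line);
      }
    }
    throw new IllegalArgumentException("No tokenizer for line: " + line);
  }

  // Assumed layout for the demo: CARD fields are split on ',' and then ':';
  // the empty prefix matches every other line and falls back to plain CSV.
  static PrefixTokenizerSketch demo() {
    return new PrefixTokenizerSketch()
        .register("CARD", line -> Arrays.asList(line.split("[,:]")))
        .register("", line -> Arrays.asList(line.split(",")));
  }

  public static void main(String[] args) {
    System.out.println(demo().tokenize("CARD,visa:1234 1234 4321 4321:000"));
  }
}
```

Each tokenizer stays trivially simple; the complexity of the file format is absorbed by the table that maps prefixes to tokenizers, which is exactly what the XML configuration expresses.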

Note how the use of abstract parent definitions (e.g., fixedLineTokenizer), makes it possible to write clearer configurations for several elements (e.g., beginLineTokenizer/endLineTokenizer). This is a standard Spring technique that helps to keep the configuration file DRY (i.e., without unnecessary repetition).

Again, consider how much effort would be involved if you had to do all this by hand.

Listing 8 and Listing 9 show the MultilineRecord and MultilineRecordReader classes.


public class MultilineRecord {
  String sequence
  String endSequence

  Long id

  String mobile
  String landline
  String email

  String provider
  String number
  String security

  BigDecimal balance

  @Override public String toString() { …elided… }
}

Listing 8: The MultilineRecord class


import …elided…

public class MultilineRecordReader implements
  ItemReader<MultilineRecord>, ItemStream {
  private FlatFileItemReader<FieldSet> flatFileItemReaderDelegate

  public MultilineRecord read() throws Exception {
    MultilineRecord mlr = null

    // flags to indicate the presence of component lines
    def customerSeen = false
    def contactsSeen = false
    def ccSeen = false
    def balSeen = false

    def line
    // pull tokenized lines from the delegate until END or end of input
    while (line = flatFileItemReaderDelegate.read()) {
      String prefix = line.readString(0)
      switch (prefix) {
        case 'BEGIN':
          mlr = new MultilineRecord(sequence: line.readString(1))
          break

        default:
          Assert.notNull(mlr, "MultilineRecord not yet initialised")
          switch (prefix) {
            case 'CUST':
              mlr.id = line.readLong(1)
              customerSeen = true
              break

            case 'CONT':
              mlr.with {
                mobile = line.readString(1)
                landline = line.readString(2)
                email = line.readString(3)
              }
              contactsSeen = true
              break

            case 'CARD':
              mlr.with {
                provider = line.readString(1)
                number = line.readString(2)
                security = line.readString(3)
              }
              ccSeen = true
              break

            case 'BAL':
              mlr.balance = line.readBigDecimal(1)
              balSeen = true
              break

            case 'END':
              // check all record fields seen
              Assert.isTrue(mlr && customerSeen &&
                            contactsSeen && ccSeen && balSeen,
                  "Incomplete Record Found")
              mlr.endSequence = line.readString(1)
              return mlr
          }
      }
    }
    null  // no more input
  }

  … elided …
}

Listing 9: The MultilineRecordReader class

The MultilineRecordReader class is responsible for allocating the various fields of each tokenized line (offered as SB FieldSets) to a single new instance of a MultilineRecord. It also performs some minor validation to ensure that there are no missing logical sections. Note how Groovy’s versatile switch statement and the with method that Groovy adds to every object make the processing much clearer than it would be in plain Java.

Processing Step Two: Output

As every first-year CompSci student learns, after input comes processing, followed by output. Naturally, I am not going to follow this sequence! In an attempt to achieve a less convoluted narrative, I’ll now look at output processing (the third stage of the processStep step). Listing 10 shows the requisite configuration for the itemWriter.

<bean id="itemWriter"
      p:marshaller-ref="multiLineRecordMarshaller" p:rootTagName="customers"

<bean id="multiLineRecordMarshaller" class="o.s.oxm.xstream.XStreamMarshaller">
  <property name="useAttributeFor">
      <entry key="sequence">
        <value type="java.lang.Class">java.lang.String</value>
  <property name="omittedFields">
      <entry key="" value="endSequence"/>
  <property name="aliases">
      <entry key="customer"

<bean id="processedOutputResource" class="">
  <constructor-arg value="resource/data/job-output.xml"/>

Listing 10: Output XML processing

The itemWriter configuration is quite straightforward. Output handling is actually delegated to a marshaller instance. In this case, the Spring OXM project is brought to bear to simplify XML generation. The XStreamMarshaller is only very minimally configurable (but quite performant…the age-old tradeoff): it is possible to render the sequence field as an XML attribute and not to render the endSequence field at all, but that is pretty much all the configuration possible.
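The shape of the output that this marshaller configuration aims for can be reproduced by hand with the JDK's own StAX writer. To be clear, this sketch swaps XStream for javax.xml.stream purely to stay dependency-free. The CustomerXmlSketch class and its toXml method are hypothetical, and only the two fields visible in Listing 2 are written.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical sketch using the JDK's StAX API instead of XStream.
class CustomerXmlSketch {

  // Renders one customer: sequence becomes an attribute, number becomes a
  // child element, and endSequence is simply never written.
  static String toXml(String sequence, String number) {
    try {
      StringWriter out = new StringWriter();
      XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
      xml.writeStartElement("customers"); // the configured root tag name
      xml.writeStartElement("customer");
      xml.writeAttribute("sequence", sequence);
      xml.writeStartElement("number");
      xml.writeCharacters(number);
      xml.writeEndElement(); // number
      xml.writeEndElement(); // customer
      xml.writeEndElement(); // customers
      xml.close();
      return out.toString();
    } catch (XMLStreamException e) {
      throw new IllegalStateException("Could not write XML", e);
    }
  }

  public static void main(String[] args) {
    System.out.println(toXml("0000000001", "1234 1234 4321 4321"));
  }
}
```

Writing this by hand for every field is exactly the chore that the declarative marshaller configuration removes.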

Processing Step Three: Validation and Transformation

Now that we’ve seen the mechanics of writing a file, it is time to move on to (or more accurately: back to) the middle processing step.

Listing 11 shows the configuration of the processing tasks.

<bean id="compositeItemProcessor"
  <property name="itemProcessors">
      <ref local="validatingItemProcessor"/>
      <ref local="embiggenProcessor"/>

<bean id="validatingItemProcessor"
  <constructor-arg ref="validator"/>

<bean id="validator"
  <property name="validator">
    <bean id="luhnValidator"
      <property name="customFunctions">
          <entry key="luhn" value=""/>
      <property name="valang">
{ id : ? > 0 :
  'id field must be a natural number' }
{ endSequence : ? is not blank :
  'end sequence number field missing' }
{ sequence : ? is not blank :
  'begin sequence number field missing' }
{ endSequence : endSequence == sequence :
  'mismatched begin/end sequence numbers' }
{ mobile : match('\\d{10}',?) == true :
  'mobile field must be 10 digits' }
{ landline : match('\\d{8,10}',?) == true :
  'landline field must be 8-10 digits' }
{ provider : ? in 'amex', 'visa', 'macd' :
  'card provider field should be one of "visa", "amex" or "macd"' }
{ number : match('\\d{4}[ ]\\d{4}[ ]\\d{4}[ ]\\d{4}',?) == true :
  'card number field must match the format "xxxx xxxx xxxx xxxx"' }
{ number : luhn(?) == true :
  'card number luhn check failed' }
{ security : match('\\d{3}',?) == true :
  'card security field must be 3 digits' }
{ email : ? is not blank :
  'email field missing' }
{ email : email(?) == true :
  'email field is not a valid email address' }
{ balance : ? is not blank :
  'balance field missing' }

<bean id="errorOutputResource"
  <constructor-arg value="resource/data/errors.txt"/>

<bean id="passthroughLineAggregator"

<bean id="errorItemWriter"

<bean id="skipListener" class=""

<bean id="embiggenProcessor"

Listing 11: Configuration of processing tasks

For this application, the middle processing part of the pipeline is itself a two-stage composite process: first comes validation (and possibly rejection of invalid data), followed by transformation. This is configured through the compositeItemProcessor bean (which is referenced by the processStep step in Listing 6).
SB allows the use of any of the available Spring-compatible validation systems. For this example I have chosen to use the Valang Spring module, because of its clear declarative nature.

The validator bean contains a plaintext valang property which defines a series of expressions that should be evaluated against the various properties of the bean to which it is applied. For example:

{ id : ? > 0 : 'id field must be a natural number' }

If the given expression evaluates to false, validation fails and the associated message is added to the list of errors being maintained by the validator bean.
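The evaluate-and-collect behaviour just described is easy to picture in plain Java: each rule names a field, a test and a message, and every failing rule contributes its message to the error list. This is a sketch of the idea only and says nothing about how Valang is implemented. The RuleSketch and Rule classes are hypothetical, and only two of the rules from Listing 11 are modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of declarative validation rules; not Valang's implementation.
class RuleSketch {

  // One rule: the field it applies to, the test, and the failure message.
  static final class Rule {
    final String field;
    final Predicate<Object> test;
    final String message;

    Rule(String field, Predicate<Object> test, String message) {
      this.field = field;
      this.test = test;
      this.message = message;
    }
  }

  // Two rules modelled on Listing 11.
  static List<Rule> defaultRules() {
    List<Rule> rules = new ArrayList<>();
    rules.add(new Rule("id",
        v -> v instanceof Long && (Long) v > 0,
        "id field must be a natural number"));
    rules.add(new Rule("mobile",
        v -> v instanceof String && ((String) v).matches("\\d{10}"),
        "mobile field must be 10 digits"));
    return rules;
  }

  // Evaluates every rule; a rule whose test is false adds its message.
  static List<String> validate(Map<String, Object> record, List<Rule> rules) {
    List<String> errors = new ArrayList<>();
    for (Rule rule : rules) {
      if (!rule.test.test(record.get(rule.field))) {
        errors.add(rule.message);
      }
    }
    return errors;
  }

  public static void main(String[] args) {
    Map<String, Object> record = Map.of("id", 0L, "mobile", "0412345678");
    System.out.println(validate(record, defaultRules()));
  }
}
```

Note that all rules are evaluated even after one fails, so a rejected record carries the complete list of its problems.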

While Valang provides a series of standard functions (such as email, which checks to ensure that a field contains a valid email address), it cannot account for all possible requirements. It can be augmented via the customFunctions property, however, and it is this ability that allows me to define an application-specific function. To illustrate this, I’ll introduce a check for the validity of the credit card number field (using the so-called Luhn function; see “Learn More” for a reference to how this works), as is shown in Listing 12.


import …elided…

public class LuhnFunction extends AbstractFunction {

  public LuhnFunction(Function[] arg0, int arg1, int arg2) {
    super(arg0, arg1, arg2);
  }

  protected Object doGetResult(Object target) throws Exception {
    def str = getArguments()[0].getResult(target).toString()
    isValid(str)
  }

  public static boolean isValid(String cardNumber) {
    def sum = 0
    def addend = 0
    def timesTwo = false

    // Luhn counts from the rightmost digit, so walk the digits in reverse
    cardNumber.replaceAll(' ', '').reverse().each {dc ->
      def digit = Integer.valueOf(dc)
      if (timesTwo) {
        // double every second digit, folding two-digit results back to one
        addend = digit * 2
        if (addend > 9)
          addend -= 9
      } else {
        addend = digit
      }
      sum += addend
      timesTwo = !timesTwo
    }

    (sum % 10) == 0
  }
}

Listing 12: Luhn function class

This allows the use of the luhn() function as if it were a built-in Valang function:

{ number : luhn(?) == true : 'card number luhn check failed' }
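For readers who want to try the check outside of Valang, here is a standalone plain-Java sketch of the Luhn test. The LuhnSketch class name is hypothetical; the algorithm is the standard one: starting from the rightmost digit, double every second digit, subtract 9 from any result above 9, and require the total to be divisible by 10.

```java
// Hypothetical standalone version of the Luhn check; no Valang classes involved.
class LuhnSketch {

  // Returns true if the card number (spaces allowed) passes the Luhn check.
  static boolean isValid(String cardNumber) {
    String digits = cardNumber.replace(" ", "");
    if (digits.isEmpty()) {
      return false;
    }
    int sum = 0;
    boolean timesTwo = false; // the rightmost digit is not doubled
    for (int i = digits.length() - 1; i >= 0; i--) {
      int digit = Character.digit(digits.charAt(i), 10);
      if (digit < 0) {
        return false; // not a decimal digit
      }
      int addend = digit;
      if (timesTwo) {
        addend = digit * 2;
        if (addend > 9) {
          addend -= 9;
        }
      }
      sum += addend;
      timesTwo = !timesTwo;
    }
    return sum % 10 == 0;
  }

  public static void main(String[] args) {
    System.out.println(isValid("1234 1234 4321 4321")); // the number from Listing 1
  }
}
```

The card number from Listing 1 passes: its undoubled digits sum to 20 and its doubled digits to 40, giving 60.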

Valang is a powerful and effective validation framework. Its most important feature is probably that since it uses “near natural language” configuration, an application’s validation rules can be reviewed and changed by any appropriate product owner or business representative and this can make for a higher quality product. Valang is a standalone framework and well worth further study (see the “Learn More” section for the URL).

Refer back to Listing 6. You will see that in the event of a ValidationException, processing of a chunk is skipped. This application registers a listener for this situation that simply writes the offending record to a configured itemWriter. Listing 13 shows the appropriate class.


import org.springframework.batch.core.listener.SkipListenerSupport

public class SkipListener extends SkipListenerSupport<MultilineRecord, Object> {

  // the item writer that receives rejected records
  def writer

  public void onSkipInProcess(MultilineRecord item, Throwable t) {
    writer.write ( [ item ] )
  }
}

Listing 13: Error handling skip listener class

The configured itemWriter has an associated line aggregator that performs preprocessing before the actual write takes place. In this case, the PassThroughLineAggregator class simply performs a toString operation on the presented item, as Listing 14 shows.


import org.springframework.batch.item.file.transform.LineAggregator

public class PassThroughLineAggregator implements LineAggregator<MultilineRecord> {
  public String aggregate(MultilineRecord item) {
    item.toString()
  }
}

Listing 14: PassThroughLineAggregator class

The second part of the composite processing step shown in Listing 11 deals with transforming the now-valid data record. This particular transformation represents a business rule: as each record is transformed, its balance should be adjusted by a certain percentage. This requires a small piece of Groovy, the EmbiggenProcessor. Listing 15 shows this simple class in its entirety.


import org.springframework.batch.item.ItemProcessor

public class EmbiggenProcessor
  implements ItemProcessor<MultilineRecord, MultilineRecord> {

  // the adjustment factor, injected via job.xml
  def percent

  public MultilineRecord process(MultilineRecord mlr) throws Exception {
    mlr.balance *= percent
    mlr
  }
}


Listing 15: The EmbiggenProcessor class

If you refer back to Listing 3 and Listing 11, you will see how the percent value is injected into the EmbiggenProcessor class from the application via the job.xml file.


To paraphrase Benjamin Disraeli, first Earl of Beaconsfield: “there are lies, damn lies and performance measurements.” Since many batch jobs deal with very large data sets, the performance of SB is bound to be of paramount interest to some. I am a coward! I am going to merely dip my toe into this potentially turbulent topic and just let you know that on my laptop, the complete processing of 42,500 records took 92,381 ms. That makes about 2.17 ms per record. Not too shabby in my opinion–but of course, your mileage may vary.
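For anyone who wants to compare their own runs, the arithmetic behind that figure is simple enough to capture in a few lines of Java. The ThroughputSketch class and its method names are hypothetical; the two inputs are the record count and elapsed time quoted above.

```java
// Hypothetical helper for turning a run's totals into per-record figures.
class ThroughputSketch {

  // Average processing time per record, in milliseconds.
  static double millisPerRecord(long records, long elapsedMillis) {
    return (double) elapsedMillis / records;
  }

  // Average number of records processed per second.
  static double recordsPerSecond(long records, long elapsedMillis) {
    return records * 1000.0 / elapsedMillis;
  }

  public static void main(String[] args) {
    System.out.printf("%.2f ms per record%n", millisPerRecord(42_500, 92_381));
    System.out.printf("%.0f records per second%n", recordsPerSecond(42_500, 92_381));
  }
}
```

Expressed the other way round, the same run works out to roughly 460 records per second.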

Wrapping Up

You’ve now walked through a complete SB application. A large proportion of the application is declarative configuration. I like this style of working: it reduces the amount of ‘real’ coding that is required and thus minimizes the opportunity for error. For those parts of the problem space that are not directly covered by the standard components, Groovy has proved to be a very effective tool; with minimal effort and coding, it has let me concentrate on creating a clear solution, which also minimizes the opportunity to introduce bugs.

I continue to find it impressive that Groovy—a dynamic language—can still effectively work with highly-typed interfaces such as those exposed by SB: checked exceptions, strongly typed parameter lists, and even generic classes can be handled with ease. At the same time, Groovy can let me work with frameworks such as Valang without forcing me to deal with ‘nasties’ such as adapter classes, proxies, out-of-process adapters, etc. This means that an inveterate Java programmer like myself can continue to apply his existing skillset—with its hard-won collection of lessons learned, tricks and tips picked up over time—while also taking advantage of the productivity and ease of use of Groovy’s modern dynamic language. In my not-so-humble opinion, this is important and will surely contribute to a rapid increase in Groovy’s popularity.

Learn More

Spring Batch
Spring Valang module
Spring OXM
The Luhn Function
