Sunday, September 23, 2012

Java Framework for Batch Applications

Batch applications are very common in enterprises today. They continue to have a role even in the modern enterprise landscape, probably because they reflect the real-life nature of the work we do in organizations. For example, cheques are processed at the end of the day, orders are processed on a schedule, and hotel bills are settled at the end of the day. Hence batch frameworks will remain relevant in advanced landscapes too. Across the enterprise, many scenarios demand processing once a day or even less frequently. For example:
1. Report generation for management information,
2. Database synchronization at the end of the day,
3. Extraction of report information for industry feed etc.
The way batch processing is applied may change, however. Many legacy systems and mainframes process information in a batch model to handle workloads efficiently.

In enterprises, there can be different mechanisms to handle batch processing. Typically, ETL tools are used in source-to-target information exchange that involves translation and data-quality needs. These tools provide out-of-the-box features to manage transformation needs. But there are also simple programs that are scheduled to run at specific intervals to handle the batch processing needed. For example:
1. Standalone Java programs scheduled on Unix with Autosys or Unix scheduler
2. Standalone Unix shell scripts that are run on the Unix server periodically.
3. Database stored procedures that produce flat files for report generation. They can be efficient at data synchronization, being close to the target data store.

Let's look at the important capabilities needed in a batch application:
1. Ability to process large files within the available time window,
2. Ability to handle files or information in different formats,
3. Ability to process information in parallel and multi-threaded models,
4. Flexibility and extensibility to accommodate customized transformation, persistence and conversion logic,
5. Ability to persist meta-data to enable auditing, monitoring and management,
6. Ability to provide sync-points to enable restart,
7. Ability to handle data errors and exceptions gracefully,
8. Support remote monitoring, administration and manageability.
9. Support testing of batch models in modular fashion.
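
The sync-point capability in the list above can be illustrated with a minimal sketch in plain Java. This is not Spring Batch code: the class `CheckpointSketch`, the `Checkpoint` holder and the `failAtIndex` parameter are hypothetical names invented for this illustration, and an in-memory object stands in for persisted meta-data. The idea is only that the checkpoint advances after a whole chunk is committed, so a restart resumes at the first uncommitted item.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; every name here is hypothetical.
final class CheckpointSketch {

    /** Stands in for persisted job meta-data: index of the next unprocessed item. */
    static final class Checkpoint {
        int nextIndex = 0;
    }

    /**
     * Processes the input in chunks, starting at the checkpoint.
     * failAtIndex simulates a data error at that item (use -1 for no failure).
     */
    static void run(List<String> input, List<String> output,
                    Checkpoint checkpoint, int chunkSize, int failAtIndex) {
        while (checkpoint.nextIndex < input.size()) {
            int end = Math.min(checkpoint.nextIndex + chunkSize, input.size());
            List<String> chunk = new ArrayList<>();
            for (int i = checkpoint.nextIndex; i < end; i++) {
                if (i == failAtIndex) {
                    throw new IllegalStateException("simulated failure at item " + i);
                }
                chunk.add(input.get(i).toUpperCase(Locale.ROOT));
            }
            output.addAll(chunk);        // "commit" the whole chunk
            checkpoint.nextIndex = end;  // record the sync-point only after the commit
        }
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "c", "d", "e");
        List<String> output = new ArrayList<>();
        Checkpoint checkpoint = new Checkpoint();
        try {
            run(input, output, checkpoint, 2, 3);   // first attempt fails mid-chunk
        } catch (IllegalStateException e) {
            System.out.println("Failed; will restart at item " + checkpoint.nextIndex);
        }
        run(input, output, checkpoint, 2, -1);      // restart from the sync-point
        System.out.println(output);
    }
}
```

The partially built chunk is discarded on failure, so the restart reprocesses it from its first item instead of duplicating or skipping data.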

Developing a custom batch program to suit the needs of a project is always an option, but having a consistent model and framework for building batch programs would benefit the organization immensely in the long term. Open-source frameworks are maturing to fill this space within the enterprise. One of them is the Spring Batch 2.x framework, which is mature enough to be used in the enterprise landscape.

Spring Batch is an excellent framework for batch processing. It provides abstractions that represent the batch model of processing in an easy, understandable, modular and reusable fashion. With its latest version, Spring Batch 2.x, it has built capabilities that arguably place it among the top ten open-source frameworks worth using in the enterprise landscape.

The following features make a case for adopting the Spring Batch framework:
1. It clearly defines separate layers: Infrastructure, Batch Core and Application, with abstractions that are extensible, reusable and testable.
2. Being part of the Spring family, it retains all the benefits of using Spring, with extended capabilities.
3. Support for batch state and meta-data persistence, which helps in monitoring and restartability.
4. Out-of-the-box support for a variety of ItemReaders, ItemWriters and Tasklets.
5. Support for different scalability models: single-process multi-threaded execution, and partitioned multi-process execution with remote processing.
6. A well-defined XML-based domain language for batch applications that supports dependency injection, customization and reuse.
7. A batch data model that supports meta-data persistence and monitoring.
8. Restartability: the ability to restart processing from the next data item in the source. This adds to the resilience of the solution, a much-desired feature.
9. A variety of job launching options: command-line launch, or programmatic launch on message arrival or an event.
10. Monitoring and management of batch jobs for the job operator, including real-time monitoring.
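
To make the reader/writer abstraction in the list above more concrete, here is a minimal plain-Java sketch of a chunk-oriented read, process, write loop. The interfaces `Reader`, `Processor` and `Writer` and the class `ChunkStepSketch` are simplified, hypothetical stand-ins written for this post; they only loosely mirror the roles of ItemReaders and ItemWriters and are not the framework's actual API. The conventions that `null` from the reader means end of input and `null` from the processor means "filter this item out" are assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrative sketch only; these interfaces are hypothetical stand-ins.
final class ChunkStepSketch {

    interface Reader<T> { T read(); }                 // returns null when input is exhausted
    interface Processor<I, O> { O process(I item); }  // returns null to filter an item out
    interface Writer<T> { void write(List<T> chunk); }

    /** Reads items one by one, writes them in chunks, returns the number of chunks written. */
    static <I, O> int runStep(Reader<I> reader, Processor<I, O> processor,
                              Writer<O> writer, int chunkSize) {
        int chunksWritten = 0;
        List<O> chunk = new ArrayList<>();
        I item;
        while ((item = reader.read()) != null) {
            O result = processor.process(item);
            if (result != null) {
                chunk.add(result);
            }
            if (chunk.size() == chunkSize) {
                writer.write(new ArrayList<>(chunk));  // hand over a copy, then reuse the buffer
                chunk.clear();
                chunksWritten++;
            }
        }
        if (!chunk.isEmpty()) {                        // flush the final, partial chunk
            writer.write(new ArrayList<>(chunk));
            chunksWritten++;
        }
        return chunksWritten;
    }

    /** Parses numeric strings, drops the malformed one, and collects the written chunks. */
    static List<List<Integer>> demo() {
        Iterator<String> source = Arrays.asList("1", "2", "x", "3", "4", "5").iterator();
        List<List<Integer>> written = new ArrayList<>();
        Reader<String> reader = () -> source.hasNext() ? source.next() : null;
        Processor<String, Integer> parser =
                raw -> raw.matches("\\d+") ? Integer.valueOf(raw) : null;
        Writer<Integer> writer = written::add;
        runStep(reader, parser, writer, 2);
        return written;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the three roles are separate interfaces, each can be swapped or unit-tested on its own, which is the kind of modularity and reuse the list describes.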

Spring Batch has been applied in projects in the following scenarios:

1. Scheduled report generation: producing end-of-day management reports from an operational data store, integrating with a report engine and the source data store.
2. Reference data update: fetching and updating reference data in a reference store periodically. With XML as the data exchange format, XML elements are processed as data items.
3. Partitioned event-based processing: batch jobs are launched on arrival of messages. In a partitioned and staged design, processing is partitioned into stages. This provides resilience and restartability while processing large files.
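
The partitioning idea in the third scenario can be sketched in plain Java with the standard `java.util.concurrent` executor classes. This is a simplified, single-process illustration, not how the projects above were built: the class `PartitionSketch` and the method `processInPartitions` are hypothetical names, and summing integers stands in for real record processing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only; every name here is hypothetical.
final class PartitionSketch {

    /** Splits the records into contiguous slices, processes each slice on its own thread, and combines the results. */
    static long processInPartitions(List<Integer> records, int partitions) {
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        try {
            int sliceSize = (records.size() + partitions - 1) / partitions;  // ceiling division
            List<Future<Long>> results = new ArrayList<>();
            for (int start = 0; start < records.size(); start += sliceSize) {
                List<Integer> slice =
                        records.subList(start, Math.min(start + sliceSize, records.size()));
                Callable<Long> task = () -> {
                    long sum = 0;
                    for (int value : slice) {
                        sum += value;   // placeholder for real per-record work
                    }
                    return sum;
                };
                results.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> result : results) {
                total += result.get();  // wait for each partition to finish
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for partitions", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a partition failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> records = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        System.out.println(processInPartitions(records, 3));
    }
}
```

Each slice is independent, so a failed partition could in principle be rerun on its own; that is the property that makes partitioned designs attractive for large files.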

There are a few cases where one needs to extend the support provided by the framework. It is very flexible: being POJO-based and leveraging capabilities from the Spring family such as AOP, configuration, integration and context management, it is a powerful framework with wide applicability.

We could see the following benefits in using this framework:

1. Reduced project risk: using the right framework for the job reduces the chances of going down the wrong path.
2. Reduced time to delivery, thanks to infrastructure and batch domain abstractions that provide out-of-the-box reuse of capabilities.
3. A large, active community that supports product adoption, enhancement and shared learning.
4. A much-needed standard model for batch processing within the enterprise that promotes reuse, reduces maintenance effort, is extensible and gives the application a longer life.
5. Being open source, it reduces the cost of adoption: there is no licensing cost.

Some of the features that could be added to this framework are:

1. Distributed and remote monitoring and management of batch applications in a centralized fashion.
2. Enhanced support for execution environments such as Hadoop and the SEDA model.
3. Flexibility in applying multi-threaded processing within a Step.
4. Remote Administration.

It has wide applicability, provided we leverage its abstraction model and comply with its rules of usage with respect to scalability and persistence. The following resources are useful for readers:

Batch processing strategies
Scaling with Grid
