Talend Takes on High-Volume Data Integration

MPx suite incorporates MapReduce architecture and parallel processing to handle up to one million records per second.

Sandy Kemsley, Contributor

May 19, 2009


It's not every company that needs to handle data integration at speeds of up to one million records per second. But there are more than a few telcos, financial services firms, retailers, medical researchers and others that handle super-high volumes of data every day. Enter Talend Integration Suite MPx, a new, highly scalable data integration product incorporating MapReduce architecture and massively parallel processing to extract, transform and load vast data sets within tight time constraints.

Based on the open-source Talend Integration Suite, MPx adds two sets of features to support extreme scalability. First, Talend's FileScale technology is said to use a MapReduce architecture to perform arithmetic functions and sort, filter, merge, aggregate and transform data with optimized performance on supported hardware platforms.

"FileScale lets you take advantage of the entire [hardware] stack -- multi-CPU architectures and multicore processors -- to execute extremely fast operations on data sets," says Yves de Montcheuil, Talend's vice president of marketing. "The product also uses MapReduce, which is the technology Google uses to process Internet search queries very rapidly. The sorting, aggregation, calculation and transformation of data [during integration] are not that different than the processing Google does on Web page indexes."
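For readers unfamiliar with the pattern de Montcheuil refers to, the following is a minimal, generic sketch of a MapReduce-style aggregation in Python. It is not Talend's FileScale code, whose internals are not described here; the function names, the (region, amount) record layout and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(records):
    """Map: emit one (key, value) pair per input record."""
    for region, amount in records:
        yield region, amount


def shuffle_phase(pairs):
    """Shuffle: group all emitted values under their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each group to a (total, average) pair."""
    return {
        key: (sum(values), sum(values) / len(values))
        for key, values in groups.items()
    }


def run_job(records):
    """Chain the three phases over an in-memory data set."""
    return reduce_phase(shuffle_phase(map_phase(records)))


# Hypothetical sample records: (region, amount)
sample = [("east", 10.0), ("west", 4.0), ("east", 30.0), ("west", 6.0)]
totals = run_job(sample)
```

In a real MapReduce engine the map and reduce phases would run on many workers at once; here they run in sequence so that the data flow is easy to follow.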

MPx is also said to employ multiple levels of massive parallelization to break down data sets into many parallel-processing streams while also exploiting parallel database loaders.

"Once you've processed the data extremely quickly and need to load into, say, Teradata or Oracle, MPx lets you take advantage of their multithreaded loaders," de Montcheuil explains.
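The idea of breaking a data set into parallel-processing streams can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how MPx is implemented: the `split` and `parallel_sort` helpers are hypothetical, and threads are used purely to keep the example short (CPU-bound work in CPython would normally call for separate processes).

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def split(records, streams):
    """Deal records round-robin into `streams` independent chunks."""
    return [records[i::streams] for i in range(streams)]


def parallel_sort(records, streams=4):
    """Sort each chunk on its own worker, then merge the sorted chunks."""
    chunks = split(records, streams)
    with ThreadPoolExecutor(max_workers=streams) as pool:
        sorted_chunks = list(pool.map(sorted, chunks))
    # heapq.merge lazily combines already-sorted inputs into one sorted stream
    return list(heapq.merge(*sorted_chunks))


result = parallel_sort([5, 3, 9, 1, 7, 2, 8], streams=3)
```

The same split, process and merge shape applies to filtering or aggregation; only the per-chunk function and the final combine step change.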

MPx will compete with high-end data-integration vendors including Ab Initio, with its Co>Operating System, and IBM, which offers DataStage PX. In contrast to MPx, which can process integrations developed on the standard Talend platform, DataStage PX is not compatible with conventional DataStage integration routines, de Montcheuil asserts.

MPx-supported hardware platforms include 32-bit and 64-bit Windows servers, Solaris and OpenSolaris (SPARC and Intel x86), IBM AIX, HP-UX and 32-bit and 64-bit Linux servers. Benchmark tests performed on a high-end but far-from-exotic Sun Blade X6270 server featuring two Xeon 5520 quad-core processors at 2.26 GHz and 24 GB of RAM reportedly yielded impressive performance levels. Sorting, aggregation and averaging speeds ranged from 200,000 to 400,000 records per second when accessing data from disk and up to one million records per second when processing data in memory.

"These speeds were achieved on a single server with a dual CPU, and it was done with standard MPx software that was not fine-tuned for the data processed," de Montcheuil points out.
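To put the reported rates in perspective, the short calculation below estimates how long a batch would take at each benchmark figure. The one-billion-record workload is a hypothetical number chosen for illustration, and the estimate assumes throughput stays constant as volume grows, which the benchmark does not claim.

```python
def processing_time_seconds(record_count, records_per_second):
    """Estimated elapsed time, assuming a constant processing rate."""
    return record_count / records_per_second


# Hypothetical workload; not a figure from the benchmark
RECORDS = 1_000_000_000

# Rates reported for the benchmark, in records per second
rates = {
    "disk (low)": 200_000,
    "disk (high)": 400_000,
    "in memory": 1_000_000,
}

estimates = {
    label: processing_time_seconds(RECORDS, rate)
    for label, rate in rates.items()
}

for label, seconds in estimates.items():
    print(f"{label}: {seconds / 60:.1f} minutes")
```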

There are more than 250,000 users of the vendor's open-source integration software and 500 licensed corporate customers (with thousands of users) of the commercially supported software, according to Talend.

Talend Integration Suite MPx is available immediately and is said to cost about $100,000 for a typical deployment. Pricing depends on the number of users.

About the Author(s)

Sandy Kemsley is a systems architect and analyst who specializes in BPM and Enterprise 2.0.
