5/19/2009
03:40 PM

Talend Takes on High-Volume Data Integration

MPx suite incorporates MapReduce architecture and parallel processing to handle up to one million records per second.

It's not every company that needs to handle data integration at speeds of up to one million records per second. But there are more than a few telcos, financial services firms, retailers, medical researchers and others that handle super-high volumes of data every day. Enter Talend Integration Suite MPx, a new, highly scalable data integration product incorporating MapReduce architecture and massively parallel processing to extract, transform and load vast data sets within tight time constraints.

Based on the open-source Talend Integration Suite, MPx adds two sets of features to support extreme scalability. First, Talend's FileScale technology is said to use a MapReduce architecture to perform arithmetic functions and sort, filter, merge, aggregate and transform data with optimized performance on supported hardware platforms.

"FileScale lets you take advantage of the entire [hardware] stack -- multi-CPU architectures and multicore processors -- to execute extremely fast operations on data sets," says Yves de Montcheuil, Talend's vice president of marketing. "The product also uses MapReduce, which is the technology Google uses to process Internet search queries very rapidly. The sorting, aggregation, calculation and transformation of data [during integration] are not that different than the processing Google does on Web page indexes."
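For readers unfamiliar with the pattern, the map, group and reduce steps behind this kind of aggregation can be sketched in a few lines of Python. This is only an illustration of the general MapReduce idea, not Talend's FileScale implementation; the function names and the sample (region, amount) records are hypothetical.

```python
from collections import defaultdict

def map_phase(records):
    """Emit (key, value) pairs; here the key is a region, the value an amount."""
    for region, amount in records:
        yield region, amount

def shuffle(pairs):
    """Group values by key, as a MapReduce framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate each group into (total, average)."""
    return {key: (sum(vals), sum(vals) / len(vals)) for key, vals in groups.items()}

# Hypothetical sample data, invented for illustration only.
records = [("east", 10.0), ("west", 4.0), ("east", 30.0), ("west", 6.0)]
result = reduce_phase(shuffle(map_phase(records)))

for region in sorted(result):
    print(region, result[region])
```

Because each key's group is reduced independently, the reduce step can in principle be spread across cores or machines, which is what makes the pattern attractive for high-volume integration jobs.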

MPx is also said to employ multiple levels of massive parallelization to break down data sets into many parallel-processing streams while also exploiting parallel database loaders.

"Once you've processed the data extremely quickly and need to load into, say, Teradata or Oracle, MPx lets you take advantage of their multithreaded loaders," de Montcheuil explains.
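The idea of breaking a data set into parallel-processing streams can be sketched as follows. This is a simplified Python illustration under stated assumptions, not MPx code: the stream count, the filter rule and the helper names are all hypothetical, and a thread pool is used only to show the structure (in CPython, threads will not speed up CPU-bound work the way a native multicore engine would).

```python
from concurrent.futures import ThreadPoolExecutor

def partition(records, n_streams):
    """Split records into n_streams slices, round-robin."""
    return [records[i::n_streams] for i in range(n_streams)]

def process_stream(chunk):
    """Per-stream work: filter out negatives, then return (count, total)."""
    kept = [x for x in chunk if x >= 0]
    return len(kept), sum(kept)

def parallel_aggregate(records, n_streams=4):
    """Run each slice through process_stream concurrently and merge the partials."""
    chunks = partition(records, n_streams)
    with ThreadPoolExecutor(max_workers=n_streams) as pool:
        partials = list(pool.map(process_stream, chunks))
    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return count, total

# Hypothetical input: the integers -5 through 99.
data = list(range(-5, 100))
count, total = parallel_aggregate(data)
print(count, total)
```

Each stream produces a small partial result, so the final merge is cheap; the same shape applies when the last step is handing each stream to a database's parallel loader rather than summing numbers.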

MPx will compete with high-end data-integration vendors including Ab Initio, with its Co>Operating System, and IBM, which offers DataStage PX. In contrast to MPx, which can process integrations developed on the standard Talend platform, DataStage PX is not compatible with conventional DataStage integration routines, de Montcheuil asserts.

MPx-supported hardware platforms include 32-bit and 64-bit Windows servers, Solaris and OpenSolaris (SPARC and Intel x86), IBM AIX, HP-UX and 32-bit and 64-bit Linux servers. Benchmark tests performed on a high-end but far-from-exotic Sun Blade X6270 server featuring two Xeon 5520 quad-core processors at 2.26 GHz and 24 GB of RAM reportedly yielded impressive performance levels. Sorting, aggregation and averaging speeds ranged from 200,000 to 400,000 records per second when accessing data from disk and up to one million records per second when processing data in memory.

"These speeds were achieved on a single server with a dual CPU, and it was done with standard MPx software that was not fine-tuned for the data processed," de Montcheuil points out.
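To put the reported rates in perspective, a quick back-of-the-envelope calculation helps. The one-billion-record workload below is a hypothetical figure chosen for illustration; only the records-per-second rates come from the benchmark numbers Talend cites.

```python
def seconds_to_process(n_records, records_per_second):
    """Time in seconds to process n_records at a steady rate."""
    return n_records / records_per_second

# Hypothetical workload size, not from the benchmark.
N_RECORDS = 1_000_000_000

# Rates reported for the benchmark (records per second).
rates = {
    "disk, low end": 200_000,
    "disk, high end": 400_000,
    "in memory": 1_000_000,
}

for label, rate in rates.items():
    minutes = seconds_to_process(N_RECORDS, rate) / 60
    print(f"{label}: {minutes:.1f} minutes")
```

At those rates, a billion records would take roughly 83 minutes at the low end of the disk range, about 42 minutes at the high end, and about 17 minutes in memory, assuming the rates hold steady across the whole job.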

According to Talend, more than 250,000 people use the vendor's open-source integration software, and 500 licensed corporate customers (with thousands of users) run the commercially supported version.

Talend Integration Suite MPx is available immediately and is said to cost about $100,000 for a typical deployment. Pricing depends on the number of users.
