Commentary
5/12/2011
04:57 PM
Doug Henschen

4 Hadoop Helpers Promise Speedy Big-Data Analysis

Integration vendors launch commercial add-on products for the hot open-source framework. Here's how four products streamline high-volume workloads.

Apache Hadoop is one of the fastest-growing open-source projects going, so it's no surprise that commercial vendors are looking for a piece of the action.

Witness a spate of recent announcements from well-known data-integration vendors including Informatica, Pervasive Software, SnapLogic, and Syncsort, all of which are aimed at making it faster and easier to work with a very young big-data processing platform.

To recap, Hadoop is a collection of distributed data-processing components for analyzing large volumes of unstructured data, such as Facebook comments and Twitter tweets, email and instant messages, and security and application logs. Relational databases, such as IBM DB2, Oracle, Microsoft SQL Server, and MySQL, can't handle this data because it doesn't fit neatly into columns and rows.

Even if these commercial databases could do the job, the cost of the licenses would be prohibitive because of the scale of the data. We're generally talking about hundreds of terabytes, and into the petabytes.

Because Hadoop is an open-source project, its software distributions can be downloaded for free, and the software is designed to scale out on low-cost commodity servers. There aren't legions of companies that need Hadoop, but the capabilities and economies have attracted outfits including AOL, eHarmony, eBay, Facebook, JP Morgan Chase, LinkedIn, Netflix, The New York Times, and Twitter.

Hadoop is getting to be a magnet for commercial vendors. Cloudera offers a popular distribution of Hadoop and it's the leading provider of enterprise support and services. Datameer offers supporting data-integration, storage, analytics, and visualization software, and Karmasphere adds a graphical environment for developing, debugging, and monitoring Hadoop jobs.

EMC announced Monday that it will offer its own distributions of Hadoop software: one open-source edition and one commercial enterprise edition that includes proprietary components. As I covered in my last column, EMC also announced an appliance capable of running the EMC Greenplum relational database and Hadoop on a single hardware platform.

Informatica and SnapLogic

Data-integration vendors Informatica and SnapLogic both announced partnerships with EMC this week. Informatica says it will integrate its data-integration platform with the EMC Hadoop distributions, which are set for release in the third quarter. Informatica previously partnered with Cloudera on a similar integration.

Informatica is the largest independent data-integration vendor out there, with more than 4,200 customer firms, so EMC and Cloudera need Informatica every bit as much as Informatica wants big-data-crunching Hadoop users.

SnapLogic announced SnapReduce, a module for the SnapLogic platform that will pipe data into MapReduce, the core Hadoop data-processing framework. SnapLogic will also introduce its own version of the Hadoop Distributed File System (HDFS); that will let Hadoop users pull data from the many sources handled by the SnapLogic platform and move data the other way, too. Both products are expected in the second half of this year.
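For readers new to MapReduce, the idea can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle, and reduce steps running on one machine; it does not use Hadoop or any SnapLogic API, and the function names and sample log lines are made up for the example.

```python
from collections import defaultdict

# Made-up sample of loosely structured log lines (illustrative only).
SAMPLE_LOG = [
    "2011-05-12 ERROR disk full",
    "2011-05-12 INFO job started",
    "2011-05-12 ERROR timeout",
    "malformed",
]


def map_phase(line):
    """Map step: emit a (log_level, 1) pair for each usable line."""
    parts = line.split()
    if len(parts) >= 2:  # skip lines that do not carry a level field
        yield parts[1], 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on a single machine."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(SAMPLE_LOG))
```

In a real Hadoop cluster the map and reduce steps run in parallel across many commodity servers, with the framework handling the grouping in between; the sketch above only shows the shape of the computation.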

I've previously reported on Hadoop-supporting tools from Talend, an open-source data-integration vendor, and from Quest Software. Most integration partnerships are aimed at making it easier to get data into and out of Hadoop. In the case of Syncsort and Pervasive, commercial add-on products are aimed at speeding processing within Hadoop.
