Integration vendors launch commercial add-on products for the hot open-source framework. Here's how four products streamline high-volume workloads.
Apache Hadoop is one of the fastest-growing open-source projects going, so it's no surprise that commercial vendors are looking for a piece of the action.
Witness a spate of recent announcements from well-known data-integration vendors including Informatica, Pervasive Software, SnapLogic, and Syncsort, all aimed at making it faster and easier to work with this very young big-data processing platform.
To recap, Hadoop is a collection of distributed data-processing components for analyzing large volumes of unstructured data, such as Facebook comments and tweets, email and instant messages, and security and application logs. Relational databases, such as IBM DB2, Oracle, Microsoft SQL Server, and MySQL, can't handle this data because it doesn't fit neatly into columns and rows.
Even if these commercial databases could do the job, the cost of the licenses would be prohibitive because of the scale of the data. We're generally talking about hundreds of terabytes, and into the petabytes.
Because Hadoop is an open-source project, its software distributions can be downloaded for free, and the software is designed to scale out on low-cost commodity servers. There aren't legions of companies that need Hadoop, but its capabilities and economics have attracted outfits including AOL, eHarmony, eBay, Facebook, JPMorgan Chase, LinkedIn, Netflix, The New York Times, and Twitter.
Hadoop is becoming a magnet for commercial vendors. Cloudera offers a popular distribution of Hadoop and is the leading provider of enterprise support and services. Datameer offers supporting data-integration, storage, analytics, and visualization software, and Karmasphere adds a graphical environment for developing, debugging, and monitoring Hadoop jobs.
EMC announced Monday that it will offer its own distributions of Hadoop software: one open-source edition and one commercial enterprise edition that includes proprietary components. As I covered in my last column, EMC also announced an appliance capable of running the EMC Greenplum relational database and Hadoop on a single hardware platform.
Informatica and SnapLogic
Data-integration vendors Informatica and SnapLogic both announced partnerships with EMC this week. Informatica says it will integrate its data-integration platform with the EMC Hadoop distributions, which are set for release in the third quarter. Informatica previously partnered with Cloudera on a similar integration.
Informatica is the largest independent data-integration vendor out there, with more than 4,200 customer firms, so EMC and Cloudera need Informatica every bit as much as Informatica wants big-data-crunching Hadoop users.
SnapLogic announced SnapReduce, a module for the SnapLogic platform that will pipe data into MapReduce, the core Hadoop data-processing framework. SnapLogic will also introduce its own version of the Hadoop Distributed File System (HDFS), which will let Hadoop users pull data from the many sources handled by the SnapLogic platform and push data the other way, too. Both products are expected in the second half of this year.
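For readers who haven't seen MapReduce up close, here is a minimal sketch of the idea in plain Python. It is not Hadoop's API and has nothing to do with SnapReduce; it only imitates the map, shuffle, and reduce steps in a single process. The log layout ("timestamp LEVEL message"), the sample lines, and every function name are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical application-log lines: "<timestamp> <LEVEL> <message>".
SAMPLE_LOG = [
    "2011-05-09T10:00:01 ERROR disk quota exceeded",
    "2011-05-09T10:00:02 INFO job started",
    "2011-05-09T10:00:03 ERROR connection reset",
    "2011-05-09T10:00:04 WARN slow response",
    "",  # blank lines are skipped by the mapper
]


def map_phase(line):
    """Map step: emit a (log_level, 1) pair for one raw log line."""
    parts = line.split()
    if len(parts) >= 2:
        yield parts[1], 1


def shuffle(pairs):
    """Shuffle step: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse one key's values into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in one process (a cluster would spread them across nodes)."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(SAMPLE_LOG))
```

On a real cluster the map and reduce steps run in parallel on many commodity servers, with the input split into blocks stored in HDFS; the sketch keeps everything in memory purely to show the data flow.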
I've previously reported on Hadoop-supporting tools from Talend, an open-source data-integration vendor, and from Quest Software. Most integration partnerships are aimed at making it easier to get data into and out of Hadoop. In the case of Syncsort and Pervasive, commercial add-on products are aimed at speeding processing within Hadoop.