Hadoop Spurs Big Data Revolution

Open source data processing platform has won over Web giants for its low cost, scalability, and flexibility. Now Hadoop will make its way into more enterprises.

What's Ahead?

Companies already using Hadoop invariably have bigger plans. AOL is moving critical applications to its 700-node production environment, which is described as a highly reliable and controlled deployment, providing data down to granular levels of detail. The 300-node R&D environment is where many of the company's most advanced Ph.D. analytics experts work on cutting-edge projects. Cloudera provides the enterprise support for both deployments, helping AOL with bug fixes, software upgrades, and service problems.

At ComScore, it will be several months before Hadoop can scale up and replace its data processing grid, Brown says. That move was delayed in part because ComScore switched from Cloudera's Hadoop distribution to MapR's, which ComScore licensed through EMC Greenplum. MapR's version of Hadoop will let ComScore switch from HDFS to the more mature and widely used Network File System. NFS will enable the company to easily move data back and forth among Hadoop, Sybase IQ, and other data sources and systems, something it couldn't do with HDFS, Brown says.
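
To illustrate why an NFS interface makes that kind of data movement simple: once cluster storage is mounted like any other directory, ordinary file operations are enough. The following is a minimal sketch, not ComScore's actual process; the function name and any paths passed to it are hypothetical.

```python
import shutil
from pathlib import Path


def export_to_mount(source_file: str, mount_dir: str) -> Path:
    """Copy a local file into a directory that is assumed to be an NFS mount.

    Nothing here is Hadoop-specific: because NFS presents remote storage as
    a normal directory, the standard library's file copy is sufficient.
    """
    target_dir = Path(mount_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    # copy2 preserves timestamps along with the file contents
    return Path(shutil.copy2(source_file, target_dir / Path(source_file).name))
```

With HDFS, by contrast, the same step would typically go through Hadoop's own client tools or APIs rather than a plain file copy.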

EMC and partner MapR introduced new Hadoop software and support options this spring, as did IBM with its BigInsights offering. IBM partner Karmasphere, which provides Hadoop development and analytics tools, recently introduced a virtual appliance for BigInsights, designed to speed development of MapReduce jobs and related analytics projects. Microsoft has promised a Windows Server-friendly distribution of Hadoop supported by Yahoo spin-off Hortonworks, another enterprise-focused Hadoop tools and support provider. It's a safe bet that Oracle, too, will find ways to differentiate its Hadoop offering beyond the promised delivery of the Oracle Big Data Appliance.

Only the largest vendors have had the chutzpah to announce their own Hadoop software distributions and support plans. But dozens of others have added integrations and support tools, so they can move data into and out of Hadoop and analyze data sets after they're boiled down by MapReduce processing. That list includes data warehouse vendors Hewlett-Packard, ParAccel, and Teradata; data integration vendors Informatica, Pervasive, Talend, and Syncsort; and business intelligence and analytics vendors Jaspersoft, Pentaho, and SAS.

The latest wave of Hadoop announcements is coming from application developers and service providers. Amazon has offered a Hadoop-based service on its Elastic Compute Cloud since 2009. IBM launched a BigInsights service on its SmartCloud Enterprise platform in October. And Microsoft is promising a beta Hadoop-based service on the SQL Azure cloud platform by year's end.

Hadoop's Many Pieces

Hadoop Subprojects
Hadoop Common: Common utilities that support the other Hadoop subprojects
Hadoop Distributed File System: Distributed file system that provides high-throughput access to application data
Hadoop MapReduce: Software framework for distributed processing of large data sets on compute clusters

Other Hadoop-Related Apache Projects
Chukwa: Data-collection system for managing large distributed systems
HBase: Scalable, distributed database that supports structured data storage for large tables
Hive: Data warehouse infrastructure that provides data summarization and ad hoc querying
Mahout: Scalable machine learning and data mining library
Pig: High-level data-flow language and execution framework for parallel computing
ZooKeeper: High-performance coordination service for distributed applications

Data: Apache Software Foundation

SunGard plans to launch a Hadoop-based managed service that will let customers run MapReduce jobs. No word on when, but CTO Indu Kodukula says the company will run MapR software on EMC Greenplum's modular appliance. It will aim the service at customers that expect to operate 100 TB or more of data but aren't ready to commit to building out their own infrastructure to support Hadoop.

"Most of the requests that we've received to support Hadoop come from large financial customers that have an enormous amount of data and interest in blending in external sources, but they don't entirely know whether the results are going to be meaningful," Kodukula says. Rather than spending first and risking failure, they'd rather experiment with a managed service, he says.

On the apps front, Tidemark introduced an innovative cloud-based performance management application in October built on an "elastic computation grid based on in-memory technology coupled with Hadoop MapReduce processing." That's a mouthful, but it's simpler than it sounds. The in-memory technology is used for the fast analyses you expect in a performance management app (think Cognos TM1, QlikTech, SAP Hana, and Tibco Spotfire-style financial analyses delivered via the cloud). The Hadoop MapReduce part speeds answers to big data problems and blends mixed data types that might not conform to a fixed schema.
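
As a rough illustration of the MapReduce half of that description, the sketch below runs the map, shuffle, and reduce steps in a single Python process over records that do not share a schema. It is not Tidemark's implementation or Hadoop's API; the record fields, function names, and sample values are all made up for the example.

```python
from collections import defaultdict


def map_record(record):
    """Emit (region, (field, value)) pairs from records of differing shapes."""
    region = record.get("region")
    if region is None:
        return  # skip records that cannot be keyed
    for field, value in record.items():
        if field != "region":
            yield region, (field, value)


def reduce_region(region, pairs):
    """Merge every field seen for one region into a single blended record."""
    blended = {"region": region}
    for field, value in pairs:
        blended[field] = value
    return blended


def run_job(records):
    # Shuffle step: group mapper output by key, as a MapReduce framework
    # would do between the map and reduce phases.
    grouped = defaultdict(list)
    for record in records:
        for key, pair in map_record(record):
            grouped[key].append(pair)
    return [reduce_region(key, pairs) for key, pairs in sorted(grouped.items())]


# Made-up sample data: two sources with different fields, no shared schema.
records = [
    {"region": "north", "rainfall_mm": 120},
    {"region": "north", "acres_planted": 400, "seed": "A"},
    {"region": "south", "rainfall_mm": 95},
]
blended = run_job(records)
```

The mapper never checks for a fixed set of columns; it keys each record and passes along whatever fields are present. On a real cluster the map and reduce calls would run in parallel across many nodes.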

Tidemark customer U.S. Sugar, for example, is mixing weather data with the information it gets from growers related to seeds, chemical treatments, and acres planted to better understand and predict crop production. And Acosta, a marketing services firm that works with consumer products companies, is analyzing consumer sentiments expressed in social media to do a better job of stocking products in support of marketing campaigns.

All this support for Hadoop will naturally encourage broader experimentation and is likely to boost adoption. According to a recent InformationWeek survey of 431 business technology professionals involved with information management tools, only about 3% have made extensive use of Hadoop or other NoSQL platforms, while 11% have made limited use of them (see chart, below). With all the hype around Hadoop, those figures should begin to rise.

Chart: Limited Hadoop Use, So Far

It may be that we're at the apex of Gartner's hype cycle, so beware the trough of disillusionment in the months ahead. For one thing, expect a cacophony of confusing commercial messages. Customer success stories and emerging applications will be the best way to gauge Hadoop's progress.

Once Hadoop is proven and mission critical, as it is at AOL, its use will be as routine and accepted as SQL and relational databases are today. It's the right tool for the job when scalability, flexibility, and affordability really matter. That's what all the Hadoopla is about.
