East Coast event highlights growing, mainstream adoption of open-source software designed for terabyte- to petabyte-scale data processing.

Doug Henschen, Executive Editor, Enterprise Apps

October 5, 2009

"Our storage footprint tripled between 2007 and 2009... so why wouldn't we consider Hadoop?"

This testimony, shared by Sih Lee of JP Morgan Chase, pretty much sums up the running theme at last week's Hadoop World New York City. We're entering a petabyte era, so organizations of all kinds are looking for new ways to handle 'big data' processing challenges. (See the influencers and read what they're saying in our accompanying "Hadoop World NYC Image Gallery.")

Hadoop is an open-source software project originally based on the MapReduce processing principles articulated in a Google white paper published in 2004. The project has since flourished and expanded beyond MapReduce to add subprojects, including the Hadoop Distributed File System (HDFS); the Pig data flow language; the HBase distributed, column-oriented database; and the Hive distributed data warehouse.
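For readers new to the MapReduce principles mentioned above, the model can be illustrated with a small word-count sketch. This is a single-process illustration in plain Python, not Hadoop's own API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for this example. In a real cluster, the map and reduce steps would run in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle and reduce in sequence (illustrative, in memory)."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big", "Data clusters"]))
```

Because each record is mapped independently and each key is reduced independently, the work can in principle be spread across as many commodity machines as the data requires.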

Web-based companies have led Hadoop adoption, and Yahoo!, Amazon, Facebook and eHarmony executives were on hand at Hadoop World NYC to extol the software's virtues and share details of their deployments. The key point of the event, however, was to highlight and encourage mainstream adoption.

"Hadoop is now everywhere and it's not just for Web companies, it's for all types of companies," stressed Christophe Bisciglia, founder of Cloudera, the Hadoop-focused professional services firm that organized the event.

The testimony of JP Morgan Chase’s Lee helped prove Bisciglia's point about mainstream corporate adoption. Lee, a vice president responsible for "Firmwide Innovation & Shared Services Strategy," said the firm has been exploring Hadoop for more than 18 months. It now has several proof-of-concept projects in the pipeline, seeking cost efficiencies over conventional technologies such as storage area networks, network-attached storage and symmetric multiprocessor hardware.

"Hadoop gives us a cost proposition that is an order of magnitude more cost efficient than some of the competing technologies," he said. "Another driver for considering Hadoop is choice... Having a single-vendor technology lock-in does not help us form a sound strategy overall. The ability to embrace a new technology such as Hadoop gives us another option from which to make sound decisions and choices."

Lee positioned MapReduce and the Hadoop Distributed File System generically as an alternative for petabyte-scale, relatively high-latency data processing, though he declined to detail specific applications at the financial services firm. Facebook was more forthcoming, describing its Hive-based data warehouse implementation in detail, and eHarmony discussed the advantages of cloud-based MapReduce processing in preparation for internal data warehouse analysis.

Cloudera describes Hadoop as a complement to, rather than a replacement for, existing systems: "Hadoop is not a database nor does it need to replace any existing data systems you may have. Hadoop augments these systems by offloading the particularly difficult problem of simultaneously ingesting, processing and delivering/exporting large volumes of data so existing systems can focus on what they were designed to do, whether that is serving real-time transactional data or providing interactive business intelligence."

Many Hadoop instances (and certainly most of the largest-scale Hadoop instances) are homegrown deployments built on commodity hardware. A few commercial vendors have embraced Hadoop. Aster Data Systems, for instance, supports both SQL- and Hadoop-based MapReduce, and last week it introduced a connector for separate Hadoop instances (built on Aster or other platforms). Vertica also has a connector for Hadoop-based MapReduce implementations.

Amazon has brought Hadoop-based MapReduce to the cloud through its Elastic MapReduce Web service on EC2, and last week it added support for the Hadoop Hive distributed data warehouse.

Judging by the strong turnout at the event, with some 500 developers and advocates in attendance, it's clear that Hadoop is part of a disruptive wave of technologies emerging for big data problems, and mainframes, conventional storage systems and proprietary data management software will bear the brunt of the impact.

About the Author(s)

Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of Transform Magazine, and Executive Editor at DM News. He has covered IT and data-driven marketing for more than 15 years.
