10/5/2009
01:20 PM

Hadoop World NYC Highlights Budding Alternative for Big Data

East Coast event highlights growing, mainstream adoption of open-source software designed for terabyte- to petabyte-scale data processing.

"Our storage footprint tripled between 2007 and 2009... so why wouldn't we consider Hadoop?"

This testimony, shared by Sih Lee of JP Morgan Chase, pretty much sums up the running theme at last week's Hadoop World New York City. We're entering a petabyte era, so organizations of all kinds are looking for new alternatives to handle 'big data' processing challenges. (See the influencers and read what they're saying in our accompanying "Hadoop World NYC Image Gallery.")

Hadoop is an open-source software project that was originally based on MapReduce processing principles articulated in a Google white paper published in 2004. The project has since flourished and expanded beyond MapReduce to add subprojects, including the Hadoop Distributed File System (HDFS); Pig data flow language; the HBase distributed, column-oriented database; and the Hive distributed data warehouse.
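The MapReduce model at Hadoop's core can be sketched in a few lines. What follows is a minimal, single-machine illustration of the map, shuffle and reduce phases described in the 2004 paper, using the canonical word-count example. The function names here are ours for illustration, not Hadoop API calls; Hadoop's contribution is running these same phases in parallel across a cluster, with HDFS holding the input and output.

```python
from collections import defaultdict

def map_phase(record):
    # Map: turn one input record into zero or more (key, value) pairs.
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key (in Hadoop, this step moves
    # data across the network between mapper and reducer nodes).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: fold each key's values into a single result.
    return (key, sum(values))

def mapreduce(records):
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
```

Running `mapreduce(["big data big", "data"])` yields `{"big": 2, "data": 2}`; the same three-phase structure scales from this toy input to the petabyte workloads discussed at the event.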

Web-based companies have led Hadoop adoption, and Yahoo!, Amazon, Facebook and eHarmony executives were on hand at Hadoop World NYC to extol the software's virtues and share details of their deployments. The key point of the event, however, was to highlight and encourage mainstream adoption.

"Hadoop is now everywhere and it's not just for Web companies, it's for all types of companies," stressed Christophe Bisciglia, founder of Cloudera, the Hadoop-focused professional services firm that organized the event.

The testimony of JP Morgan Chase's Lee helped prove Bisciglia's point about mainstream corporate adoption. Lee, a vice president responsible for "Firmwide Innovation & Shared Services Strategy," said the firm has been exploring Hadoop for more than 18 months. It now has several proof-of-concept projects in the pipeline, seeking cost efficiencies over conventional technologies such as storage area networks, network-attached storage and symmetric multiprocessor hardware.

"Hadoop gives us a cost proposition that is an order of magnitude more cost efficient than some of the competing technologies," he said. "Another driver for considering Hadoop is choice... Having a single-vendor technology lock-in does not help us form a sound strategy overall. The ability to embrace a new technology such as Hadoop gives us another option from which to make sound decisions and choices."

Lee positioned MapReduce and the Hadoop Distributed File System generically as an alternative for petabyte-scale, relatively high-latency data processing, though he declined to detail specific applications at the financial services firm. Offering much more information, Facebook described its Hive-based data warehouse implementation in detail and eHarmony discussed the advantages of cloud-based MapReduce processing in preparation for internal data warehouse analysis.

Cloudera describes Hadoop as a complement to, rather than a replacement of, existing systems:

Hadoop is not a database nor does it need to replace any existing data systems you may have. Hadoop augments these systems by offloading the particularly difficult problem of simultaneously ingesting, processing and delivering/exporting large volumes of data so existing systems can focus on what they were designed to do, whether that is serving real-time transactional data or providing interactive business intelligence.
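One common way to wire scripting languages into the batch-offload pattern Cloudera describes is Hadoop Streaming, which runs any executable as a mapper or reducer over tab-separated key/value lines on stdin and stdout, sorting mapper output by key before the reduce step. The sketch below is hypothetical (the log format and field names are our assumptions, not from any deployment discussed at the event): it rolls raw event logs up into daily totals that a downstream transactional or BI system could then serve interactively.

```python
def mapper(lines):
    # Each raw log line is assumed to look like "<date> <event>".
    # Emit one "<date>\t1" pair per event, per Streaming convention.
    for line in lines:
        date, _, _event = line.strip().partition(" ")
        yield f"{date}\t1"

def reducer(sorted_lines):
    # Hadoop delivers mapper output sorted by key, so each key's
    # values arrive as a contiguous run; sum each run.
    current, total = None, 0
    for line in sorted_lines:
        key, _, count = line.rstrip("\n").partition("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"
```

In an actual cluster these two functions would be separate executables handed to the hadoop-streaming jar via its -mapper and -reducer options; they are written here as plain generators so the pipeline can be exercised locally by chaining `reducer(sorted(mapper(lines)))`.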

Many Hadoop instances (and certainly most of the largest) are homegrown deployments built on commodity hardware. A few commercial vendors have embraced Hadoop, however. Aster Data Systems, for instance, supports both SQL- and Hadoop-based MapReduce, and last week it introduced a connector for separate Hadoop instances (built on Aster or other platforms). Vertica also offers a connector for Hadoop-based MapReduce implementations.

Amazon has brought Hadoop-based MapReduce to the cloud through its Elastic MapReduce Web service on EC2, and last week it added support for the Hadoop Hive distributed data warehouse.

Judging by the event's strong turnout, some 500 developers and advocates, it's clear that Hadoop is part of a disruptive wave of technologies emerging for big data problems, and that mainframes, conventional storage systems and proprietary data management software will bear the brunt of the impact.
