Commentary
10/2/2009
04:31 PM
Doug Henschen

Hadoop and the Big-Data Revolution

There's a revolution underway in the use of big data, and Hadoop, the open-source distributed computing system, is at the center of it. Apache Hadoop is most often associated with MapReduce data processing, but it also includes a distributed file system and subprojects including the Hive data warehouse. All of the above were the subject of success stories, accolades and palpable excitement at today's Hadoop World in New York City. Executives from Yahoo!, Facebook, eHarmony, IBM and JP Morgan Chase were here offering insight into how Hadoop is changing expectations for analysis of big data.

Here are a few highlights from today's presentations on what these organizations are doing with Hadoop:

  • Yahoo!, by far the largest developer of and contributor to Hadoop, uses it to analyze and improve content optimization, spam filtering, search indexing and ad optimization. Yahoo! has a 4,000-node cluster with 16 petabytes of disk space available for Hadoop analysis, and it has used this infrastructure to sort 1 petabyte of data in 16 hours (across 3,700 nodes) and 1 terabyte of data in 62 seconds (across 1,500 nodes).

  • Facebook is using Hadoop to help analyze the 4 terabytes of compressed new data added to the social networking site each day. Facebook's Hive-based data warehouse runs 7,500 jobs per day for a total of more than 80,000 compute hours. Reporting is a key task, with daily and weekly aggregations of impressions and click counts across the site. Results are reported and explored through MicroStrategy dashboards.

  • eHarmony, the online dating service, is using Hadoop processing and the Hive data warehouse to better understand and more accurately match people among its 20 million registered users.

  • IBM's Emerging Technologies unit has used Hadoop for an experimental mergers-and-acquisitions due-diligence engine. The project compared 1.4 million patent records against fourteen years' worth of Court of Appeals records to spot legal challenges on intellectual property ownership. IBM said the engine has performed in 5 minutes what would otherwise take teams of legal researchers a week to compile.

  • JP Morgan Chase described proof-of-concept data warehousing projects that are pursuing "order of magnitude savings" using open-source Hadoop and commodity hardware rather than conventional relational databases and SMP hardware.
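
To put the Yahoo! sort figures above on a common footing, here is a rough back-of-the-envelope sketch in Python that converts each run into aggregate and per-node throughput. It assumes decimal units (1 terabyte = 10^12 bytes, 1 petabyte = 10^15 bytes) and treats the reported times as exact; the article states neither, and the function name `sort_throughput` is purely illustrative.

```python
# Assumption: decimal units. The article does not say whether
# "terabyte" and "petabyte" are decimal or binary.
TB = 10**12
PB = 10**15


def sort_throughput(total_bytes, seconds, nodes):
    """Return (aggregate, per-node) sort throughput in bytes per second."""
    aggregate = total_bytes / seconds
    return aggregate, aggregate / nodes


# Figures as reported: 1 PB in 16 hours on 3,700 nodes,
# and 1 TB in 62 seconds on 1,500 nodes.
petabyte_run = sort_throughput(PB, 16 * 3600, 3700)
terabyte_run = sort_throughput(TB, 62, 1500)

for label, (aggregate, per_node) in [("1 PB", petabyte_run), ("1 TB", terabyte_run)]:
    print(f"{label}: {aggregate / 1e9:.1f} GB/s aggregate, "
          f"{per_node / 1e6:.1f} MB/s per node")
```

Under those assumptions, both runs work out to roughly 16 to 17 gigabytes per second in aggregate, with the smaller, shorter run sustaining about twice the per-node rate of the petabyte sort.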

The Hadoop World event was presented by Cloudera, a software and professional services firm focused exclusively on Hadoop. The firm announced Cloudera Desktop, a new Web-based, friendlier (though still programmer-oriented) interface for Hadoop applications. The Desktop can be used with on-premises implementations of Hadoop or cloud-based instances hosted on Amazon EC2. Amazon executives were also on hand today to discuss use of Amazon Elastic MapReduce, which is a Web services-based implementation built on the Hadoop framework. Amazon announced a partnership whereby customers can specify Cloudera instances within Amazon Elastic MapReduce in order to secure that vendor's professional services and support.

Cloudera founder Christophe Bisciglia opened the day saying that Hadoop is fast becoming pervasive and an increasingly obvious choice not just for Web companies but for all types of companies with big-data challenges and opportunities. Judging by the enthusiasm and numbers of attendees here today (surpassing 500), the big-data revolution has swept out of Silicon Valley and is reaching mainstream corporate data centers.