Software // Information Management
10:05 PM
Connect Directly

Hadoop Crunches Web-Sized Data

To parse large volumes of data from Web servers, Yahoo and others turn to Hadoop's open source cloud-based analysis system.

As the World Wide Web has exploded into millions of sites and billions of documents, the search engines that purport to know about everything on the Web have faced a gargantuan task. Sure, more spiders can be activated to crawl the Web and collect information. But what system can analyze all the data before the information is out of date?

The answer is a cluster-based analysis system, sometimes referred to loosely as a cloud database system. At the Cloud Computing Conference and Expo Nov. 3 in Santa Clara, Calif., representatives of Yahoo explained how they use Hadoop open source software, from the Apache Software Foundation, to analyze the Web.

Hadoop is a system that can be applied to "big data," or masses of data collected from the Web, such as the crawls that lead to the search indexes. Eric Baldeschwieler, VP of Hadoop software development, leads the largest known active Hadoop development team and said Yahoo is the world's largest Hadoop user. It uses Hadoop on clusters of 4,000 computers to analyze up to 92 petabytes of data stored on disks.

Hadoop builds Yahoo's indexes of the Web that power the Yahoo search engine. Its Web mapping system "runs in 73 hours, taking as input, data from all the Web pages in the world," he said. Yahoo's digest of Web pages consists of 300 terabytes of data. Hadoop analysis tells Yahoo's ad system what ads to serve to visitors, based on their profile from searches they've conducted on the site.

It's use of Hadoop keeps it running on a total of 25,000 servers at the company, he said. Yahoo distributes its tested, production version of Hadoop for free, Baldeschweiler said.

Another speaker at the conference was Christophe Brisciglia, a former Google engineer and now part of the founding team at Cloudera, a firm that is producing a supported enterprise distribution of Hadoop. "Cloudera is to Hadoop as Red Hat is to Linux," he said.

Brisciglia described Hadoop as "a batch data processing system" for use on clusters of commodity hardware. Unlike relational database, "in Hadoop there is no structure (to the data). You can dump incredibly large amounts of data into a Hadoop cluster and figure out what to do with it later."

1 of 2
Comment  | 
Print  | 
More Insights
The Agile Archive
The Agile Archive
When it comes to managing data, donít look at backup and archiving systems as burdens and cost centers. A well-designed archive can enhance data protection and restores, ease search and e-discovery efforts, and save money by intelligently moving data from expensive primary storage systems.
Register for InformationWeek Newsletters
White Papers
Current Issue
InformationWeek Elite 100 Digital Issue, April 2015
The 27th annual ranking of the leading US users of business technology
Twitter Feed
InformationWeek Radio
Archived InformationWeek Radio
Join us for a roundup of the top stories on for the week of April 19, 2015.
Sponsored Live Streaming Video
Everything You've Been Told About Mobility Is Wrong
Attend this video symposium with Sean Wisdom, Global Director of Mobility Solutions, and learn about how you can harness powerful new products to mobilize your business potential.