As the World Wide Web has exploded into millions of sites and billions of documents, the search engines that purport to know about everything on the Web have faced a gargantuan task. Sure, more spiders can be activated to crawl the Web and collect information. But what system can analyze all the data before the information is out of date?
The answer is a cluster-based analysis system, sometimes referred to loosely as a cloud database system. At the Cloud Computing Conference and Expo Nov. 3 in Santa Clara, Calif., representatives of Yahoo explained how they use Hadoop open source software, from the Apache Software Foundation, to analyze the Web.
Hadoop is a system that can be applied to "big data," or masses of data collected from the Web, such as the crawls that lead to the search indexes. Eric Baldeschwieler, VP of Hadoop software development, leads the largest known active Hadoop development team and said Yahoo is the world's largest Hadoop user. The company runs Hadoop on clusters of 4,000 computers to analyze up to 92 petabytes of data stored on disks.
Hadoop builds Yahoo's indexes of the Web that power the Yahoo search engine. Its Web mapping system "runs in 73 hours, taking as input, data from all the Web pages in the world," he said. Yahoo's digest of Web pages consists of 300 terabytes of data. Hadoop analysis also tells Yahoo's ad system which ads to serve to visitors, based on profiles built from the searches they have conducted on the site.
Yahoo's use of Hadoop has it running on a total of 25,000 servers at the company, he said. Yahoo distributes its tested, production version of Hadoop for free, Baldeschwieler said.
Another speaker at the conference was Christophe Brisciglia, a former Google engineer and now part of the founding team at Cloudera, a firm that is producing a supported enterprise distribution of Hadoop. "Cloudera is to Hadoop as Red Hat is to Linux," he said.
Brisciglia described Hadoop as "a batch data processing system" for use on clusters of commodity hardware. Unlike a relational database, "in Hadoop there is no structure (to the data). You can dump incredibly large amounts of data into a Hadoop cluster and figure out what to do with it later."
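To make the "figure out what to do with it later" idea concrete, here is a minimal sketch in plain Python of the map, shuffle and reduce pattern that a batch job of this kind follows. It is an illustration only, not Hadoop's API and not Yahoo's or Cloudera's code: the sample log lines, the four-field layout, and the function names map_line, shuffle, reduce_counts and run_job are all assumptions made up for the example, and everything runs in one process rather than across a cluster.

```python
from collections import defaultdict

# Hypothetical raw records, stored as-is with no schema declared up front.
RAW_LINES = [
    "2008-11-03 GET /search?q=hadoop 200",
    "malformed line",
    "2008-11-03 GET /search?q=cloud 200",
    "2008-11-03 GET /mail 404",
    "2008-11-03 GET /search?q=hadoop 200",
]


def map_line(line):
    """Map step: impose a structure at read time and emit (key, 1) pairs."""
    fields = line.split()
    if len(fields) != 4:
        return  # skip records that do not fit the layout chosen for this job
    _date, _method, path, _status = fields
    # Use the path without its query string as the grouping key.
    yield path.split("?")[0], 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run the whole batch in one process: map, then shuffle, then reduce."""
    pairs = (pair for line in lines for pair in map_line(line))
    grouped = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(RAW_LINES))
```

The point of the sketch is that the raw lines are stored untouched and the structure lives in map_line. A different question about the same data would need only a different map function, not a reload of the data into a new schema.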