Several years ago, Google and Yahoo fought for bragging rights about which company had the biggest Web index. Google put an end to that game in 2005 when it declared that its index was three times larger than Yahoo's. After that, the debate shifted to search relevance.
"This is the first time, that we know of, that someone has put a one petabyte-plus database into production," said Waqar Hasan, VP of data at Yahoo. "We have built it to scale to tens of petabytes and we intend to get there. Come 2009, we'll be at multiple tens of petabytes."
A petabyte equals one thousand terabytes, one million gigabytes, or one billion megabytes. It's an uncommon enough measurement that the word "petabyte" is not yet recognized by Microsoft Word 2007's spell checker.
"The amount of data that we get is much more than the traditional industry and even in the Internet space is significantly more than other large players," said Hasan. The reason, he explained, is that consumers spend twice as long on Yahoo as they do on Google and three times as long as they do on Microsoft's sites. (This, in part, explains Microsoft's interest in acquiring Yahoo.)
The data Yahoo gathers is structured, as opposed to unstructured data such as e-mail and other documents. "It's about how people use our Web site, both from the advertising perspective and from the consumer experience perspective," said Hasan.
Yahoo uses this data to deliver what it hopes will be the best possible experience for its consumers, through personalization, and the most profitable experience for its advertisers, through ad targeting. "Fundamentally, what this is enabling is what we call deep analytics," said Hasan. "Doing deep analytics with a low entry barrier is really what this technology enables."
Yahoo's database is built out of commodity Intel boxes, strung together in large clusters. "The classic industry approach has been to go for big SMP [symmetric multiprocessing] boxes," Hasan explained. "We started from the ground up with the premise that all you get to use is commodity hardware and you get to take lots of little boxes and put them together."
Yahoo's database technology came out of work begun at Mahat Technologies, a Seattle-based start-up that Yahoo quietly acquired in November 2005 for an undisclosed sum.
Yahoo started with the PostgreSQL engine and replaced the query processing layer with code designed for its commodity hardware cluster.