Yahoo Claims Record With Petabyte Database - InformationWeek


08:04 PM


Yahoo claims it has the largest SQL database in a production environment and that it will grow larger.

Several years ago, Google and Yahoo fought for bragging rights about which company had the biggest Web index. Google put an end to that game in 2005 when it declared that its index was three times larger than Yahoo's. After that, the debate shifted to search relevance.

Yahoo is now seeking recognition for a different accomplishment: the embattled search company and community portal claims that it has the largest SQL database in a production environment.

"This is the first time, that we know of, that someone has put a one petabyte-plus database into production," said Waqar Hasan, VP of data at Yahoo. "We have built it to scale to tens of petabytes and we intend to get there. Come 2009, we'll be at multiple tens of petabytes."

A petabyte equals one thousand terabytes, one million gigabytes, or one billion megabytes. It's an uncommon enough measurement that the word "petabyte" is not yet recognized by Microsoft Word 2007's spell checker.
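For readers who want to check the arithmetic, here is a minimal sketch of the unit conversion. It assumes decimal (SI) units, where each step up is a factor of 1,000; the names UNITS and convert are illustrative and do not come from Yahoo.

```python
# Decimal (SI) storage units, expressed in bytes.
# Assumption: 1,000-based units, not the 1,024-based binary variants.
UNITS = {
    "megabyte": 10**6,
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
}


def convert(amount, from_unit, to_unit):
    """Convert a whole-number amount from one decimal storage unit to another."""
    return amount * UNITS[from_unit] // UNITS[to_unit]


if __name__ == "__main__":
    # Print one petabyte expressed in each smaller unit.
    for unit in ("terabyte", "gigabyte", "megabyte"):
        print(f"1 petabyte = {convert(1, 'petabyte', unit):,} {unit}s")
```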

"The amount of data that we get is much more than the traditional industry and even in the Internet space is significantly more than other large players," said Hasan. The reason for this, he explained, is that consumers spend twice as long on Yahoo as they do at Google and three times as long on Yahoo as they do at Microsoft's sites. (This, in part, explains Microsoft's interest in acquiring Yahoo.)

The data Yahoo gathers is structured data, as opposed to unstructured data like e-mail and other documents. "It's about how people use our Web site, both from the advertising perspective and from the consumer experience perspective," said Hasan.

Yahoo uses this data to deliver what it hopes will be the best possible experience for its consumers, through personalization, and the most profitable experience for its advertisers, through ad targeting. "Fundamentally, what this is enabling is what we call deep analytics," said Hasan. "Doing deep analytics with a low entry barrier is really what this technology enables."

Yahoo's database is built out of commodity Intel boxes, strung together in large clusters. "The classic industry approach has been to go for big SMP [symmetric multiprocessing] boxes," Hasan explained. "We started from the ground up with the premise that all you get to use is commodity hardware and you get to take lots of little boxes and put them together."

Yahoo's database technology came out of work begun at Mahat Technologies, a Seattle-based start-up that Yahoo quietly acquired in November 2005 for an undisclosed sum.

Yahoo started with the PostgreSQL engine and replaced the query processing layer with code designed for its commodity hardware cluster.
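Yahoo has not published its query layer, so the following is only a toy sketch of the general idea behind running one query across many small machines: split the rows among nodes, let each node aggregate its own share, then merge the partial results. Every name here (node_for, partition, local_count, scatter_gather) is hypothetical, and threads stand in for separate commodity boxes.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def node_for(key, num_nodes):
    """Pick a node for a row by hashing its key (stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition(rows, num_nodes):
    """Spread rows across nodes so each node holds a disjoint shard."""
    shards = [[] for _ in range(num_nodes)]
    for row in rows:
        shards[node_for(row["user"], num_nodes)].append(row)
    return shards


def local_count(shard):
    """Per-node work: roughly SELECT page, COUNT(*) ... GROUP BY page."""
    return Counter(row["page"] for row in shard)


def scatter_gather(rows, num_nodes=4):
    """Run the per-node aggregate on every shard, then merge the partials."""
    shards = partition(rows, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(local_count, shards))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


if __name__ == "__main__":
    # Made-up page-view rows, purely for illustration.
    rows = [
        {"user": "alice", "page": "mail"},
        {"user": "bob", "page": "news"},
        {"user": "carol", "page": "mail"},
        {"user": "alice", "page": "finance"},
    ]
    print(scatter_gather(rows))
```

The point of the design is that adding capacity means adding nodes rather than buying a larger single machine; the merge step stays small because each node returns only its aggregated counts.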
