Software // Information Management
News
5/20/2008
08:04 PM
Connect Directly
Google+
LinkedIn
Twitter
RSS
E-Mail
50%
50%

Yahoo Claims Record With Petabyte Database

Yahoo claims it has the largest SQL database in a production environment and that it will grow larger.

Several years ago, Google and Yahoo fought for bragging rights about which company had the biggest Web index. Google put an end to that game in 2005 when it declared that its index was three times larger than Yahoo's. After that, the debate shifted to search relevance.

Yahoo now is seeking recognition for a different accomplishment: The embattled search company and community portal claims that it has the largest SQL database in a production environment.

"This is the first time, that we know of, that someone has put a one petabyte-plus database into production," said Waqar Hasan, VP of data at Yahoo. "We have built it to scale to tens of petabytes and we intend to get there. Come 2009, we'll be at multiple tens of petabytes."

A petabyte equals one thousand terabytes, one million gigabytes, or 1 trillion megabytes. It's an uncommon enough measurement that the word "petabyte" is not yet recognized by Microsoft Word 2007's spell checker.

"The amount of data that we get is much more than the traditional industry and even in the Internet space is significantly more than other large players," said Hasan. The reason for this, he explained, is that consumers spend twice as long on Yahoo as they do at Google and three times as long on Yahoo as they do at Microsoft's sites. (This, in part, explains Microsoft's interest in acquiring Yahoo.)

The data Yahoo gathers is structured data, as opposed to unstructured data like e-mail and other documents. "It's about how people use our Web site, both from the advertising perspective and from the consumer experience perspective," said Hasan.

Yahoo uses this data to deliver what it hopes will be the best possible experience for its consumers, through personalization, and the most profitable experience for its advertisers, through ad targeting. "Fundamentally, what this is enabling is what we call deep analytics," said Hasan. "Doing deep analytics with a low entry barrier is really what this technology enables."

Yahoo's database is built out of commodity Intel boxes, strung together in large clusters. "The classic industry approach has been to go for big SMP [symmetric multiprocessing] boxes," Hasan explained. "We started from the ground up with the premise that all you get to use is commodity hardware and you get to take lots of little boxes and put them together."

Yahoo's database technology came out of work begun at Mahat Technologies, a Seattle-based start-up that Yahoo quietly acquired in November 2005 for an undisclosed sum.

Yahoo started with the PostgreSQL engine and replaced the query processing layer with code designed for its commodity hardware cluster.

Previous
1 of 2
Next
Comment  | 
Print  | 
More Insights
The Agile Archive
The Agile Archive
When it comes to managing data, donít look at backup and archiving systems as burdens and cost centers. A well-designed archive can enhance data protection and restores, ease search and e-discovery efforts, and save money by intelligently moving data from expensive primary storage systems.
Register for InformationWeek Newsletters
White Papers
Current Issue
InformationWeek Tech Digest Septermber 14, 2014
It doesn't matter whether your e-commerce D-Day is Black Friday, tax day, or some random Thursday when a post goes viral. Your websites need to be ready.
Flash Poll
Video
Slideshows
Twitter Feed
InformationWeek Radio
Sponsored Live Streaming Video
Everything You've Been Told About Mobility Is Wrong
Attend this video symposium with Sean Wisdom, Global Director of Mobility Solutions, and learn about how you can harness powerful new products to mobilize your business potential.