Yahoo Claims Record With Petabyte Database

Yahoo claims it has the largest SQL database in a production environment and that it will grow larger.

Thomas Claburn, Editor at Large, Enterprise Mobility

May 20, 2008

4 Min Read

Several years ago, Google and Yahoo fought for bragging rights about which company had the biggest Web index. Google put an end to that game in 2005 when it declared that its index was three times larger than Yahoo's. After that, the debate shifted to search relevance.

Yahoo now is seeking recognition for a different accomplishment: The embattled search company and community portal claims that it has the largest SQL database in a production environment.

"This is the first time, that we know of, that someone has put a one petabyte-plus database into production," said Waqar Hasan, VP of data at Yahoo. "We have built it to scale to tens of petabytes and we intend to get there. Come 2009, we'll be at multiple tens of petabytes."

A petabyte equals one thousand terabytes, one million gigabytes, or 1 trillion megabytes. It's an uncommon enough measurement that the word "petabyte" is not yet recognized by Microsoft Word 2007's spell checker.

"The amount of data that we get is much more than the traditional industry and even in the Internet space is significantly more than other large players," said Hasan. The reason for this, he explained, is that consumers spend twice as long on Yahoo as they do at Google and three times as long on Yahoo as they do at Microsoft's sites. (This, in part, explains Microsoft's interest in acquiring Yahoo.)

The data Yahoo gathers is structured data, as opposed to unstructured data like e-mail and other documents. "It's about how people use our Web site, both from the advertising perspective and from the consumer experience perspective," said Hasan.

Yahoo uses this data to deliver what it hopes will be the best possible experience for its consumers, through personalization, and the most profitable experience for its advertisers, through ad targeting. "Fundamentally, what this is enabling is what we call deep analytics," said Hasan. "Doing deep analytics with a low entry barrier is really what this technology enables."

Yahoo's database is built out of commodity Intel boxes, strung together in large clusters. "The classic industry approach has been to go for big SMP [symmetric multiprocessing] boxes," Hasan explained. "We started from the ground up with the premise that all you get to use is commodity hardware and you get to take lots of little boxes and put them together."

Yahoo's database technology came out of work begun at Mahat Technologies, a Seattle-based start-up that Yahoo quietly acquired in November 2005 for an undisclosed sum.

Yahoo started with the PostgreSQL engine and replaced the query processing layer with code designed for its commodity hardware cluster. The result is a database made possible by both hardware and software innovations. For example, SQL databases are organized as tables, which consist of rows and columns. They are traditionally arranged as rows of data, but Yahoo chose to store its data as distributed columns.

"What we chose to do is organize it as columns," said Hasan. "What that enables, especially with deep analytics queries, is that you can go to only the data that interests you, which makes it very, very effective in terms reducing the amount of data you have to move through for a particular query."

Yahoo is also using advanced techniques for data compression and parallel vector query processing, a method for using parallel processing more efficiently.

Google's BigTable database also uses commodity hardware clusters, but Hasan said that Yahoo's approach differs in that it is designed for an SQL interface. "What that enables is that you can write your programs very, very cheaply," said Hasan. "Typically with BigTable, you'd be writing a C++ or a Java program. Whereas what we can do is get the same job done with SQL, which is much more productive from a programming perspective."

The reason Yahoo developed its database was that commercial database providers just couldn't meet its needs. Hasan said that the commercial vendors did pretty well up to about 25 terabytes, and could even manage up to 100 terabytes. "Our needs are about 100 times higher than that," he said. "The other part we ran into was if you look at the cost, even at 100 terabytes, our engine is roughly 10 and 20 times more cost effective. That's because we were able to build in specializations for our needs."

Yahoo's data needs are substantial. According to Hasan, the travel industry's Sabre system handles 50 million events per day, credit card company Visa handles 120 million events a day, and the New York Stock Exchange has handled over 225 million events in a day. Yahoo, he said, handles 24 billion events a day, fully two orders of magnitude more than other non-Internet companies.

About the Author(s)

Thomas Claburn

Editor at Large, Enterprise Mobility

Thomas Claburn has been writing about business and technology since 1996, for publications such as New Architect, PC Computing, InformationWeek, Salon, Wired, and Ziff Davis Smart Business. Before that, he worked in film and television, having earned a not particularly useful master's degree in film production. He wrote the original treatment for 3DO's Killing Time, a short story that appeared in On Spec, and the screenplay for an independent film called The Hanged Man, which he would later direct. He's the author of a science fiction novel, Reflecting Fires, and a sadly neglected blog, Lot 49. His iPhone game, Blocfall, is available through the iTunes App Store. His wife is a talented jazz singer; he does not sing, which is for the best.

Never Miss a Beat: Get a snapshot of the issues affecting the IT industry straight to your inbox.

You May Also Like

More Insights