Software // Information Management
08:04 PM
Connect Directly
The Analytics Job and Salary Outlook for 2016
Jan 28, 2016
With data science and big data top-of-mind for all types of organizations, hiring analytics profes ...Read More>>

Yahoo Claims Record With Petabyte Database

Yahoo claims it has the largest SQL database in a production environment and that it will grow larger.

The result is a database made possible by both hardware and software innovations. For example, SQL databases are organized as tables, which consist of rows and columns. They are traditionally arranged as rows of data, but Yahoo chose to store its data as distributed columns.

"What we chose to do is organize it as columns," said Hasan. "What that enables, especially with deep analytics queries, is that you can go to only the data that interests you, which makes it very, very effective in terms reducing the amount of data you have to move through for a particular query."

Yahoo is also using advanced techniques for data compression and parallel vector query processing, a method for using parallel processing more efficiently.

Google's BigTable database also uses commodity hardware clusters, but Hasan said that Yahoo's approach differs in that it is designed for an SQL interface. "What that enables is that you can write your programs very, very cheaply," said Hasan. "Typically with BigTable, you'd be writing a C++ or a Java program. Whereas what we can do is get the same job done with SQL, which is much more productive from a programming perspective."

The reason Yahoo developed its database was that commercial database providers just couldn't meet its needs. Hasan said that the commercial vendors did pretty well up to about 25 terabytes, and could even manage up to 100 terabytes. "Our needs are about 100 times higher than that," he said. "The other part we ran into was if you look at the cost, even at 100 terabytes, our engine is roughly 10 and 20 times more cost effective. That's because we were able to build in specializations for our needs."

Yahoo's data needs are substantial. According to Hasan, the travel industry's Sabre system handles 50 million events per day, credit card company Visa handles 120 million events a day, and the New York Stock Exchange has handled over 225 million events in a day. Yahoo, he said, handles 24 billion events a day, fully two orders of magnitude more than other non-Internet companies.

2 of 2
Comment  | 
Print  | 
More Insights
The Agile Archive
The Agile Archive
When it comes to managing data, donít look at backup and archiving systems as burdens and cost centers. A well-designed archive can enhance data protection and restores, ease search and e-discovery efforts, and save money by intelligently moving data from expensive primary storage systems.
Register for InformationWeek Newsletters
White Papers
Current Issue
How to Knock Down Barriers to Effective Risk Management
Risk management today is a hodgepodge of systems, siloed approaches, and poor data collection practices. That isn't how it should be.
Twitter Feed
InformationWeek Radio
Sponsored Live Streaming Video
Everything You've Been Told About Mobility Is Wrong
Attend this video symposium with Sean Wisdom, Global Director of Mobility Solutions, and learn about how you can harness powerful new products to mobilize your business potential.