Yahoo claims it has the largest SQL database in a production environment and that it will grow larger.
The result is a database made possible by both hardware and software innovations. For example, SQL databases are organized as tables, which consist of rows and columns. They are traditionally arranged as rows of data, but Yahoo chose to store its data as distributed columns.
"What we chose to do is organize it as columns," said Hasan. "What that enables, especially with deep analytics queries, is that you can go to only the data that interests you, which makes it very, very effective in terms of reducing the amount of data you have to move through for a particular query."
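The idea Hasan describes can be illustrated with a minimal sketch. It is not Yahoo's implementation: the records, the field names (user_id, page, latency_ms), and the helper functions are all hypothetical, and the sketch ignores distribution across machines. It only shows why a query that needs one field touches less data when values are grouped by column rather than by row.

```python
# Hypothetical event records; the field names are illustrative assumptions,
# not taken from Yahoo's schema.
rows = [
    {"user_id": 1, "page": "mail", "latency_ms": 120},
    {"user_id": 2, "page": "news", "latency_ms": 95},
    {"user_id": 3, "page": "mail", "latency_ms": 130},
]


def to_columns(records):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def average_latency_row_store(records):
    # Row layout: every full record is visited, even though the query
    # needs only one of its fields.
    return sum(record["latency_ms"] for record in records) / len(records)


def average_latency_column_store(columns):
    # Column layout: only the latency_ms column is read; the user_id and
    # page columns are never touched.
    values = columns["latency_ms"]
    return sum(values) / len(values)


columns = to_columns(rows)
print(average_latency_row_store(rows))
print(average_latency_column_store(columns))
```

Both functions return the same answer. The difference is in how much data each one has to walk through, which is the saving the quote refers to for analytics queries that read a few columns out of many.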
Yahoo is also using advanced techniques for data compression and parallel vector query processing, a method for using parallel processing more efficiently.
Google's BigTable database also uses commodity hardware clusters, but Hasan said that Yahoo's approach differs in that it is designed for an SQL interface. "What that enables is that you can write your programs very, very cheaply," said Hasan. "Typically with BigTable, you'd be writing a C++ or a Java program. Whereas what we can do is get the same job done with SQL, which is much more productive from a programming perspective."
Yahoo developed its own database because commercial database providers could not meet its needs. Hasan said that the commercial vendors did pretty well up to about 25 terabytes, and could even manage up to 100 terabytes. "Our needs are about 100 times higher than that," he said. "The other part we ran into was if you look at the cost, even at 100 terabytes, our engine is roughly 10 to 20 times more cost effective. That's because we were able to build in specializations for our needs."
Yahoo's data needs are substantial. According to Hasan, the travel industry's Sabre system handles 50 million events per day, credit card company Visa handles 120 million events a day, and the New York Stock Exchange has handled over 225 million events in a day. Yahoo, he said, handles 24 billion events a day, roughly two orders of magnitude more than any of those non-Internet companies.
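The comparison can be checked with a short calculation. The daily event counts below are the figures Hasan cites; the function and variable names are illustrative only.

```python
import math

# Daily event counts as cited by Hasan in the article.
YAHOO_EVENTS = 24_000_000_000
OTHER_EVENTS = {
    "Sabre": 50_000_000,
    "Visa": 120_000_000,
    "NYSE": 225_000_000,
}


def ratio_and_magnitude(yahoo_events, other_events):
    """Return how many times larger Yahoo's volume is, and its log10."""
    ratio = yahoo_events / other_events
    return ratio, math.log10(ratio)


for name, events in OTHER_EVENTS.items():
    ratio, magnitude = ratio_and_magnitude(YAHOO_EVENTS, events)
    print(f"{name}: {ratio:.0f}x, about {magnitude:.1f} orders of magnitude")
```

Even against the New York Stock Exchange's peak figure, the largest of the three, Yahoo's volume is a little over 100 times greater.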