Yahoo now is seeking recognition for a different accomplishment: The embattled search company and community portal claims that it has the largest SQL database in a production environment.
A petabyte equals one thousand terabytes, one million gigabytes, or one billion megabytes. It's an uncommon enough measurement that the word "petabyte" is not yet recognized by Microsoft Word 2007's spell checker.
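The petabyte figures can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units, where each step up is a factor of 1,000, rather than the binary 1,024-based units some tools report. The names `UNITS` and `convert` are illustrative only.

```python
# Decimal (SI) storage units expressed in bytes.
# Assumption: powers of 1000, not the binary (1024-based) variants.
UNITS = {
    "megabyte": 10**6,
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
}


def convert(amount, from_unit, to_unit):
    """Convert a whole-number amount between two units in UNITS.

    Hypothetical helper for illustration; uses integer division, so
    converting a small amount into a larger unit truncates toward zero.
    """
    return amount * UNITS[from_unit] // UNITS[to_unit]


# Print how many of each smaller unit fit in one petabyte.
for unit in ("terabyte", "gigabyte", "megabyte"):
    print(f"1 petabyte = {convert(1, 'petabyte', unit):,} {unit}s")
```

Keeping every unit as a byte count in one table means any pair can be converted with a single multiply and divide, with no chain of intermediate factors.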
"The amount of data that we get is much more than the traditional industry and even in the Internet space is significantly more than other large players," said Hasan. The reason for this, he explained, is that consumers spend twice as long on Yahoo as they do at Google and three times as long on Yahoo as they do at Microsoft's sites. (This, in part, explains Microsoft's interest in acquiring Yahoo.)
The data Yahoo gathers is structured data, as opposed to unstructured data like e-mail and other documents. "It's about how people use our Web site, both from the advertising perspective and from the consumer experience perspective," said Hasan.
Yahoo uses this data to deliver what it hopes will be the best possible experience for its consumers, through personalization, and the most profitable experience for its advertisers, through ad targeting. "Fundamentally, what this is enabling is what we call deep analytics," said Hasan. "Doing deep analytics with a low entry barrier is really what this technology enables."
Yahoo's database is built out of commodity Intel boxes, strung together in large clusters. "The classic industry approach has been to go for big SMP [symmetric multiprocessing] boxes," Hasan explained. "We started from the ground up with the premise that all you get to use is commodity hardware and you get to take lots of little boxes and put them together."
Yahoo's database technology came out of work begun at Mahat Technologies, a Seattle-based start-up that Yahoo quietly acquired in November 2005 for an undisclosed sum.
Yahoo started with the PostgreSQL engine and replaced the query processing layer with code designed for its commodity hardware cluster.