The InformationWeek -- Blogs

InformationWeek's Analytics Weblog

Topics:   Analytics : Cloud Computing


Google Sorts One Petabyte Of Data In 6 Hours


Posted by Roger Smith, Nov 26, 2008 02:49 PM

According to last Friday's Official Google Blog, the Google Systems Infrastructure Team has sorted 1 terabyte of data on 1,000 computers in a record 68 seconds, breaking the previous mark of 209 seconds set in July by Yahoo.


Team leader Grzegorz Czajkowski wrote that the team followed the rules of a standard terabyte sort benchmark and used Google's MapReduce, a software framework that supports parallel computations over large (multi-petabyte) data sets on clusters of computers. Yahoo's effort had featured a 910-node cluster and used Hadoop, an open-source MapReduce implementation.
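For readers curious how a sort maps onto the MapReduce model, here is a minimal single-process sketch. It is not Google's or Yahoo's implementation; it only illustrates the common range-partitioning idea, assuming records are byte strings compared as whole values. The function names (`pick_boundaries`, `map_phase`, `reduce_phase`, `partitioned_sort`) are hypothetical and chosen for illustration.

```python
import bisect
import random


def pick_boundaries(records, num_partitions, sample_size=100, seed=0):
    """Sample the input and choose split points that divide the key space."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(records, min(sample_size, len(records))))
    step = len(sample) / num_partitions
    return [sample[int(i * step)] for i in range(1, num_partitions)]


def map_phase(records, boundaries):
    """Map: route each record to the partition whose key range contains it."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for record in records:
        partitions[bisect.bisect_right(boundaries, record)].append(record)
    return partitions


def reduce_phase(partitions):
    """Reduce: sort every partition independently, then concatenate in order."""
    output = []
    for part in partitions:
        output.extend(sorted(part))
    return output


def partitioned_sort(records, num_partitions=4):
    """Sort records by range-partitioning them first (assumed worker count: 4)."""
    if not records:
        return []
    boundaries = pick_boundaries(records, num_partitions)
    return reduce_phase(map_phase(records, boundaries))


if __name__ == "__main__":
    rng = random.Random(42)
    # 1,000 random 100-byte records stand in for the benchmark input.
    data = [bytes(rng.getrandbits(8) for _ in range(100)) for _ in range(1000)]
    print(partitioned_sort(data) == sorted(data))
```

Because every record in one partition sorts before every record in the next, each partition can be sorted on a different machine and the results simply concatenated. In a real cluster the partitions would live on separate nodes and the routing step would move data over the network.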

The sort benchmark, which was created in 1998 by computer scientist Jim Gray, specifies the input data (10 billion 100-byte records in uncompressed text files), which must be completely sorted and written to disk. Not content with just rewriting the record book, the Google team then decided to up the ante in sorting massive volumes of data.

"Sometimes you need to sort more than a terabyte, so we were curious to find out what happens when you sort more and gave one petabyte (PB) a try," said Czajkowski. "It took six hours and two minutes to sort 1 PB (10 trillion 100-byte records) on 4,000 computers. We're not aware of any other sorting experiment at this scale and are obviously very excited to be able to process so much data so quickly."
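The figures quoted so far are easy to check with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes), which is what the record counts in the benchmark description imply; the helper name `throughput` is hypothetical.

```python
TB = 10**12          # assumption: decimal terabyte
PB = 10**15          # assumption: decimal petabyte
RECORD_BYTES = 100   # record size given by the benchmark


def throughput(total_bytes, seconds, machines):
    """Return (aggregate bytes/s, bytes/s per machine) for one sort run."""
    aggregate = total_bytes / seconds
    return aggregate, aggregate / machines


# 1 TB on 1,000 computers in 68 seconds
tb_aggregate, tb_per_machine = throughput(1 * TB, 68, 1000)

# 1 PB on 4,000 computers in six hours and two minutes
pb_seconds = 6 * 3600 + 2 * 60
pb_aggregate, pb_per_machine = throughput(1 * PB, pb_seconds, 4000)

tb_records = TB // RECORD_BYTES   # 10 billion records
pb_records = PB // RECORD_BYTES   # 10 trillion records

if __name__ == "__main__":
    print(f"1 TB run: {tb_aggregate / 1e9:.1f} GB/s total, "
          f"{tb_per_machine / 1e6:.1f} MB/s per machine")
    print(f"1 PB run: {pb_aggregate / 1e9:.1f} GB/s total, "
          f"{pb_per_machine / 1e6:.1f} MB/s per machine")
    print(f"Speedup over the 209-second mark: {209 / 68:.1f}x")
```

Under those assumptions the terabyte run moves roughly 14.7 GB/s across the cluster and the petabyte run roughly 46 GB/s, while per-machine throughput drops from about 14.7 MB/s to about 11.5 MB/s.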

One petabyte is a thousand terabytes, or roughly 12 times the amount of archived Web data in the U.S. Library of Congress as of May 2008. One way to put that amount in perspective, according to Czajkowski, is to consider that the aggregate size of data processed by all instances of MapReduce at Google was, on average, 20 PB per day in January 2008. A paper explaining MapReduce on the Google labs site says that upwards of one thousand MapReduce jobs are executed on Google's clusters every day. At about a thousand jobs a day, a typical job processes roughly 20 TB, so the infrastructure team's 1 PB sort works out to about 50 typical MapReduce jobs, or one-twentieth of a day's total MapReduce workload on Google's clusters.
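The comparison with Google's daily workload can be reproduced as follows. This is a back-of-the-envelope sketch that assumes exactly 1,000 jobs per day (the paper says "upwards of" that number, so the real average job is somewhat smaller) and treats every job as the same size; the variable names are illustrative.

```python
from fractions import Fraction

DAILY_PB = 20         # average data processed per day, January 2008
JOBS_PER_DAY = 1000   # assumption: lower bound taken as the exact count
SORT_PB = 1           # size of the petabyte sort

# Average data per job: 20 PB / 1,000 jobs = 1/50 PB, i.e. 20 TB
avg_job_pb = Fraction(DAILY_PB, JOBS_PER_DAY)

# How many average-sized jobs the 1 PB sort represents
equivalent_jobs = SORT_PB / avg_job_pb

# Share of one day's total workload
share_of_day = Fraction(SORT_PB, DAILY_PB)

if __name__ == "__main__":
    print(f"Average job: {float(avg_job_pb) * 1000:.0f} TB")
    print(f"Equivalent typical jobs: {equivalent_jobs}")
    print(f"Share of daily workload: {share_of_day}")
```

Exact fractions are used instead of floats so that the results come out as clean ratios (50 jobs, 1/20 of a day) rather than rounded decimals.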

As I reported a couple of months ago, Microsoft has its own strategy for sorting massive data sets, which I gleaned from reading a white paper presented at a database conference. All companies that operate Internet-scale cloud services need to store and process massive data sets, such as search logs, Web content collected by crawlers, and click-streams collected from a variety of Web services. Google, Yahoo, and Microsoft have each developed their own systems that support parallel computations over multi-petabyte data sets on clusters of computers. While Google and Yahoo rely on the map-reduce programming model, Microsoft's Scope programming model intentionally builds on end-user knowledge of relational data and SQL. Microsoft's sorting strategy at this point appears to be primarily conceptual since, unlike Google and Yahoo, it hasn't competed in any recent benchmark tests.
