The InformationWeek -- Blogs
InformationWeek's Analytics Weblog

Topics:   Analytics : Cloud Computing


Does MapReduce Signal The End Of The Relational Era?


Posted by Roger Smith, Dec 4, 2008 05:23 PM

Companies such as Google, Yahoo, and Microsoft that operate Internet-scale cloud services need to store and process massive data sets, such as search logs, Web content collected by crawlers, and click-streams collected from a variety of Web services. Each of these companies has developed its own strategy to support parallel computations over multi-petabyte data sets on large clusters of computers.


As I wrote last week, the Google Systems Infrastructure Team used Google's MapReduce software framework to sort an astounding one petabyte of data (10 trillion 100-byte records) on 4,000 computers in six hours and two minutes. Earlier this year, Yahoo used Hadoop, an open-source MapReduce implementation, to sort one terabyte of data in 209 seconds on a 910-node cluster. MapReduce, as implemented by both Google's framework and Hadoop, is a parallel programming model in which users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
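To make the map and reduce roles concrete, here is a minimal single-process sketch in Python of a word count. It is an illustration only, not Google's or Hadoop's API: the names run_mapreduce, map_words, and reduce_counts are hypothetical, and a real framework would spread the map, shuffle, and reduce phases across many machines.

```python
from collections import defaultdict


def map_words(key, value):
    """Map: take one (document id, text) pair and emit (word, 1) pairs."""
    for word in value.lower().split():
        yield word, 1


def reduce_counts(key, values):
    """Reduce: merge all intermediate values that share one key."""
    return sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical in-memory driver standing in for the cluster runtime."""
    # "Shuffle" step: group intermediate values by intermediate key.
    groups = defaultdict(list)
    for key, value in records:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)
    # Reduce step: one reduce call per distinct intermediate key.
    return {ikey: reduce_fn(ikey, ivalues) for ikey, ivalues in groups.items()}


documents = [
    ("doc1", "the quick brown fox"),
    ("doc2", "the lazy dog and the fox"),
]
counts = run_mapreduce(documents, map_words, reduce_counts)
print(counts["the"], counts["fox"])
```

Because each map call sees only one input pair and each reduce call sees only one key, the work can in principle be split across as many machines as there are inputs and keys, which is what makes the model attractive at this scale.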

MapReduce adoption has not been without controversy. Earlier this year, database pioneer Michael Stonebraker decried MapReduce and MapReduce clones such as Hadoop, at least from the perspective of the database community, as:

1. A giant step backward in the programming paradigm for large-scale data intensive applications
2. A sub-optimal implementation, in that it uses brute force instead of indexing
3. Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
4. Missing most of the features that are routinely included in current DBMS
5. Incompatible with all of the tools DBMS users have come to depend on.
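The second objection, brute force versus indexing, can be illustrated with a small Python sketch. This is only a toy under stated assumptions: the records list, scan_lookup, and index_lookup are hypothetical names, and the dictionary stands in for the kind of index a DBMS would maintain.

```python
# Toy data set: (user_id, query) pairs. The contents are made up for illustration.
records = [(user_id, "query-%d" % user_id) for user_id in range(100000)]


def scan_lookup(rows, wanted):
    """Brute force: examine every row on every lookup."""
    return [row for row in rows if row[0] == wanted]


# Build the index once; a dictionary plays the role of a DBMS index here.
index = {}
for row in records:
    index.setdefault(row[0], []).append(row)


def index_lookup(idx, wanted):
    """Indexed: jump straight to the matching rows without a full scan."""
    return idx.get(wanted, [])


print(scan_lookup(records, 4242) == index_lookup(index, 4242))
```

Both lookups return the same rows. The difference is that the scan touches all 100,000 records each time, while the indexed lookup pays the build cost once and then touches only the matches.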

Not surprisingly, then, other MapReduce variants that attempt to integrate MapReduce with SQL have sprung up in the past few months, including Dryad at Microsoft, Pig at Yahoo, Hive at Facebook, and Jaql at IBM. Other platforms that provide both SQL and MapReduce interfaces within a single runtime environment include a couple of commercial frameworks, Greenplum and Aster Data.

Last year, Michael Isard of Microsoft Research gave a fascinating Google Tech Talk on the Google campus, "Dryad: A general-purpose distributed execution platform," about Microsoft's answer to MapReduce. The talk, which has been posted on YouTube, featured some spirited Q&A from Google engineers steeped in the MapReduce style of functional programming.

Functional programming emphasizes rules, pattern matching, and the application of mathematical functions, in contrast to procedural languages like C++, Java, and Basic, which tell a computer (or cluster of computers) what to do step by step: open a file, read a number, multiply by 1,000, or display something. Database query languages such as SQL sit in a third camp: they are declarative, stating what result is wanted and leaving the steps to the DBMS.
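The contrast between the step-by-step style and the functional style can be sketched in a few lines of Python. This is an illustration only; the numbers list and variable names are made up for the example.

```python
from functools import reduce

numbers = [3, 7, 12]

# Procedural style: spell out each step and mutate state as you go.
scaled_loop = []
total_loop = 0
for n in numbers:
    scaled = n * 1000
    scaled_loop.append(scaled)
    total_loop += scaled

# Functional style: apply functions to the data with no explicit loop or mutation.
scaled_fn = list(map(lambda n: n * 1000, numbers))
total_fn = reduce(lambda acc, n: acc + n, scaled_fn, 0)

print(scaled_loop == scaled_fn, total_loop == total_fn)
```

The functional version names the same two operations, map and reduce, that give MapReduce its name. Because the function passed to map has no side effects, each element can be processed independently of the others.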

In a recent post, Joe Hellerstein, a professor of Computer Science at the University of California, Berkeley, recounts that Berkeley computer science undergraduates now must learn MapReduce, boasting that "MapReduce has brought a new wave of excited, bright developers to the challenge of writing parallel programs against Big Data." Similar enthusiasm for MapReduce has led Bill McColl and others to proclaim "The End Of The Relational Era," but I'm inclined to think reports detailing the imminent death of relational databases and SQL are greatly exaggerated. I think what's more likely to happen in the near future is that major database vendors will begin offering capabilities to sort and manipulate massive data sets either directly with MapReduce or with SQL-like front ends that will reduce MapReduce complexity. It may be too early to choose a dominant paradigm for data analytics on cloud-scale data sets but, given the familiarity of a large number of developers and DBAs with SQL, I'd be surprised if a strictly functional programming paradigm for large-scale, data-intensive applications ends up carrying the day.
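As a rough idea of what a SQL-like front end over MapReduce might do, the Python sketch below expresses a query of the form SELECT url, COUNT(*) FROM clicks GROUP BY url as a map step and a reduce step. It is a toy under stated assumptions, not how Pig, Hive, or any vendor product works internally; group_by_count and the clicks data are hypothetical.

```python
from collections import defaultdict


def group_by_count(rows, column):
    """Hypothetical front end: run a GROUP BY ... COUNT(*) as map and reduce."""

    def map_fn(row):
        # Map: emit the grouping column as the key, with a count of 1.
        yield row[column], 1

    def reduce_fn(key, values):
        # Reduce: COUNT(*) is the sum of the 1s emitted for this key.
        return sum(values)

    # In-memory stand-in for the shuffle a real cluster would perform.
    groups = defaultdict(list)
    for row in rows:
        for key, value in map_fn(row):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


clicks = [{"url": "/home"}, {"url": "/search"}, {"url": "/home"}]
result = group_by_count(clicks, "url")
print(result)
```

The caller states only the grouping column, much as a SQL user would, and never writes the map or reduce function by hand.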
