Commentary
Does MapReduce Signal The End Of The Relational Era?
Companies such as Google, Yahoo, and Microsoft that operate Internet-scale cloud services need to store and process massive data sets, such as search logs, Web content collected by crawlers, and click-streams collected from a variety of Web services. Each of these companies has developed its own strategy to support parallel computations over multiple petabyte data sets on large clusters of computers.Companies such as Google, Yahoo, and Microsoft that operate Internet-scale cloud services need to store and process massive data sets, such as search logs, Web content collected by crawlers, and click-streams collected from a variety of Web services. Each of these companies has developed its own strategy to support parallel computations over multiple petabyte data sets on large clusters of computers.As I wrote last week, the Google Systems Infrastructure Team used Google's MapReduce software framework to sort an astounding one petabyte of data (10 trillion 100-byte records) on 4,000 computers in six hours and two minutes. Earlier this year, Yahoo used Hadoop, an open-source MapReduce implementation, to sort one terabyte of data on 1,000 computers in 209 seconds on a 910-node cluster. MapReduce/Hadoop is a parallel programming model where users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
MapReduce adoption has not been without controversy. Earlier this year, database pioneer Michael Stonebraker decried MapReduce and MapReduce clones such as Hadoop, at least from the perspective of the database community, as:
More Software Insights
White Papers
- Creating the Enterprise-Class Tablet Environment - by Yankee Group
- The BlackBerry PlayBook tablet's Good Bones - by BlackBerry
Reports
More >>Webcasts
- Maximize ROI with Database Consolidation onto Private Clouds
- Outsourcing Security: What Every Potential Cloud Security Customer Should Know
1. A giant step backward in the programming paradigm for large-scale data intensive applications 2. A sub-optimal implementation, in that it uses brute force instead of indexing 3. Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago 4. Missing most of the features that are routinely included in current DBMS 5. Incompatible with all of the tools DBMS users have come to depend on.
Not surprisingly, then, other MapReduce variants have sprang up in the past few months that attempt to integrate MapReduce with SQL including Dryad at Microsoft, Pig at Yahoo, Hive at Facebook, and Jaql at IBM. Other platforms that provide both SQL and MapReduce interfaces within a single runtime environment include a couple of commercial frameworks, Greenplum and Aster Data.
Last year, Michael Isard of Microsoft Research gave a fascinating Google Tech Talk on the Google campus that's been posted on YouTube, "Dryad: A general-purpose distributed execution platform", about Microsoft's answer to MapReduce, which featured some spirited Q&A from Google engineers steeped in the MapReduce style of functional programming.
Functional programming emphasize rules, pattern-matching and the application of mathematical functions, in contrast to procedural languages like C++, Java, Basic, and database query languages such as SQL, which basically tell a computer (or cluster of computers) what to do, step-by-step: i.e., open a file, read a number, multiply by 1,000, or display something.
In a recent post, Joe Hellerstein, a professor of Computer Science at the University of California, Berkeley, recounts that Berkeley computer science undergraduates now must learn MapReduce, boasting that "MapReduce has brought a new wave of excited, bright developers to the challenge of writing parallel programs against Big Data." Similar enthusiasm for MapReduce has lead Bill McColl and others to proclaim "The End Of The Relational Era" but I'm inclined to think reports detailing the eminent death of relational databases and SQL are greatly exaggerated. I think what's more likely to happen in the near future is that major database vendors will begin offering capabilities to sort and manipulate massive data sets either directly with MapReduce or with SQL-like front-ends that will reduce MapReduce complexity. It may be too early to choose a dominant paradigm for data analytics for cloud-scale data sets but, given the familiarity of large number of developers and DBAs with SQL, I'd be surprised if a strictly functional programming paradigm for large-scale data intensive applications ends up carrying the day.
Related Reading
| To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy. | |
|
|
T-Shirt Giveaway: Each week we're selecting one great comment from our readers. The author of the comment will receive an InformaitonWeek Community t-shirt. So get posting! |
Subscribe to RSSResource Links
This Week's Issue
Technology Whitepapers
- Mobile BI: Actionable Intelligence for the Agile Enterprise
- Creating the Enterprise-Class Tablet Environment - by Yankee Group
- How To Regain IT Control In An Increasingly Mobile World - by BlackBerry
- The BlackBerry PlayBook tablet's Good Bones - by BlackBerry
- New Visual and Wizard-Driven Paradigms for Exploring Data and Developing Analytic Workflows
Featured Broadcast
This white paper explains how to create a manageable, scalable environment suited to answer real-time business needs by building out a data center on a standards-based, virtualization-aware, energy-efficient and affordable platform. Plus, learn how virtualization is making the jump from the server realm into the application, mobile and database worlds in the additional resources section.
Learn More












