Powered by InformationWeek Business Technology Network
Does MapReduce Signal The End Of The Relational Era?
As I wrote last week, the Google Systems Infrastructure Team used Google's MapReduce software framework to sort an astounding one petabyte of data (10 trillion 100-byte records) on 4,000 computers in six hours and two minutes. Earlier this year, Yahoo used Hadoop, an open-source MapReduce implementation, to sort one terabyte of data in 209 seconds on a 910-node cluster.

MapReduce, which Hadoop implements, is a parallel programming model in which users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.

MapReduce adoption has not been without controversy. Earlier this year, database pioneer Michael Stonebraker decried MapReduce and MapReduce clones such as Hadoop, at least from the perspective of the database community, as:

1. A giant step backward in the programming paradigm for large-scale data intensive applications

Not surprisingly, then, other MapReduce variants have sprung up in the past few months that attempt to integrate MapReduce with SQL, including Dryad at Microsoft, Pig at Yahoo, Hive at Facebook, and Jaql at IBM. Other platforms that provide both SQL and MapReduce interfaces within a single runtime environment include a couple of commercial frameworks, Greenplum and Aster Data.

Last year, Michael Isard of Microsoft Research gave a fascinating Google Tech Talk on the Google campus that's been posted on YouTube, "Dryad: A general-purpose distributed execution platform", about Microsoft's answer to MapReduce, which featured some spirited Q&A from Google engineers steeped in the MapReduce style of functional programming.
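To make the map and reduce functions described earlier a little more concrete, here is a minimal single-machine sketch in Python that counts words. It is only an illustration of the programming model, not Google's or Hadoop's actual API: the names `map_fn`, `reduce_fn`, and `run_mapreduce` and the sample documents are all made up for this example, and a real framework would spread the map, shuffle, and reduce work across many machines.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: turn one (doc_id, text) pair into intermediate (word, 1) pairs."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: merge every intermediate value that shares the same key."""
    return word, sum(counts)


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical in-memory driver standing in for the framework."""
    # "Shuffle" step: group the intermediate values by their key.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Apply the reduce function once per distinct intermediate key.
    return dict(reduce_fn(key, values) for key, values in groups.items())


# Made-up input: (document id, document text) pairs.
documents = [
    ("doc1", "the quick brown fox"),
    ("doc2", "the lazy dog"),
    ("doc3", "the fox"),
]

word_counts = run_mapreduce(documents, map_fn, reduce_fn)
print(word_counts)
```

The user writes only `map_fn` and `reduce_fn`; everything in `run_mapreduce` is what the framework is expected to handle, which is where the parallelism comes from in a real cluster.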
Functional programming emphasizes rules, pattern-matching and the application of mathematical functions, in contrast to procedural languages like C++, Java, and Basic, which basically tell a computer (or cluster of computers) what to do, step-by-step: e.g., open a file, read a number, multiply by 1,000, or display something. Database query languages such as SQL are different again: they declare what result is wanted and leave the steps to the database engine.

In a recent post, Joe Hellerstein, a professor of Computer Science at the University of California, Berkeley, recounts that Berkeley computer science undergraduates now must learn MapReduce, boasting that "MapReduce has brought a new wave of excited, bright developers to the challenge of writing parallel programs against Big Data." Similar enthusiasm for MapReduce has led Bill McColl and others to proclaim "The End Of The Relational Era," but I'm inclined to think reports detailing the imminent death of relational databases and SQL are greatly exaggerated.

I think what's more likely to happen in the near future is that major database vendors will begin offering capabilities to sort and manipulate massive data sets either directly with MapReduce or with SQL-like front-ends that will reduce MapReduce complexity. It may be too early to choose a dominant paradigm for data analytics for cloud-scale data sets but, given how familiar a large number of developers and DBAs are with SQL, I'd be surprised if a strictly functional programming paradigm for large-scale data intensive applications ends up carrying the day.
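As a rough illustration of why a SQL-like front-end can reduce MapReduce complexity, the sketch below computes the same per-page totals twice in Python: once with hand-written map and reduce functions, and once with a single SQL statement run through the standard library's `sqlite3` module. The `page_views` table and its rows are invented for the example, and SQLite merely stands in for a SQL interface here; this is not how Hive, Pig, or any vendor's product works internally.

```python
import sqlite3
from collections import defaultdict

# Made-up sample data: (page, views) rows.
page_views = [("home", 3), ("search", 5), ("home", 4), ("cart", 1), ("search", 2)]


# --- Version 1: hand-written map and reduce functions ---
def map_fn(row):
    """Map: emit an intermediate (page, views) pair for one input row."""
    page, views = row
    yield page, views


def reduce_fn(page, values):
    """Reduce: sum all intermediate values that share the same page key."""
    return page, sum(values)


groups = defaultdict(list)  # group intermediate values by key
for row in page_views:
    for key, value in map_fn(row):
        groups[key].append(value)
mapreduce_totals = dict(reduce_fn(key, values) for key, values in groups.items())

# --- Version 2: the same aggregation as one declarative SQL query ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)", page_views)
sql_totals = dict(
    conn.execute("SELECT page, SUM(views) FROM page_views GROUP BY page")
)
conn.close()

print(mapreduce_totals == sql_totals)
```

Both versions produce the same totals; the difference is that the SQL version states the desired result in one line that many developers and DBAs already know how to write.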