Does MapReduce Signal The End Of The Relational Era?
Companies such as Google, Yahoo, and Microsoft that operate Internet-scale cloud services need to store and process massive data sets, such as search logs, Web content collected by crawlers, and click-streams collected from a variety of Web services. Each of these companies has developed its own strategy to support parallel computations over multiple petabyte data sets on large clusters of computers.
As I wrote last week, the Google Systems Infrastructure Team used Google's MapReduce software framework to sort an astounding one petabyte of data (10 trillion 100-byte records) on 4,000 computers in six hours and two minutes. Earlier this year, Yahoo used Hadoop, an open-source MapReduce implementation, to sort one terabyte of data on a 910-node cluster in 209 seconds. MapReduce/Hadoop is a parallel programming model in which users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
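To make the model concrete, here is a minimal in-memory sketch of the map and reduce functions just described, using the canonical word-count example. This is an illustration of the programming model only, not the actual Google or Hadoop implementation; all function names here are my own.

```python
from collections import defaultdict

def map_fn(document):
    # The map function: for each input record, emit intermediate
    # (key, value) pairs -- here, (word, 1) for every word seen.
    for word in document.split():
        yield (word, 1)

def reduce_fn(key, values):
    # The reduce function: merge all intermediate values that share
    # the same intermediate key into a single result.
    return (key, sum(values))

def mapreduce(inputs, map_fn, reduce_fn):
    # The framework's job: run map over the inputs, group the
    # intermediate pairs by key, then run reduce per key.
    groups = defaultdict(list)
    for record in inputs:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

counts = mapreduce(["the quick fox", "the lazy dog"], map_fn, reduce_fn)
# counts == {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In a real deployment the map and reduce phases run in parallel across thousands of machines, with the framework handling partitioning, scheduling, and fault tolerance; the programmer writes only the two functions.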
MapReduce adoption has not been without controversy. Earlier this year, database pioneer Michael Stonebraker decried MapReduce and MapReduce clones such as Hadoop, at least from the perspective of the database community, as:
1. A giant step backward in the programming paradigm for large-scale data intensive applications
2. A sub-optimal implementation, in that it uses brute force instead of indexing
3. Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
4. Missing most of the features that are routinely included in current DBMS
5. Incompatible with all of the tools DBMS users have come to depend on.
Not surprisingly, then, other MapReduce variants have sprung up in the past few months that attempt to integrate MapReduce with SQL, including Dryad at Microsoft, Pig at Yahoo, Hive at Facebook, and Jaql at IBM. Other platforms that provide both SQL and MapReduce interfaces within a single runtime environment include a couple of commercial frameworks, Greenplum and Aster Data.
Last year, Michael Isard of Microsoft Research gave a fascinating Google Tech Talk on the Google campus that's been posted on YouTube, "Dryad: A general-purpose distributed execution platform", about Microsoft's answer to MapReduce, which featured some spirited Q&A from Google engineers steeped in the MapReduce style of functional programming.
Functional programming emphasizes rules, pattern matching, and the application of mathematical functions, in contrast to procedural languages like C++, Java, and Basic, which basically tell a computer (or cluster of computers) what to do, step by step: open a file, read a number, multiply by 1,000, display something. Declarative query languages such as SQL are different again: they describe the result wanted and leave the execution steps to the system.
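A small, hypothetical example may make the contrast clearer: summing the squares of the even numbers in a list, written first step by step and then as a composition of functions with no mutable state.

```python
from functools import reduce

nums = [1, 2, 3, 4, 5, 6]

# Procedural style: tell the machine what to do, one step at a time.
total = 0
for n in nums:
    if n % 2 == 0:
        total += n * n

# Functional style: compose filter, map, and reduce over the data.
total_fp = reduce(lambda acc, x: acc + x,
                  map(lambda n: n * n,
                      filter(lambda n: n % 2 == 0, nums)),
                  0)

assert total == total_fp == 56
```

The functional version is the one that parallelizes naturally: because each map and filter application is independent and there is no shared mutable state, a framework like MapReduce can split the work across machines without changing the program's meaning.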
In a recent post, Joe Hellerstein, a professor of computer science at the University of California, Berkeley, recounts that Berkeley computer science undergraduates now must learn MapReduce, boasting that "MapReduce has brought a new wave of excited, bright developers to the challenge of writing parallel programs against Big Data." Similar enthusiasm for MapReduce has led Bill McColl and others to proclaim "The End Of The Relational Era," but I'm inclined to think reports of the imminent death of relational databases and SQL are greatly exaggerated. What's more likely to happen in the near future is that major database vendors will begin offering capabilities to sort and manipulate massive data sets either directly with MapReduce or with SQL-like front ends that reduce MapReduce's complexity. It may be too early to choose a dominant paradigm for data analytics on cloud-scale data sets, but given how many developers and DBAs are familiar with SQL, I'd be surprised if a strictly functional programming paradigm for large-scale data-intensive applications ends up carrying the day.