The InformationWeek -- Blogs

InformationWeek's Analytics Weblog



Microsoft's SQL Strategy For Massive Data Sets


Posted by Roger Smith, Aug 27, 2008 08:10 PM

Cloud computing service providers like Microsoft, Google, and Yahoo are all hard at work on a new generation of parallel data processing tools that will make it easier for each company to store and analyze enormous data sets such as search logs and click streams.


One of the more interesting papers presented at this week's VLDB (Very Large Data Bases) conference in Auckland, New Zealand, is "Scope: Easy and Efficient Parallel Processing of Massive Data Sets" (PDF). It describes a parallel data processing tool developed by Microsoft Research that is used daily inside Microsoft to process petabytes of data on large clusters of thousands of commodity servers, including those Microsoft will use to equip the new $500 million data center that the Redmond, Wash.-based company is building in West Des Moines, Iowa.

Microsoft's Scope is similar to Yahoo's Pig, which is a higher-level language on top of Yahoo's Hadoop distributed software framework, and to Google's Sawzall, which is a higher-level language on top of the MapReduce framework and the Google File System. According to the VLDB08 paper's authors (Ronnie Chaiken, Bob Jenkins, Per-Ake Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou), Pig and Sawzall promote a more functional or mathematical programming style, whereas Scope looks much more like SQL.

"SCOPE has a strong resemblance to SQL -- an intentional design choice. . . Users familiar with SQL require little or no training to use Scope. Like SQL, data is modeled as a set of rows composed of typed columns. Every rowset has a well-defined schema ... It allows users to focus on the data transformations required to solve the problem at hand and hides the complexity of the underlying platform and implementation details."

The language is high-level and declarative so that the Scope compiler and optimizer can optimize Scope scripts and improve their execution over time. Hardware and implementation details are hidden from users.

According to the paper, Scope is highly extensible. Users can easily create customized operators, including:

"• extractors (for parsing and constructing rows from a file),
• processors (for row-wise processing),
• reducers (for group-wise processing), and
• combiners (for combining rows from two inputs) ...

[all of which] allow users to solve problems that cannot easily be expressed in traditional SQL."
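To make the four operator kinds concrete, here is a minimal single-machine sketch in Python. It is not Scope code and does not reflect Scope's actual interfaces; the function names (extract, process, reduce_groups, combine), the tab-separated log format, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Row = Dict[str, object]  # a row is modeled as column name -> value


def extract(lines: Iterable[str], columns: List[str]) -> Iterator[Row]:
    """Extractor: parse raw text lines into rows (tab-separated here, by assumption)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == len(columns):  # skip malformed lines
            yield dict(zip(columns, fields))


def process(rows: Iterable[Row], fn: Callable[[Row], Optional[Row]]) -> Iterator[Row]:
    """Processor: row-wise step; fn returns a new row, or None to drop the row."""
    for row in rows:
        out = fn(row)
        if out is not None:
            yield out


def reduce_groups(rows: Iterable[Row], key: str,
                  fn: Callable[[object, List[Row]], Row]) -> Iterator[Row]:
    """Reducer: group rows by one column, then emit one row per group."""
    groups: Dict[object, List[Row]] = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    for value, members in groups.items():
        yield fn(value, members)


def combine(left: Iterable[Row], right: Iterable[Row], key: str) -> Iterator[Row]:
    """Combiner: merge rows from two inputs (an inner join on `key` in this sketch)."""
    index: Dict[object, List[Row]] = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    for l in left:
        for r in index.get(l[key], ()):
            yield {**r, **l}


# Hypothetical sample data: a tiny query log and a user -> region table.
log = ["alice\tscope paper", "bob\tpig latin", "alice\tSCOPE PAPER", "malformed"]
rows = extract(log, ["user", "query"])
rows = process(rows, lambda r: {**r, "query": str(r["query"]).lower()})
counts = list(reduce_groups(rows, "user", lambda user, g: {"user": user, "n": len(g)}))
regions = [{"user": "alice", "region": "NZ"}]
joined = list(combine(counts, regions, "user"))
print(counts)
print(joined)
```

Each function consumes and produces a stream of rows, so the steps chain together much like the stages of a query plan. A real system would run each step in parallel across many machines, which this sketch does not attempt.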

All companies that operate Internet-scale services have the need to store and process massive data sets, such as search logs, Web content collected by crawlers, and click-streams collected from a variety of Web services. Google, Yahoo, and Microsoft have developed their own systems that support parallel computations over large (multiple petabyte) data sets on clusters of computers. Google popularized the map-reduce programming model, largely taken from the map and reduce functions commonly used in a functional or mathematical style of programming.
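The map-reduce model can be sketched in a few lines of Python. This is a toy, single-process illustration under stated assumptions, not Google's or Hadoop's implementation: the helper names (map_phase, shuffle, reduce_phase, map_reduce) and the sample search log are made up for the example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(records, mapper):
    """Apply the mapper to every record; each call emits zero or more (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all emitted values by key (done across the network in a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to that key's list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def map_reduce(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


# Hypothetical search log: count how often each query appears.
search_log = ["scope", "pig latin", "scope", "sawzall", "scope"]
query_counts = map_reduce(
    search_log,
    mapper=lambda query: [(query, 1)],
    reducer=lambda query, ones: reduce(lambda a, b: a + b, ones),
)
print(query_counts)
```

Because the mapper looks at one record at a time and the reducer looks at one key at a time, both phases can be spread over many machines. That independence is what makes the model attractive for data sets too large for a single server.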

Yahoo also has a software stack designed for distributed processing of massive data sets. Users write applications in a language called Pig Latin, which is a dataflow language that uses a nested data model. A Pig Latin program is compiled by the Pig system into a sequence of MapReduce operators that are executed using Hadoop, an open-source implementation of MapReduce.

When Pigs Have Wings

According to the Microsoft paper, a MapReduce application written in C++ takes many more lines of code than the corresponding application expressed in Scope. The paper gives an example that requires 70 lines of C++ code but only six lines of Scope code.

Analysis of massive data sets is becoming increasingly valuable for businesses like Microsoft: it supports new features, helps improve service quality, and can reveal changes in patterns over time that point to fraudulent activity. Scope (Structured Computations Optimized for Parallel Execution) is a new language for large-scale data analysis under development at Microsoft. It intentionally builds on users' existing knowledge of relational data and SQL, with some simplifications that should make it easier for the company to take advantage of the parallel execution environment it continues to build.









 
 
