Big Data. Big Decisions
InformationWeek
Special Coverage Series


Microsoft Team Shrinks Big Data By Deleting It

Microsoft Research takes new approach to data compression, but so far it works only on Azure.

One of the big disadvantages of big data is that it requires big storage--potentially hundreds of petabytes, exabytes, or even zettabytes of storage.

Microsoft Research published a paper this week describing a more efficient way of cramming the 4 trillion objects stored in its cloud-based Windows Azure Storage into slightly less Windows Azure Storage.

The reason for doing this is simple enough, as stated in Microsoft's release: "Storing massive amounts of information in the cloud comes with costs, however--primarily the cost of storing all that digital data."

Those costs include the disk arrays; expansion disks; replacement arrays; extra bills for support and repair; additional climate-controlled data center space to house all the extra disks; real estate for the extra data center space; salaries and benefits for skilled technicians to hook up, manage, and expand all that storage; bandwidth to make it available to customers; programmers to write the software to make cloud-based disks useful--and of course, PR and marketing staffs to spread the news about all that storage space.

The costs add up quickly and multiply in line with the number of required disks. That means even a trivial reduction in the space required to store a specific object--whether that reduction comes from better compression, more consistent lifecycle management, or accidental but frequent deletions--can dramatically reduce costs.

Save Space by Deleting Stuff

With the goal of taking a big bite out of storage costs, Microsoft's team--culled from the Windows Azure division and Microsoft Research--built compression software that takes a different approach to storage management. So far, no other vendors have joined Microsoft in promoting deletion as an approach to mass storage. But that could change as the technology emerges from the research-and-development phase and builds a practical track record.

In a commercial cloud environment, Microsoft's Douglas Gantenbein points out, the space required to store a single file isn't equal to the amount of disk space in which that file will fit. Each file must be stored at least three times on different disk arrays or different servers in order to safeguard it against crashes or other disasters.

Traditional lossless data compression algorithms go through a file and take out sequences of bits that are statistically redundant, keeping logs of what they eliminated and where it was so everything can be restored later, when the file is uncompressed.
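
As a rough illustration of that round trip, the sketch below uses Python's standard zlib module on a deliberately repetitive sample string. The sample text is an assumption made up for the example; it is not how Azure compresses anything.

```python
import zlib

# Highly repetitive input, so there is plenty of statistical redundancy to remove.
original = b"big data needs big storage. " * 1000

# Compress at the highest level, then restore the original bytes.
compressed = zlib.compress(original, 9)
restored = zlib.decompress(compressed)

print(len(original), "bytes before,", len(compressed), "bytes after")
print("bit-for-bit identical:", restored == original)
```

Real-world files are rarely this repetitive, so the savings shown here are far larger than a typical file would see.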

"Lossy" data compression methods, such as those used in MP3 files, eliminate levels of detail in order to reduce the amount of data to be stored.

The new storage approach for Windows Azure--called "lazy erasure coding"--is similar to lossless compression in that data is removed, but a shortened, coded version of the data is created that allows it to be replaced later on. When a chunk of data is encoded, it is split into data segments, and the software computes additional parity segments from them; if a data segment is later lost or corrupted, it can be rebuilt from the surviving data and parity segments. Then all the data segments and parity segments are distributed to different physical locations so the loss of one won't mean the loss of all, and the original three copies are deleted. The result is a series of data chunks that can be reconstituted bit for bit, but that occupy half the space the three full copies did.
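
The idea of rebuilding a lost segment from the survivors can be sketched with the simplest possible erasure code: a single XOR parity fragment. This is a stand-in chosen for brevity, not the coding scheme Microsoft describes. The function names, the fragment count of 4, and the sample data are all assumptions made for the example.

```python
from __future__ import annotations
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int) -> tuple[list[bytes], bytes]:
    """Split data into k equal-size fragments plus one XOR parity fragment."""
    size = -(-len(data) // k)                 # ceiling division
    padded = data.ljust(size * k, b"\0")      # pad so every fragment is full
    fragments = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = reduce(xor_bytes, fragments)
    return fragments, parity

def rebuild(fragments: list[bytes | None], parity: bytes) -> list[bytes]:
    """Recover at most one missing fragment (marked None) from the survivors."""
    result = list(fragments)
    missing = [i for i, f in enumerate(result) if f is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost fragment")
    if missing:
        survivors = [f for f in result if f is not None]
        # XOR of the parity with every surviving fragment yields the lost one.
        result[missing[0]] = reduce(xor_bytes, survivors, parity)
    return result

data = b"Windows Azure Storage object"
fragments, parity = encode(data, k=4)

damaged = list(fragments)
damaged[2] = None                             # simulate losing one storage node
repaired = rebuild(damaged, parity)
recovered = b"".join(repaired)[:len(data)]    # strip any padding
print("recovered intact:", recovered == data)
```

With four data fragments and one parity fragment, this toy scheme stores 1.25 times the original size but survives only a single loss. Codes such as Reed-Solomon add more parity fragments so that several simultaneous losses can be repaired.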

The method is similar to Reed-Solomon coding, a technique invented in 1960 and used in the U.S. space program and in error correction for compact discs.

Looking for even more compression, the Microsoft group made a bet on the reliability of its hardware--reducing the number of parity fragments needed to reconstruct the data in order to make reconstruction faster and reduce the overall space required. This effort, which replaced Reed-Solomon codes with Local Reconstruction Codes, lowered the storage overhead even further, from a total of 1.5 (half the space of a triple-replicated file) to 1.29 (about 43 percent of that space).
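
The overhead figures quoted above are simple ratios: total fragments stored divided by data fragments. The sketch below works through that arithmetic. The fragment counts (6 data plus 3 parity, and 14 data plus 4 parity) are assumptions picked because they reproduce the 1.5 and 1.29 figures; the article itself does not state them.

```python
def storage_overhead(data_fragments: int, parity_fragments: int) -> float:
    """Bytes stored per byte of user data for an erasure-coded chunk."""
    return (data_fragments + parity_fragments) / data_fragments

REPLICATION = 3.0                 # three full copies of every file
rs = storage_overhead(6, 3)       # assumed Reed-Solomon layout
lrc = storage_overhead(14, 4)     # assumed Local Reconstruction Code layout

for name, overhead in [("3x replication", REPLICATION),
                       ("Reed-Solomon", rs),
                       ("Local Reconstruction Codes", lrc)]:
    share = overhead / REPLICATION
    print(f"{name}: overhead {overhead:.2f}, {share:.0%} of triple replication")
```

Under these assumed layouts, moving from 1.5 to 1.29 saves roughly 14 percent of the space the Reed-Solomon version would occupy.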

Microsoft techs described the technique in greater detail in a presentation at the 2012 Usenix Annual Technical Conference last June; it's also detailed in a white paper that can be viewed or downloaded here.

The new technique--in short, a modification of Reed-Solomon compression using the Microsoft-invented Local Reconstruction Codes to increase the compression even further--was designed for Azure but could find its way into other Microsoft products as well. It would be particularly appropriate for "flash appliances" that use several flash-memory drives as part of a single storage unit, or possibly for solid-state drives used in laptops and other portable devices for which weight and battery power are greater concerns than disk space.



