Commentary

Art Wittmann
 

Data Deduplication: Did You Say 20% Or 20 Times?

Single-instance storage is one of those concepts that is devilishly simple on its face, but intensely complex to implement correctly. That leads to different implementations -- and that leads to very different performance claims.

The issue of how one identifies duplicate content is a thorny one. Simple file-based systems just look for duplicate file instances: the system stores the first WIN.EXE, and when it encounters another copy of WIN.EXE, it creates a pointer to the stored copy instead of storing the file again. Block-based systems chop content up into fixed-size blocks, usually a few kilobytes in length, and then compare those blocks rather than whole files.
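To make the difference concrete, here is a minimal Python sketch of the two approaches just described. It is an illustration under stated assumptions, not any vendor's implementation: the function names, the 4 KB block size, the use of SHA-256 to spot duplicates, and the sample data are all hypothetical, and the sketch ignores the space taken by pointers and other metadata.

```python
import hashlib


def dedup_ratio_by_file(files):
    """File-level dedup: store each distinct file content once.

    `files` maps a (hypothetical) path to its bytes. Returns the ratio of
    logical bytes to stored bytes, ignoring pointer/metadata overhead.
    """
    logical = sum(len(data) for data in files.values())
    unique = {}  # content hash -> size of the single stored copy
    for data in files.values():
        unique[hashlib.sha256(data).digest()] = len(data)
    stored = sum(unique.values())
    return logical / stored if stored else 1.0


def dedup_ratio_by_block(files, block_size=4096):
    """Fixed-size block dedup: store each distinct block once."""
    logical = sum(len(data) for data in files.values())
    unique = {}  # block hash -> size of the single stored block
    for data in files.values():
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            unique[hashlib.sha256(block).digest()] = len(block)
    stored = sum(unique.values())
    return logical / stored if stored else 1.0


# Made-up sample: two identical copies of a 16 KB file, plus a third copy
# whose last 4 KB block has been changed.
original = b"".join(bytes([i]) * 4096 for i in range(4))
patched = original[:-4096] + b"\xff" * 4096
sample = {
    "pc1/WIN.EXE": original,
    "pc2/WIN.EXE": original,
    "pc3/WIN.EXE": patched,
}

print(f"file-level:  {dedup_ratio_by_file(sample):.2f}:1")
print(f"block-level: {dedup_ratio_by_block(sample):.2f}:1")
```

On this sample the file-level pass reports 1.50:1, because the patched copy counts as an entirely new file, while the block-level pass reports 2.40:1, because only the one changed block has to be stored again.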

The efficiency of block comparison is significantly different from the efficiency of file comparison. Add to that clever algorithms for deciding what block size to use, and you can imagine systems that yield wildly different data compression ratios. Data Domain and its ilk are the pioneers in this technology, and there seems to be some agreement that compression ratios of 10:1 or 20:1 are not unreasonable.



With all this in mind, you can imagine my surprise when, as I interviewed EMC's general manager of enterprise storage devices, Rich Napolitano, about its newly announced storage libraries, he claimed a modest 20% improvement in storage capacity. That number surprised me enough that I did a bit of a double take, asking Napolitano if he thought the competitors were stretching the truth. His response was, "We claim 20% or more."

EMC's deduplication technology came to it via its purchase of Avamar Technologies in late 2006. Despite Napolitano's apparent modesty, I felt compelled to check EMC's Web site. Well, so much for Napolitano's modest compression claims. The first "benefit" listed in EMC's online documentation is: "Reduce daily backup data up to 500x, backup times up to 10x, and total storage up to 50x." That 50-times reduction in total storage is an interesting claim. It almost certainly assumes that many large duplicate files are being stored -- perhaps a number of large database snapshots?

But regardless of how the various factions at EMC came up with their numbers, whether a 20% reduction or the 98% reduction that a 50-times claim implies, the notion of one's mileage varying has never been more true.
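The two ways of quoting savings are easy to convert between, and doing the conversion shows how far apart the claims really are. The sketch below is a small illustration, with hypothetical function names, and it assumes the 20% figure means 20% less space consumed.

```python
def ratio_to_percent_saved(ratio):
    """Convert an N:1 (or "N times") reduction into percent of space saved."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return (1 - 1 / ratio) * 100


def percent_saved_to_ratio(percent_saved):
    """Convert percent of space saved into the equivalent N:1 ratio."""
    if not 0 <= percent_saved < 100:
        raise ValueError("percent_saved must be in [0, 100)")
    return 1 / (1 - percent_saved / 100)


# Figures quoted in this column; the 20% reading is an assumption (see above).
for ratio in (10, 20, 50):
    print(f"{ratio}:1 saves {ratio_to_percent_saved(ratio):.0f}% of the space")
print(f"20% saved is a ratio of {percent_saved_to_ratio(20):.2f}:1")
```

Put on the same scale, a 20% saving is only 1.25:1, while 10:1, 20:1 and 50:1 correspond to 90%, 95% and 98% saved.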

As always, don't take the vendors' claims at face value; try the technology out on your own data first. Data deduplication is an amazingly useful thing, and at some point it will be pervasive in most storage systems, but your mileage most certainly will vary.

