The InformationWeek -- Blogs
InformationWeek's Analytics Weblog

Topics:   Analytics


Data Deduplication: Did You Say 20% Or 20 Times?


Posted by Art Wittmann, May 19, 2008 07:51 PM

Single-instance storage is one of those concepts that is devilishly simple on its face, but intensely complex to implement correctly. That leads to different implementations -- and that leads to very different performance claims.


The issue of how one identifies duplicate content is a thorny one. Simple file-based systems just look for duplicate file instances. The system stores the first WIN.EXE, then when it encounters another copy of WIN.EXE, it stores only a pointer to the first. Block-based systems chop content up into fixed-size blocks, usually a few kilobytes in length, and then compare those blocks rather than whole files.
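To make the block-based idea concrete, here is a minimal sketch in Python. It is not any vendor's implementation. The function names, the 4 KB block size, and the use of SHA-256 to recognize identical blocks are all assumptions chosen for illustration.

```python
import hashlib


def dedup_blocks(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks and keep each unique block once.

    Returns (store, pointers): `store` maps a block's digest to its bytes,
    and `pointers` lists one digest per logical block, in order.
    """
    store = {}
    pointers = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        # Only the first occurrence of a block is stored; later
        # occurrences just add another pointer to it.
        store.setdefault(digest, block)
        pointers.append(digest)
    return store, pointers


def restore(store, pointers) -> bytes:
    """Rebuild the original data by following the pointers."""
    return b"".join(store[d] for d in pointers)


def dedup_ratio(data: bytes, store) -> float:
    """Logical bytes divided by bytes actually stored."""
    stored = sum(len(b) for b in store.values())
    return len(data) / stored if stored else 1.0


if __name__ == "__main__":
    # Ten identical 4 KB blocks followed by one different block.
    sample = b"A" * 4096 * 10 + b"B" * 4096
    store, pointers = dedup_blocks(sample)
    assert restore(store, pointers) == sample
    print(len(pointers), "logical blocks,", len(store), "stored")
    print("ratio:", dedup_ratio(sample, store))
```

The sketch ignores the space taken by the pointers themselves and by the index, which a real system would have to count against its savings.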

The efficiency of block comparison is significantly different from the efficiency of file comparison. Add to that clever algorithms to decide what block size to use and you can imagine systems that yield wildly different data compression ratios. Data Domain and its ilk are the pioneers in this technology, and there seems to be some agreement that compression ratios of 10:1 or 20:1 are not unreasonable.

With all this in mind, you can imagine my surprise when, as I interviewed EMC's general manager of enterprise storage devices, Rich Napolitano, about its newly announced storage libraries, he claimed a modest 20% improvement in storage capacity. That number surprised me enough that I did a bit of a double take, asking Napolitano if he thought the competitors were stretching the truth. His response was, "We claim 20% or more."

EMC's deduplication technology came to it via its purchase of Avamar Technologies in late 2006. Despite Napolitano's apparent modesty, I felt compelled to check EMC's Web site. Well, so much for Napolitano's modest compression claims. The first "benefit" listed in EMC's online documentation is: "Reduce daily backup data up to 500x, backup times up to 10x, and total storage up to 50x." That 50-times reduction in total storage is an interesting claim. It almost certainly assumes that many large duplicate files are being stored -- perhaps a number of large database snapshots?

But regardless of how the various factions at EMC came up with their numbers, whether a 20% reduction or the 98% reduction that a 50-times claim works out to, the notion of one's mileage varying has never been more true.
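The gap between those figures is easier to see when percentages and "times" are put on the same scale. The short Python sketch below does the conversion. The function names are made up for illustration, and it assumes the 20% figure means a 20% reduction in stored data, which is how it is read here, not a 20% gain in usable capacity.

```python
def ratio_to_percent_saved(ratio: float) -> float:
    """An N-times (N:1) reduction expressed as percent of storage saved."""
    return (1 - 1 / ratio) * 100


def percent_saved_to_ratio(percent: float) -> float:
    """A percent reduction expressed as an N:1 ratio."""
    return 1 / (1 - percent / 100)


if __name__ == "__main__":
    for ratio in (10, 20, 50):
        print(f"{ratio}:1 saves about {ratio_to_percent_saved(ratio):.0f}%")
    print(f"20% saved is about {percent_saved_to_ratio(20):.2f}:1")
```

On this scale a 20% reduction is roughly 1.25:1, while 10:1, 20:1, and 50:1 correspond to roughly 90%, 95%, and 98% reductions.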

As always, don't believe the vendors' claims; try it out on your data first. Data deduplication technology is an amazingly useful thing and at some point it will be pervasive in most storage systems, but your mileage most certainly will vary.












 
 
