Commentary
Data Deduplication: Did You Say 20% Or 20 Times?
Single-instance storage is one of those concepts that is devilishly simple on its face, but intensely complex to implement correctly. That leads to different implementations -- and that leads to very different performance claims.

The issue of how one identifies duplicate content is a thorny one. Simple file-based systems just look for duplicate file instances: the system stores the first WIN.EXE, and when it encounters another copy of WIN.EXE, it creates a pointer to the stored instance instead. Block-based systems chop content up into fixed-size blocks, usually a few kilobytes in length, and then compare those blocks rather than whole files.
The efficiency of block comparison is significantly different from that of file comparison. Add to that clever algorithms for deciding what block size to use, and you can imagine systems that yield wildly different data compression ratios. Data Domain and its ilk are the pioneers in this technology, and there seems to be some agreement that compression ratios of 10:1 or 20:1 are not unreasonable.
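To make the file-versus-block difference concrete, here is a minimal Python sketch. It is an illustration only, not any vendor's implementation: the 4 KB block size, the use of SHA-256 to identify duplicates, and the helper names (`file_level_stored_bytes`, `block_level_stored_bytes`, `make_block`) are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumption: "a few kilobytes" taken as 4 KB


def file_level_stored_bytes(files):
    """Bytes stored when only exact whole-file duplicates are collapsed."""
    seen = set()
    stored = 0
    for data in files:
        digest = hashlib.sha256(data).digest()
        if digest not in seen:  # first instance: store it
            seen.add(digest)
            stored += len(data)
        # later instances would cost only a pointer, ignored here
    return stored


def block_level_stored_bytes(files, block_size=BLOCK_SIZE):
    """Bytes stored when duplicate fixed-size blocks are collapsed."""
    seen = set()
    stored = 0
    for data in files:
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            digest = hashlib.sha256(block).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(block)
    return stored


def make_block(tag):
    """Hypothetical test data: one block filled with a single byte value."""
    return bytes([tag]) * BLOCK_SIZE


# Two identical copies of a file, plus a third copy whose last block changed.
original = b"".join(make_block(i) for i in range(8))
modified = original[:-BLOCK_SIZE] + make_block(99)
files = [original, original, modified]

raw = sum(len(f) for f in files)
file_level = file_level_stored_bytes(files)
block_level = block_level_stored_bytes(files)

print(f"raw: {raw}  file-level: {file_level}  block-level: {block_level}")
```

On this contrived data the file-level pass has to store the modified copy in full, while the block-level pass stores only the one block that changed. Real data will behave differently, which is the point of the column.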
With all this in mind, you can imagine my surprise when, as I interviewed EMC's general manager of enterprise storage devices, Rich Napolitano, about its newly announced storage libraries, he claimed a modest 20% improvement in storage capacity. That number surprised me enough that I did a bit of a double take, asking Napolitano if he thought the competitors were stretching the truth. His response was, "We claim 20% or more."
EMC's deduplication technology came to it via its purchase of Avamar Technologies in late 2006. Despite Napolitano's apparent modesty, I felt compelled to check EMC's Web site. Well, so much for Napolitano's modest compression claims. The first "benefit" listed in EMC's online documentation is: "Reduce daily backup data up to 500x, backup times up to 10x, and total storage up to 50x." That 50-times reduction in total storage is an interesting claim. It almost certainly assumes that many large duplicate files are being stored -- perhaps a number of large database snapshots?
But regardless of how the various factions at EMC came up with their numbers, 20% reduction or 98%, the notion of one's mileage varying has never been more true.
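The gap between those figures is easier to see once ratios and percentages are put on the same scale. The short Python sketch below does the conversion. The function names are hypothetical, and it assumes the 20% figure means a 20% reduction in stored data, as the sentence above reads it.

```python
def reduction_percent(ratio):
    """Percent of storage saved for an N:1 (or "N times") reduction ratio."""
    return (1 - 1 / ratio) * 100


def ratio_from_percent(percent):
    """Equivalent N:1 ratio for a given percent reduction in stored data."""
    return 1 / (1 - percent / 100)


# Ratios mentioned in the column: 10:1, 20:1, and EMC's "up to 50x".
for ratio in (10, 20, 50):
    print(f"{ratio}:1 -> {reduction_percent(ratio):.0f}% reduction")

# The interview figure, read as a 20% reduction in stored data.
print(f"20% reduction -> {ratio_from_percent(20):.2f}:1")
```

Read this way, a 20% reduction works out to only 1.25:1, while 50:1 corresponds to the 98% figure.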
As always, don't believe the vendors' claims; try it out on your data first. Data deduplication technology is an amazingly useful thing and at some point it will be pervasive in most storage systems, but your mileage most certainly will vary.