Data Deduplication: Did You Say 20% Or 20 Times?
Single-instance storage is one of those concepts that is devilishly simple on its face, but intensely complex to implement correctly. That leads to different implementations -- and that leads to very different performance claims.
May 19, 2008
The issue of how one identifies duplicate content is a thorny one. Simple file-based systems just look for duplicate file instances. The system stores the first WIN.EXE; when it encounters another copy of WIN.EXE, it stores a pointer to the first instead of the data. Block-based systems chop content up into fixed-size blocks, usually a few kilobytes in length, and then compare those blocks rather than whole files.
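To make the file-versus-block distinction concrete, here is a minimal sketch in Python. It assumes SHA-256 fingerprints and 4 KB fixed-size blocks, and the function names are hypothetical; it does not describe how any vendor's product works.

```python
import hashlib


def file_level_stored_bytes(files):
    """Bytes kept when only exact whole-file duplicates are collapsed."""
    seen = set()
    stored = 0
    for data in files:
        digest = hashlib.sha256(data).digest()
        if digest not in seen:  # first time this exact file is seen
            seen.add(digest)
            stored += len(data)
    return stored


def block_level_stored_bytes(files, block_size=4096):
    """Bytes kept when every fixed-size block is fingerprinted separately.

    block_size is an assumed value chosen for illustration.
    """
    seen = set()
    stored = 0
    for data in files:
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            digest = hashlib.sha256(block).digest()
            if digest not in seen:  # only new blocks consume space
                seen.add(digest)
                stored += len(block)
    return stored


# Three 16 KB "files": an original, an exact copy, and a copy whose
# last 4 KB block has been changed.
base = b"A" * 4096 + b"B" * 4096 + b"C" * 4096 + b"D" * 4096
copy = base
edited = base[:-4096] + b"E" * 4096
files = [base, copy, edited]

raw = sum(len(f) for f in files)
print("raw bytes:        ", raw)
print("file-level stored:", file_level_stored_bytes(files))
print("block-level stored:", block_level_stored_bytes(files))
```

On this toy data the file-level pass can only drop the exact copy, while the block-level pass also reuses the three unchanged blocks of the edited file. The same input therefore yields two different reduction figures.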
The efficiency of block comparison is significantly different from the efficiency of file comparison. Add to that clever algorithms for deciding what block size to use, and you can imagine systems that yield wildly different data compression ratios. Data Domain and its ilk are the pioneers in this technology, and there seems to be some agreement that compression ratios of 10:1 or 20:1 are not unreasonable.
With all this in mind, you can imagine my surprise when, as I interviewed EMC's general manager of enterprise storage devices, Rich Napolitano, about its newly announced storage libraries, he claimed a modest 20% improvement in storage capacity. That number surprised me enough that I did a bit of a double take, asking Napolitano if he thought the competitors were stretching the truth. His response was, "We claim 20% or more."
EMC's deduplication technology came to it via its purchase of Avamar Technologies in late 2006. Despite Napolitano's apparent modesty, I felt compelled to check EMC's Web site. Well, so much for Napolitano's modest compression claims. The first "benefit" listed in EMC's online documentation is: "Reduce daily backup data up to 500x, backup times up to 10x, and total storage up to 50x."
That 50-times reduction in total storage is an interesting claim. It almost certainly assumes that many large duplicate files are being stored -- perhaps a number of large database snapshots? But regardless of how the various factions at EMC came up with their numbers, whether a 20% reduction or a 98% one, the notion of one's mileage varying has never been more true.
As always, don't believe the vendors' claims; try it out on your own data first. Data deduplication technology is an amazingly useful thing, and at some point it will be pervasive in most storage systems, but your mileage most certainly will vary.
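The gap between the figures quoted above is easier to see once they are on a common scale. The short sketch below converts between an N:1 reduction ratio and the percentage of space saved. The helper names are hypothetical, and it assumes the "20%" figure means 20% less space consumed, which is how the 20%-versus-98% comparison above reads it.

```python
def ratio_to_percent_saved(ratio):
    """Convert an N:1 reduction ratio into percent of space saved."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    return (1 - 1 / ratio) * 100


def percent_saved_to_ratio(percent):
    """Convert percent of space saved into an N:1 reduction ratio."""
    if not 0 <= percent < 100:
        raise ValueError("percent must be in the range [0, 100)")
    return 1 / (1 - percent / 100)


# Ratios mentioned in the article: 10:1, 20:1, and "up to 50x".
for ratio in (10, 20, 50):
    print(f"{ratio}:1 saves {ratio_to_percent_saved(ratio):.1f}% of the space")

# The interview figure, read as a 20% reduction.
print(f"20% saved is a {percent_saved_to_ratio(20):.2f}:1 ratio")
```

By this arithmetic, 10:1, 20:1, and 50:1 work out to 90%, 95%, and 98% of the space saved, while a 20% saving is only a 1.25:1 ratio.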