Data deduplication is the latest reduction method to move from secondary storage applications to primary storage systems. There are good reasons for the move, although primary storage isn't always an easy fit for data reduction technology. Large enterprises' primary storage performance requirements are more stringent, especially when it comes to I/O response times and latency. Primary storage systems also have to meet substantially higher availability and reliability standards than backup stores. This makes them leaner, less-target-rich environments for data reduction, but they're also three to 10 times more expensive on a per-gigabyte basis than backup repositories. Small storage reductions can save significantly on space, power, and cooling. There's also a real possibility of performance boosts.
Vendors ranging from enterprise network-attached storage leader NetApp to startups like Ocarina Networks are readying data deduplicating tools to optimize primary storage capacity. With a range of options coming online in the next year or so, from software upgrades to complete NAS systems, now is the time to investigate deduping your primary storage. But keep in mind that your data reduction ratios may be closer to 2-to-1 than 20-to-1. We recommend using conservative data reduction ratios when developing budgets and ROI calculations.
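To see why conservative ratios matter in a budget, it helps to turn a reduction ratio into the share of capacity it frees. The sketch below is illustrative arithmetic only; the function names and the 100-TB data set are assumptions, not figures from any vendor.

```python
def fraction_saved(ratio: float) -> float:
    """Share of capacity freed by a reduction ratio of `ratio`-to-1."""
    if ratio < 1:
        raise ValueError("a reduction ratio must be at least 1")
    return 1.0 - 1.0 / ratio


def capacity_needed_tb(logical_tb: float, ratio: float) -> float:
    """Physical terabytes needed to hold `logical_tb` of data after reduction."""
    return logical_tb / ratio


# Hypothetical 100-TB data set at several assumed ratios.
for ratio in (1.5, 2, 5, 20):
    print(f"{ratio}-to-1: {fraction_saved(ratio):.0%} saved, "
          f"{capacity_needed_tb(100, ratio):.1f} TB needed")
```

A 2-to-1 ratio already frees half the space, while 20-to-1 frees 95 percent, so a budget built on the lower figure still captures most of the achievable savings.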
Space And Time
Deduplication systems find and eliminate duplicate data by dividing files into chunks and looking for chunks that have the same data. The major difference among them is how they chunk data. The simplest method is to use fixed-size chunks, such as disk blocks.
NetApp's Write Anywhere File Layout, or WAFL, builds files as lists of blocks. WAFL calculates a checksum of each block and stores it with the data. Deduplication then runs either on a schedule or when triggered by an event, such as a disk reaching a utilization threshold. Blocks that have the same checksum are compared to see if they contain the same data. If they do, WAFL deletes one block and modifies the metadata of the file where the block resided so that it points to the remaining copy. NetApp's deduplication will be released as a free software upgrade later this year.
NetApp's approach opts for low overhead instead of high data reduction ratios, so its performance impact should be minimal for the vast majority of applications.
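The fixed-size approach described above, matching on a checksum and then comparing the data itself, can be sketched in a few lines. This is a minimal illustration, not NetApp's implementation: the 4,096-byte block size, the Adler-32 checksum, and the function names are all assumptions made for the example.

```python
import zlib
from collections import defaultdict

BLOCK_SIZE = 4096  # assumed block size, chosen only for illustration


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def dedupe_blocks(blocks: list[bytes]) -> tuple[list[bytes], list[int]]:
    """Return (unique block store, one pointer into the store per input block)."""
    store: list[bytes] = []
    by_checksum: dict[int, list[int]] = defaultdict(list)
    pointers: list[int] = []
    for block in blocks:
        checksum = zlib.adler32(block)
        for idx in by_checksum[checksum]:
            # A matching checksum is only a hint; verify byte for byte.
            if store[idx] == block:
                pointers.append(idx)
                break
        else:
            # No verified match: keep the block and index its checksum.
            store.append(block)
            by_checksum[checksum].append(len(store) - 1)
            pointers.append(len(store) - 1)
    return store, pointers


data = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"A" * BLOCK_SIZE
store, pointers = dedupe_blocks(split_blocks(data))
print(len(store), pointers)
```

The pointer list plays the role of the file's block metadata: the third block is not stored again but simply refers to the first.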
Deduplication using variable block sizes is more complicated, but it can identify duplicate data even when that data sits at a different offset in the body of files saved elsewhere. GreenBytes' ZFS+ adds real-time variable-block-size deduplication to Sun's open source ZFS file system. GreenBytes' Cypress NAS appliance, based on Sun's X4540 storage server, uses variable block sizes to deliver 800-MBps performance -- in part through clever use of flash SSDs to store hash lookup tables and logs. GreenBytes' appliances, priced at $100,000 for 46 TB of raw space, are set for release this summer.
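One common way to get variable-size chunks is content-defined chunking, where a rolling hash over the last few bytes decides where each chunk ends. The sketch below illustrates that general technique; it is not GreenBytes' algorithm, and the window size, hash constants, and function name are assumptions chosen for the example.

```python
import random

WINDOW = 16           # bytes covered by the rolling hash (assumed)
BASE = 257            # polynomial base for the hash (assumed)
MOD = (1 << 31) - 1   # modulus for the hash (assumed)
MASK = 0x3F           # cut when the low 6 bits are zero: ~64-byte average chunks


def content_defined_chunks(data: bytes) -> list[bytes]:
    """Split data where a rolling hash of the last WINDOW bytes hits a pattern."""
    chunks: list[bytes] = []
    start = 0
    h = 0
    top = pow(BASE, WINDOW - 1, MOD)
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Drop the byte that just slid out of the window.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        # Enforce a minimum chunk length of WINDOW bytes before cutting.
        if i + 1 - start >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(20000))
shifted = b"X" + original  # the same data pushed one byte later

a = set(content_defined_chunks(original))
b = set(content_defined_chunks(shifted))
print(f"{len(a & b)} of {len(a)} chunks still match after the one-byte shift")
```

Because the cut points depend on the content rather than on absolute position, most chunks line up again shortly after the inserted byte. With fixed-size blocks, the same one-byte shift would move every block boundary and leave almost nothing to match.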
Deduplicating frequently accessed data, such as virtual machine images, changes the disk access pattern from reads spread across a volume to repeated accesses of the one deduplicated copy. If the file server has sufficient cache, this replaces many disk I/Os with cache reads. Both NetApp and GreenBytes offer extended read cache options, with GreenBytes offering up to 600 GB of flash cache.
Where the other dedupe schemes look at files as sets of bits, Ocarina's Online Storage Optimization Solution takes another approach, recognizing common file types and using different techniques to space-optimize each. Ocarina breaks complex documents like ZIP files or PowerPoint presentations into their component objects. For example, a PowerPoint slide might be deconstructed into a text block, background, logo, photo, and graph, each of which is separately deduped and compressed with algorithms optimized for its data type. An optimizer replaces the files with a series of links to their constituent deduplicated objects. A reader sits between the user and filer, and reassembles data as users access it.
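The idea of breaking a container file into its component objects and storing each object once can be illustrated with ordinary ZIP archives. This is a simplified sketch of the general concept, not Ocarina's product: the helper names, the SHA-256 digests used as object identifiers, and the sample file names are all assumptions.

```python
import hashlib
import io
import zipfile


def make_zip(members: dict[str, bytes]) -> bytes:
    """Build an in-memory ZIP archive from a name -> content mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, content in members.items():
            zf.writestr(name, content)
    return buf.getvalue()


def decompose(archive: bytes, store: dict[str, bytes]) -> list[tuple[str, str]]:
    """Store each member's content once; return (name, digest) links."""
    links: list[tuple[str, str]] = []
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        for name in zf.namelist():
            content = zf.read(name)  # decompressed member bytes
            digest = hashlib.sha256(content).hexdigest()
            store.setdefault(digest, content)  # keep only the first copy
            links.append((name, digest))
    return links


logo = b"logo-bytes" * 1000  # stand-in for an image shared by two files
deck_a = make_zip({"logo.png": logo, "slide1.xml": b"<slide>Q1</slide>"})
deck_b = make_zip({"logo.png": logo, "slide1.xml": b"<slide>Q2</slide>"})

store: dict[str, bytes] = {}
links_a = decompose(deck_a, store)
links_b = decompose(deck_b, store)
print(f"{len(links_a) + len(links_b)} objects referenced, {len(store)} stored")
```

Each archive is reduced to a list of links, and the shared logo is kept only once. A reader component would follow those links to rebuild the original file on access.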
Ocarina's road map calls for original equipment manufacturers to integrate its technology into NAS systems. Several large OEMs have committed to the project but none has gone public yet.
Howard Marks is chief scientist at Networks Are Our Lives, specializing in data storage, management, and protection. Write to us at iweekletters@techweb.com.
Know Your Options
There's more than one way to reduce the amount of space that files occupy on disk. Some data reduction technologies are built into the file system or operating system of a NAS appliance, while others are appliances that can be added to existing filers. These approaches operate either in real time or post-process. Real-time reduction needs the least disk space because it compresses and/or dedupes data as it's written to the share, but it's compute-intensive and can crimp performance. Post-process data reduction happens after data is written to disk. This approach requires enough space to hold both the inflated and deflated versions of the files, but it can be done during off hours, when it's less likely to affect user response time.