These processes purposely make multiple copies of data to protect the organization from data loss or corruption. The problem is that more data means more copies, and there are now more opportunities than ever to make one. The result is copy sprawl. Copy management has to become a key priority for data centers in the coming year.
What Is Copy Sprawl?
Copy sprawl is the proliferation of copies of an original piece of data. Let's use a database as an example. A form of copy is made from the start, when the database is placed on RAID-protected or mirrored storage. Before making any major change to the database, the database administrator typically makes a full copy of it.
These copies are rarely cleaned up. The database administrator also likely makes separate nightly backup copies of the database, outside the scope of the backup application. The database administrator might also replicate the data to another server locally or at a disaster recovery site.
Then the storage manager typically makes snapshot copies of the database and sometimes additional full copies. Although snapshot copies do not take physical storage capacity until changes occur, those changes do happen, and capacity consumption ensues. In either case the storage manager has to maintain and be aware of these copies and have a process to clean them up.
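The way a snapshot grows can be sketched with simple arithmetic. The function below is a rough illustration, not a vendor formula: it assumes a hypothetical daily change rate and assumes that each day's changes touch blocks that have not changed before, which is the worst case for capacity.

```python
def snapshot_capacity_gb(volume_gb, daily_change_rate, days):
    """Estimate the capacity held by one snapshot after `days` days.

    Assumption (worst case): every day a fresh `daily_change_rate`
    fraction of the volume changes, so the snapshot must preserve that
    many original blocks. Consumption can never exceed the volume size.
    """
    preserved = volume_gb * daily_change_rate * days
    return min(volume_gb, preserved)


# Hypothetical example: a 1,000 GB volume with a 2% daily change rate.
for day in (0, 10, 30):
    print(day, snapshot_capacity_gb(1000, 0.02, day))
```

A snapshot that costs nothing on the day it is taken can hold a meaningful share of the volume a month later, which is why someone has to own the cleanup.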
Typically, the storage manager will also make sure that mission-critical data like this is replicated off-site. This means another copy of the database at the disaster recovery site. It also typically means that all the copies the database administrator made are replicated to the remote site as well.
Finally, the backup administrator protects the data with a backup application. In modern backup architectures this probably means a copy to disk, a copy to tape (or another disk), and a copy that is either replicated or transported off-site or both.
If you add up all those copies, you can see that copy data, rather than the original data, is the bigger problem when it comes to dealing with storage growth. You can also see that these copies land on every tier of storage, not just the secondary tier. The problem goes well beyond the database example described above; we see the same problem with user productivity data. In fact, with this data the versioning is typically worse than it is with other data.
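To make the adding-up concrete, here is a small tally of the database example. The database size and every count are made-up, illustrative assumptions; substitute figures from your own environment.

```python
# Hypothetical size of the original database, in TB.
DB_SIZE_TB = 1.0

# (who made the copy and why, assumed number of full-size copies)
# All counts are illustrative assumptions, not measurements.
copies = [
    ("mirror of the primary volume", 1),
    ("DBA pre-change full copies never cleaned up", 3),
    ("DBA nightly backups retained", 7),
    ("DBA replica on another server", 1),
    ("storage manager full copies", 1),
    ("off-site replica of the primary volume", 1),
    ("backup application: copy to disk", 1),
    ("backup application: copy to tape", 1),
    ("backup application: off-site copy", 1),
]


def total_copies(entries):
    """Number of full-size copies across all sources."""
    return sum(count for _, count in entries)


def total_capacity_tb(entries, size_tb):
    """Capacity consumed if every copy is stored at full size."""
    return total_copies(entries) * size_tb


print(total_copies(copies), "copies,",
      total_capacity_tb(copies, DB_SIZE_TB), "TB of copy data")
```

The tally ignores snapshots and the replication of the DBA's own copies to the remote site, so in this example it understates the real total.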
Copies Weaken Protection
You might think that all these copies of data would protect you from almost any kind of data loss. The reality is they don't. There is no communication between the makers of the copies and, as a result, no correlation among the copies themselves. In other words, if there is a data failure, no one knows which copy of the lost file should be recovered. We have seen and heard of countless cases of the wrong version of a file being recovered and days' or weeks' worth of work being lost.
The very first step is to make sure that your next primary-storage solution uses deduplication, preferably inline. That at least eliminates much of the capacity consumption problem immediately. If primary storage deduplication is not in the cards, then look for a backup deduplication appliance that can accept data from multiple sources. Make sure that all the copy sources send data to this appliance instead of to their original tier. This means the deduplication appliance would have to accept inputs from applications and even the file system itself.
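To show why deduplication takes the sting out of copy sprawl, here is a minimal sketch of a content-addressed store using fixed-size chunks and SHA-256. The `DedupStore` class and its method names are hypothetical and written only for illustration; commercial products typically use more sophisticated techniques such as variable-length chunking.

```python
import hashlib


class DedupStore:
    """Toy deduplicating store: each unique chunk is kept only once."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}  # SHA-256 digest -> chunk bytes
        self.files = {}   # file name -> ordered list of chunk digests

    def write(self, name, data):
        """Split data into fixed-size chunks and store only new ones."""
        recipe = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # no-op if already stored
            recipe.append(digest)
        self.files[name] = recipe

    def read(self, name):
        """Reassemble a file from its chunk recipe."""
        return b"".join(self.chunks[d] for d in self.files[name])

    def logical_bytes(self):
        """Bytes the copy makers believe they have stored."""
        return sum(len(self.chunks[d])
                   for recipe in self.files.values() for d in recipe)

    def physical_bytes(self):
        """Bytes actually consumed after deduplication."""
        return sum(len(chunk) for chunk in self.chunks.values())


store = DedupStore(chunk_size=4)
store.write("db_monday", b"AAAABBBBCCCC")
store.write("db_tuesday", b"AAAABBBBDDDD")  # only the last chunk differs
print(store.logical_bytes(), store.physical_bytes())
```

The second copy costs only the one chunk that changed, which is the effect you want when several copy sources all land on the same appliance.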
The next step is to look at your data protection process as a whole and look for ways to keep some of these copies from being made in the first place. Maybe it's time for a better application that can manage and organize all this redundant data. Have one application that creates copies and manages their placement throughout the enterprise. For the copies that are legitimately needed, the organization should also document the order in which they should be recovered, so that in the event of data loss IT personnel have a prioritized list of which copy to pull.
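A documented recovery order can be as simple as a small catalog that every copy maker writes to. The sketch below is one possible shape, under assumptions: the `CopyRecord` fields, the priority numbers, and the locations are all hypothetical. It lists the copies taken before the failure, ordered by documented priority and then by recency.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CopyRecord:
    location: str       # where the copy lives (hypothetical labels)
    taken_at: datetime  # when the copy was made
    priority: int       # documented policy: lower number is tried first


def recovery_order(copies, failure_time):
    """Return usable copies, best candidate first.

    A copy is usable only if it was taken at or before the failure.
    Ties on priority are broken by picking the most recent copy.
    """
    usable = [c for c in copies if c.taken_at <= failure_time]
    return sorted(usable,
                  key=lambda c: (c.priority, failure_time - c.taken_at))


catalog = [
    CopyRecord("backup-disk", datetime(2024, 1, 9, 23, 0), priority=2),
    CopyRecord("array-snapshot", datetime(2024, 1, 10, 6, 0), priority=1),
    CopyRecord("array-snapshot", datetime(2024, 1, 10, 12, 0), priority=1),
    CopyRecord("dr-site-replica", datetime(2024, 1, 10, 14, 0), priority=3),
]
failure = datetime(2024, 1, 10, 13, 0)
for record in recovery_order(catalog, failure):
    print(record.priority, record.location, record.taken_at)
```

In this example the replica taken after the failure is excluded, since it may already contain the damaged data.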
The final step is to look for a software tool that can monitor your environment for multiple copies or similar copies of data. Ideally this tool would allow you to then move the redundant data to a high-capacity archive or the cloud or simply delete it.
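As a rough idea of what such a tool does at its core, the sketch below walks a directory tree and groups files whose contents are byte-for-byte identical, using SHA-256 digests. It is a minimal illustration with hypothetical function names; it finds only exact duplicates, not "similar" copies, and a real product would also need to cover databases, snapshots, and remote sites.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, block_size=1 << 20):
    """SHA-256 of a file's contents, read in blocks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()


def find_duplicate_files(root):
    """Map content digest -> sorted list of paths, for digests seen 2+ times."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so one file is not counted twice.
            if os.path.isfile(path) and not os.path.islink(path):
                by_digest[file_digest(path)].append(path)
    return {d: sorted(paths)
            for d, paths in by_digest.items() if len(paths) > 1}


if __name__ == "__main__":
    # Report duplicate groups under the current directory.
    for digest, paths in find_duplicate_files(".").items():
        print(digest[:12], paths)
```

The output is a candidate list only; deciding whether a duplicate should be archived, moved to the cloud, or deleted still belongs to whoever owns the data.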