Comply Or Die: Data Disposition Must Be A Priority
IT groups rethinking the "save everything forever" approach find deletion and retention policies and tools must be razor sharp to cut through a morass of regulations.
Before you can chuck a piece of information, you have to know what it is. That makes indexing and classification technologies key, and it's where CVR's Brooks is starting. The company bought Autonomy's Intelligent Data Operating Layer, or Idol, a software platform for enterprise search and classification, and centralized its storage around 10 geographically dispersed storage area networks. The platform uses connectors to tap into the SANs and index the content stored there.
Brooks started with a backlog of unindexed information stored in the SANs, including 1.9 million e-mails and 600,000 documents. It took about 10 days to create a searchable index of those data stores, and now the Idol engine keeps up with new data that gets moved into the storage networks.
It sounds great, but the dark side of indexing is that it adds to your overall data store. In fact, Brooks' team initially failed to properly size the database for the index because the team didn't anticipate just how large it would be. Autonomy says a typical Idol index runs 20% to 25% of the total data store, depending on the level of indexing, from basic metadata to cataloging the full contents of a file.
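The sizing problem Brooks' team ran into comes down to simple arithmetic. As a rough sketch, the function below applies Autonomy's quoted 20% to 25% range to a data store of a given size. The function name and the 10,000-GB example are illustrative assumptions, not figures from CVR, and actual index size depends on the level of indexing chosen.

```python
def estimate_index_size_gb(data_store_gb, low_ratio=0.20, high_ratio=0.25):
    """Return a (low, high) estimate of index size in gigabytes.

    The default ratios are the 20%-25% range Autonomy quotes for a
    typical Idol index; the real figure varies with indexing depth.
    """
    if data_store_gb < 0:
        raise ValueError("data store size cannot be negative")
    return data_store_gb * low_ratio, data_store_gb * high_ratio


# Hypothetical example: a 10,000-GB data store (not CVR's actual size).
low, high = estimate_index_size_gb(10_000)
print(f"Plan for roughly {low:,.0f} to {high:,.0f} GB of index storage")
```

Sizing the index database at the top of the range, plus headroom for growth, would be the conservative choice.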
The next step is to categorize all this information for retention and disposition. CVR is still working through its disposition policy, though Brooks expects it to be in place by the first quarter of 2009. "Our objective is to take out the human element," he says. "Two people can look at the same document and categorize it differently. Any time there's human intervention, courts can question your consistency." By automating the process, he hopes to avoid dispute on the final disposition of a file.
Brooks' team is working with various company departments, including legal and accounting, as well as business units on a policy that will designate different information categories to meet all the requirements for retention. Once the policy is in place, the Idol engine will assign data to the most appropriate category. "If it goes into a folder that has policies for financial documents, in seven years it will get disposed of," Brooks says. "If a document is environmental, that's lifetime storage."
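A category-driven policy like the one Brooks describes might be sketched as a simple lookup from category to retention period. This is a minimal illustration, not how Idol implements it: the `RETENTION_YEARS` table and `disposal_date` function are hypothetical, and only the two categories Brooks mentions are included.

```python
from datetime import date
from typing import Optional

# Hypothetical schedule built from the two examples Brooks gives.
# None means lifetime storage: the record is never disposed of.
RETENTION_YEARS = {
    "financial": 7,
    "environmental": None,
}


def disposal_date(category: str, created: date) -> Optional[date]:
    """Return the date a record becomes eligible for disposal, or None.

    An unknown category raises KeyError, so nothing is scheduled for
    disposal unless a policy explicitly covers it.
    """
    years = RETENTION_YEARS[category]
    if years is None:
        return None
    try:
        return created.replace(year=created.year + years)
    except ValueError:
        # Created on Feb. 29 and the target year is not a leap year.
        return created.replace(year=created.year + years, month=3, day=1)


print(disposal_date("financial", date(2008, 3, 15)))
print(disposal_date("environmental", date(2008, 3, 15)))
```

Because the rule is a table lookup rather than a person's judgment, two documents in the same category always get the same answer, which is the consistency Brooks is after.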
Because CVR's policy isn't finalized, the company hasn't gotten rid of any data. Brooks also says that once information reaches its retention limit, the company will start with a manual review to ensure the data should be destroyed. But his ultimate goal is to automate the destruction. "The manual intervention is where you get in trouble--everything becomes a judgment call," he says. "If the machine is doing it based on algorithms and parameters, at least your company can be consistent."
He's also aware of the need for legal holds. In the event of litigation, the plan is to use the Idol technology to search for relevant data and then move that information to a separate repository. Brooks' IT team also wrote agent software that moves data off corporate laptops and into the SANs whenever the laptops attach to the corporate network. When data is destroyed on the SANs, the agent also will erase it from the laptops.
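The interplay between retention limits, manual review, and legal holds can be sketched as a single decision function. This is an illustrative assumption about how such a check might be ordered, not CVR's or Autonomy's implementation; the `Record` class, its fields, and the action names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Record:
    record_id: str
    disposal_date: Optional[date]  # None means lifetime storage
    on_legal_hold: bool = False


def disposition_action(record: Record, today: date, auto_destroy: bool = False) -> str:
    """Decide what to do with a record on a given day.

    A legal hold is checked first, so it suspends disposition even
    when the retention period has already expired.
    """
    if record.on_legal_hold:
        return "hold"
    if record.disposal_date is None or today < record.disposal_date:
        return "retain"
    # Past the retention limit: queue for manual review until the
    # organization is ready to switch on automated destruction.
    return "destroy" if auto_destroy else "review"


expired = Record("doc-1", date(2015, 3, 15))
print(disposition_action(expired, date(2016, 1, 1)))
print(disposition_action(expired, date(2016, 1, 1), auto_destroy=True))
```

The `auto_destroy` flag mirrors the two-stage rollout Brooks describes: manual review first, automated destruction later.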
[Sidebar: "Do You Really Want To Save That?" Metrics charted: drop in access rate of some older data, such as e-mail, within 60 days; cost per gigabyte for Tier 1 storage; respondents who gained high or very high benefits in meeting retention policies through information life-cycle management. Data: Gartner, Oracle, and 2006 InformationWeek reader survey of 291 respondents]
Data disposition is a crowded vendor field. For instance, vendors of enterprise content management (ECM) systems--including EMC, Open Text, and IBM (via its FileNet software)--are adding classification, retention, and disposition capabilities to their portfolios. ECM products focus on records management to maintain strict control over official paper and electronic records, such as business contracts and legal documents, while providing content repositories, mechanisms for end users to check documents in and out of those repositories, and version control enforcement.
EMC's Documentum content management system offers the Retention Policy Services module, which lets IT create folders that will enforce specific retention policies. Administrators can choose between automated and manual disposition when information reaches the end of its retention period, and the module supports legal holds to suspend disposition. Documentum licenses the Fast enterprise search engine (recently acquired by Microsoft) to index and search information.
Open Text's Enterprise Library Services, rolled out in October 2007, provides a retention and disposition policy layer across a variety of content repositories, such as archives, file systems, Microsoft SharePoint, and SAP. In December 2007, IBM announced a SOA-based connection between FileNet and the IBM Classification Module. The module automates the classification of unstructured content, including e-mail, through full-text analysis.
In March, Hewlett-Packard announced it would acquire Tower Software, an Australian document and records management vendor, to expand its legal discovery and regulatory compliance capabilities.
Before the purchase, HP had included Tower's software in its Integrated Archive Platform, an archive appliance that serves as a central repository for a variety of data, including e-mail, Office documents, and SharePoint and Web content. Once inside the Integrated Archive Platform, the Tower software indexes and categorizes content so administrators can set up retention schedules. At the end of the retention period, the appliance destroys the data, essentially by writing over it in the repository.
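Destruction by overwriting can be illustrated at the level of a single file. The sketch below is a generic example and not HP's method: the function names are hypothetical, and on SSDs, copy-on-write file systems, or storage with snapshots and backups, an in-place overwrite may not reach every copy of the data.

```python
import os
import tempfile


def overwrite_file(path, passes=1, chunk_size=1024 * 1024):
    """Overwrite a file's contents in place with zero bytes."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            remaining = size
            while remaining > 0:
                n = min(chunk_size, remaining)
                f.write(b"\x00" * n)
                remaining -= n
            # Push the zeros to the storage device before moving on.
            f.flush()
            os.fsync(f.fileno())


def overwrite_and_delete(path, passes=1):
    """Overwrite a file, then remove its directory entry."""
    overwrite_file(path, passes=passes)
    os.remove(path)


if __name__ == "__main__":
    # Demonstrate on a throwaway temporary file.
    fd, demo_path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"expired record contents")
    overwrite_and_delete(demo_path)
    print("still exists:", os.path.exists(demo_path))
```

Overwriting before deleting matters because removing a file normally frees its directory entry while leaving the underlying bytes in place until something else reuses the space.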