Before you can chuck a piece of information, you have to know what it is. Thus, indexing and classification technologies are key. That's where CVR's Brooks is starting. The company bought Autonomy's Intelligent Data Operating Layer, or Idol, a software platform for enterprise search and classification, and centralized its storage around 10 geographically dispersed storage area networks (SANs). The platform uses connectors to tap into the SANs and index the content stored there.
Brooks started with a backlog of unindexed information stored in the SANs, including 1.9 million e-mails and 600,000 documents. It took about 10 days to create a searchable index of those data stores, and now the Idol engine keeps up with new data that gets moved into the storage networks.
It sounds great, but the dark side of indexing is that it adds to your overall data store. In fact, Brooks' team initially failed to properly size the database for the index because the team didn't anticipate just how large it would be. Autonomy says a typical Idol index runs 20% to 25% of the total data store, depending on the level of indexing, which can range from basic metadata to a catalog of a file's full contents.
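To make that sizing rule concrete, here is a minimal Python sketch that turns the 20% to 25% figure into a planning range. The function name and the 10,000-GB data store are illustrative assumptions, not CVR's numbers, and the ratios are only the vendor's stated typical range.

```python
def estimate_index_size(data_store_gb, low_ratio=0.20, high_ratio=0.25):
    """Return a (low, high) estimate of index size in GB.

    The default ratios are the 20% to 25% range Autonomy cites as
    typical; actual overhead depends on the level of indexing.
    """
    if data_store_gb < 0:
        raise ValueError("data_store_gb must be non-negative")
    return data_store_gb * low_ratio, data_store_gb * high_ratio


# Hypothetical example: a 10,000-GB data store (not a figure from the article).
low, high = estimate_index_size(10_000)
print(f"Plan for roughly {low:,.0f} to {high:,.0f} GB of index storage")
```

Sizing the index database toward the high end of the range leaves headroom if the team later moves from metadata-only indexing to full-content indexing.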
The next step is to categorize all this information for retention and disposition. CVR is still working through its disposition policy, though Brooks expects it to be in place by the first quarter of 2009. "Our objective is to take out the human element," he says. "Two people can look at the same document and categorize it differently. Any time there's human intervention, courts can question your consistency." By automating the process, he hopes to avoid disputes over the final disposition of a file.
Brooks' team is working with various company departments, including legal and accounting, as well as business units, on a policy that will designate different information categories to meet all the requirements for retention. Once the policy is in place, the Idol engine will assign data to the most appropriate category. "If it goes into a folder that has policies for financial documents, in seven years it will get disposed of," Brooks says. "If a document is environmental, that's lifetime storage."
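A category-based retention schedule of this kind can be sketched in a few lines of Python. This is only an illustration of the idea, not how Idol implements it: the category names, the `RETENTION_YEARS` table, and the `disposal_date` function are hypothetical, and the two periods are the ones Brooks mentions (seven years for financial documents, lifetime for environmental ones).

```python
from datetime import date
from typing import Optional

# Hypothetical policy table: years to retain, or None for lifetime storage.
RETENTION_YEARS = {
    "financial": 7,         # "in seven years it will get disposed of"
    "environmental": None,  # "that's lifetime storage"
}


def disposal_date(category: str, created: date) -> Optional[date]:
    """Return the date a document becomes eligible for disposal.

    Returns None for categories kept for life. An unknown category
    raises KeyError, so unclassified data is never silently scheduled.
    """
    years = RETENTION_YEARS[category]
    if years is None:
        return None
    try:
        return created.replace(year=created.year + years)
    except ValueError:
        # Created on Feb. 29 and the target year is not a leap year.
        return created.replace(year=created.year + years, day=28)


print(disposal_date("financial", date(2008, 6, 1)))      # 2015-06-01
print(disposal_date("environmental", date(2008, 6, 1)))  # None
```

Because the schedule lives in one table rather than in individual judgment calls, the same document always gets the same answer, which is the consistency Brooks is after.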
Because CVR's policy isn't finalized, the company hasn't gotten rid of any data. Brooks also says that once information reaches its retention limit, the company will start with a manual review to confirm that the data should be destroyed. But his ultimate goal is to automate the destruction. "The manual intervention is where you get in trouble--everything becomes a judgment call," he says. "If the machine is doing it based on algorithms and parameters, at least your company can be consistent."
He's also aware of the need for legal holds. In the event of litigation, the plan is to use the Idol technology to search for relevant data and then move that information to a separate repository. Brooks' IT team also wrote agent software that moves data off corporate laptops and into the SANs whenever the laptops attach to the corporate network. When data is destroyed on the SANs, the agent also will erase it from the laptops.
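The decision logic described here (a legal hold overrides everything, expired data goes to manual review at first, and destruction is automated later) might look something like the following Python sketch. The `Record` class, the `disposition_action` function, and the action labels are hypothetical names chosen for illustration; they are not part of CVR's agent software or the Idol product.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Record:
    name: str
    dispose_on: Optional[date]  # None means lifetime storage
    legal_hold: bool = False    # True while litigation is pending


def disposition_action(record: Record, today: date,
                       manual_review: bool = True) -> str:
    """Decide what to do with a record on a given day.

    A legal hold always wins. Otherwise, data past its retention
    limit goes to a human reviewer while manual_review is on, and
    is destroyed automatically once that switch is turned off.
    """
    if record.legal_hold:
        return "hold"      # e.g., move to a separate repository
    if record.dispose_on is None or today < record.dispose_on:
        return "retain"
    return "review" if manual_review else "destroy"


today = date(2016, 1, 1)
invoice = Record("invoice.pdf", dispose_on=date(2015, 6, 1))
print(disposition_action(invoice, today))                       # review
print(disposition_action(invoice, today, manual_review=False))  # destroy
invoice.legal_hold = True
print(disposition_action(invoice, today, manual_review=False))  # hold
```

Checking the hold first means a file caught up in litigation can never be destroyed just because its retention clock ran out.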
EMC's Documentum content management system offers the Retention Policy Services module, which lets IT create folders that will enforce specific retention policies. Administrators can choose between automated and manual disposition when information reaches the end of its retention period, and the module supports legal holds to suspend disposition. Documentum licenses the Fast enterprise search engine (recently acquired by Microsoft) to index and search information.
Open Text's Enterprise Library Services, rolled out in October 2007, provides a retention and disposition policy layer across a variety of content repositories, such as archives, file systems, Microsoft SharePoint, and SAP. In December 2007, IBM announced a SOA-based connection between FileNet and the IBM Classification Module. The module automates the classification of unstructured content, including e-mail, through full-text analysis. In March, Hewlett-Packard announced it would acquire Tower Software, an Australian document and records management vendor, to expand its legal discovery and regulatory compliance capabilities.
Before the purchase, HP had included Tower's software in its Integrated Archive Platform, an archive appliance that serves as a central repository for a variety of data, including e-mail, Office documents, and SharePoint and Web content. Once inside the Integrated Archive Platform, the Tower software indexes and categorizes content so administrators can set up retention schedules. At the end of the retention period, the appliance destroys the data, essentially by writing over it in the repository.
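The general idea of destroying data by writing over it can be illustrated with a short Python sketch. This is not HP's or Tower's implementation; the function names are hypothetical, and the approach assumes a storage layer that rewrites blocks in place. On solid-state drives and on journaling or copy-on-write file systems, an in-place overwrite does not guarantee that the old contents are gone.

```python
import os


def overwrite_in_place(path, chunk_size=1024 * 1024):
    """Overwrite every byte of a file with zeros; return bytes written."""
    size = os.path.getsize(path)
    written = 0
    with open(path, "r+b") as f:
        while written < size:
            n = min(chunk_size, size - written)
            f.write(b"\x00" * n)
            written += n
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the zeros to disk
    return written


def destroy(path):
    """Overwrite a file's contents, then remove its directory entry."""
    overwrite_in_place(path)
    os.remove(path)
```

Calling `destroy("expired_report.doc")` on a hypothetical expired file would zero its contents before deleting it, rather than just unlinking the file and leaving the old bytes on disk.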