Content: The Other Half of the Integration Problem

Counting file systems, e-mail servers and disparate repositories, unstructured information is all over the place. Content integration consolidates search, access and management control, but which approach is best for your enterprise?

While JSR 170 may standardize the API for indexing and search, it doesn't standardize metadata. The crux of the content integration problem is that corresponding metadata elements in each repository have different field names and formats. AccountID in one repository might correspond to CustNum in another.

At a minimum, you must be able to translate between them, to query for content and aggregate search results. Content integration provides tools for metadata mapping and, in some cases, its own metadata dictionary and schema. To incorporate content in file systems and Web sites, some products offer rules-based metadata extraction (auto-tagging) from the content itself.
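To make the mapping idea concrete, here is a minimal sketch of translating field names in both directions. The repository names (`repo_a`, `repo_b`), the common-schema names (`customer_id`, `title`) and the `DocTitle` and `Name` fields are assumptions made up for illustration; only AccountID and CustNum come from the example above. It covers field names only, not differences in value formats.

```python
# Hypothetical mappings from each repository's native field names
# onto a shared (common) schema. All names here are illustrative.
FIELD_MAPS = {
    "repo_a": {"AccountID": "customer_id", "DocTitle": "title"},
    "repo_b": {"CustNum": "customer_id", "Name": "title"},
}


def to_common(repo, record):
    """Rename a repository's native fields to the common schema."""
    mapping = FIELD_MAPS[repo]
    return {mapping[k]: v for k, v in record.items() if k in mapping}


def to_native(repo, common_query):
    """Translate a common-schema query into one repository's field names."""
    reverse = {common: native for native, common in FIELD_MAPS[repo].items()}
    return {reverse[k]: v for k, v in common_query.items() if k in reverse}
```

The same table serves both directions: `to_native` rewrites the outgoing query, and `to_common` rewrites the results coming back.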

Search Everything

With metadata mappings defined, content integration lets you search across repositories in a single query. Usually this is accomplished by query brokering--translating the query from a "universal" language into each repository's search parameters and interrogating each source in parallel. This is trickier than it sounds: Not every repository supports combined text-and-metadata queries, fuzzy searching and the other capabilities of the middleware query language. EMC Documentum ECI Services, for example, provides mapping and filtering rules that translate queries into "nearest equivalents" supported by each specific repository. Some search engines offer a similar form of federated search, but content integration lets you manage the content, not just find it.

Note that metadata mapping and filtering must be provided not just for the query, but also for the results so they can be aggregated in tables or simply listed in a common format. Query brokering leverages the indexes of the repositories themselves, which are always up to date with the latest content.
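A rough sketch of query brokering, assuming two in-memory stand-ins for real repositories: the query is translated into each source's field names, the sources are interrogated in parallel, and the hits are mapped back to a common format before being merged. The `REPOS` data, the field names and the function names are all hypothetical, and the exact-match lookup stands in for whatever search each repository really offers.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical repositories, each with its own field names.
# "field_map" goes from the common schema to the native field name.
REPOS = {
    "repo_a": {
        "field_map": {"customer_id": "AccountID", "title": "DocTitle"},
        "records": [
            {"AccountID": "1001", "DocTitle": "Lease agreement"},
            {"AccountID": "2002", "DocTitle": "Invoice"},
        ],
    },
    "repo_b": {
        "field_map": {"customer_id": "CustNum", "title": "Name"},
        "records": [{"CustNum": "1001", "Name": "Credit report"}],
    },
}


def search_repo(name, query):
    """Translate the query, search one repository, map hits back."""
    repo = REPOS[name]
    fmap = repo["field_map"]
    native_query = {fmap[field]: value for field, value in query.items()}
    hits = [
        rec for rec in repo["records"]
        if all(rec.get(f) == v for f, v in native_query.items())
    ]
    # Results are returned in the common schema, tagged with their source.
    return [
        {"source": name, **{common: rec[native] for common, native in fmap.items()}}
        for rec in hits
    ]


def brokered_search(query):
    """Query every repository in parallel and merge the result lists."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(search_repo, name, query) for name in REPOS]
        merged = []
        for future in futures:
            merged.extend(future.result())
    return merged
```

Calling `brokered_search({"customer_id": "1001"})` returns one hit from each stand-in repository, both expressed with the same `customer_id` and `title` keys.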

Another way to perform federated search is to maintain a universal index within the content-integration middleware. This index is periodically updated by crawling the indexes (or raw content) of each repository. Google users are familiar with this approach. Because the processing is done in advance, queries and result list aggregation are faster, but the index may not include the newest content.
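The trade-off can be illustrated with a toy universal index, sketched under the assumption that each repository is just a dictionary of document IDs and text. The `UniversalIndex` class and its method names are invented for this example. Lookups are quick because the work was done at crawl time, but a document added after the last crawl stays invisible until the next one.

```python
from collections import defaultdict


class UniversalIndex:
    """Toy inverted index rebuilt by crawling every repository."""

    def __init__(self):
        self._postings = defaultdict(set)

    def crawl(self, repositories):
        """Rebuild the index from the current repository contents."""
        self._postings.clear()
        for repo_name, docs in repositories.items():
            for doc_id, text in docs.items():
                for term in text.lower().split():
                    self._postings[term].add((repo_name, doc_id))

    def search(self, term):
        """Return (repository, document ID) pairs containing the term."""
        return sorted(self._postings.get(term.lower(), set()))


# Hypothetical content in two repositories.
repositories = {
    "repo_a": {"d1": "mortgage application"},
    "repo_b": {"d7": "Mortgage approval letter"},
}
index = UniversalIndex()
index.crawl(repositories)

# New content arrives after the crawl, so the index does not see it yet.
repositories["repo_a"]["d2"] = "mortgage statement"
stale_hits = index.search("mortgage")

# Only a fresh crawl picks up the new document.
index.crawl(repositories)
fresh_hits = index.search("mortgage")
```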

A few ECM vendors support both methods. IBM's WebSphere Information Integrator Content Edition, for example, uses query brokering, while WebSphere Information Integrator OmniFind Edition provides its own universal index.

Aggregate Views and Capture Events

Many content-integration offerings provide a virtual repository: a tree of virtual folders in which content items are aggregated from various repositories. The virtual repository doesn't replicate content and metadata; it merely provides an aggregated view. Virtual folders can be defined to support specific business processes, projects or activities. The folders can represent content queries or workflow inboxes. Moreover, they are dynamic, automatically updated when the content in the underlying repositories changes.
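One way to picture a virtual folder is as a saved query that is evaluated every time the folder is opened. The sketch below assumes repositories are plain lists of metadata dictionaries; the `VirtualFolder` class, the `project` field and the sample items are hypothetical. Nothing is copied into the folder, so a change in an underlying repository shows up the next time the folder is listed.

```python
class VirtualFolder:
    """A named, saved query over several repositories (no replication)."""

    def __init__(self, name, predicate, repositories):
        self.name = name
        self._predicate = predicate
        self._repositories = repositories  # held by reference, never copied

    def list_items(self):
        """Evaluate the query against the repositories as they are now."""
        return [
            (repo_name, item["title"])
            for repo_name, items in self._repositories.items()
            for item in items
            if self._predicate(item)
        ]


# Hypothetical repositories and a folder for one project.
repositories = {
    "repo_a": [{"title": "Contract", "project": "alpha"}],
    "repo_b": [{"title": "Site plan", "project": "beta"}],
}
alpha_folder = VirtualFolder(
    "Project Alpha", lambda item: item["project"] == "alpha", repositories
)
before = alpha_folder.list_items()

# Content added to an underlying repository appears in the folder
# without any change to the folder itself.
repositories["repo_b"].append({"title": "Budget", "project": "alpha"})
after = alpha_folder.list_items()
```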

Because access to the virtual repository is usually over the Web, view services become a valuable content-integration feature. These include Web viewers for common content types, such as Microsoft Office, and on-the-fly converters for document image formats such as TIFF and MO:DCA files.

When content is added or updated or a deadline is reached, leading ECM repositories generate events that can be used to trigger actions based on business rules. For example, a mortgage loan application approval can trigger the printing and mailing of letters of congratulations as well as separate cross-selling processes for homeowner insurance.

Content integration extends this idea to the virtual repository. You can define rules that determine whether an event has occurred in any connected repository and then specify how these events will be handled. Event-triggered actions can be invoked on any content repository or workflow system connected via content-integration middleware. In some offerings, such as IBM WebSphere Information Integrator Content Edition, content events can even tie directly into the enterprise business-integration infrastructure, which can invoke Web services, J2EE components or JCA connectors to external business systems.
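The rule-and-action pattern might look something like the following sketch, which borrows the mortgage example from above. The `EventRouter` class, the event fields and the action functions are assumptions for illustration and do not reflect any vendor's API. Each rule pairs a condition with an action, and an event from any connected repository is checked against every rule.

```python
class EventRouter:
    """Matches repository events against rules and runs their actions."""

    def __init__(self):
        self._rules = []

    def add_rule(self, condition, action):
        """Register an action to run whenever the condition is true."""
        self._rules.append((condition, action))

    def publish(self, event):
        """Run every matching action and return what each one produced."""
        return [action(event) for condition, action in self._rules
                if condition(event)]


def is_loan_approval(event):
    return event["type"] == "loan_approved"


def send_congratulations(event):
    return f"print letter for customer {event['customer_id']}"


def start_insurance_cross_sell(event):
    return f"start insurance offer for customer {event['customer_id']}"


router = EventRouter()
router.add_rule(is_loan_approval, send_congratulations)
router.add_rule(is_loan_approval, start_insurance_cross_sell)

# The event may originate in any connected repository.
actions = router.publish(
    {"type": "loan_approved", "customer_id": "1001", "source": "repo_a"}
)
ignored = router.publish({"type": "document_viewed", "customer_id": "1001"})
```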

Reality Check

While content integration is powerful technology, it's still in the early adopter stage. Initial deployments of IBM WebSphere Information Integrator Content Edition have been led by federal government intelligence applications, but the vendor sees CRM as the next wave. Content integration can aggregate contracts and other documents with transaction data in a single SQL query, providing a "360-degree view" of the customer. Another promising avenue for content integration is compliance and records retention. A new IBM offering called Federated Records Management uses content integration to link multiple content stores with DB2 Records Manager under a classification and retention policy.

FileNet implementations have emphasized the imaging/workflow side. Customers use FileNet to extend the life of legacy image repositories by integrating them with newer business process management and compliance platforms.

Mobius Management Systems provides content integration in its ViewDirect Total Content Integration and ViewDirect Records Management offerings. Like FileNet, Mobius emphasizes fixed content such as host reports and document images, but the company offers connectors for Microsoft SharePoint, relational databases and other ECM repositories. FileNet and Mobius also emphasize the need to extend the records management, retention and compliance controls to all organizational content.

Content integration is EMC Documentum's fastest-growing product this year. Interest is strongest on the knowledge management and discovery side, with the common problem being that customers simply have too many places to search for what they need.

Day Software places content integration in the context of enabling managed migration from proprietary ECM to lower-cost, standards-based repositories. With the first JSR 170-compliant repository and ECI services built into the platform, Day wants to be the beneficiary of that migration.

Integration vs. Migration

Day's strategy reveals the changing economics of CM technology. Legacy content repositories, especially those designed for high-performance imaging and workflow, cost more to license and maintain than newer, generic content repositories. Integration expenditures, including middleware, connectors and related services, average $300,000, according to Forrester Research. Migration costs include the analysis, mapping and movement of content (and related services) plus new licenses, offset by whatever the new repository saves in maintenance compared with the old one. Migration also carries risk: As York International discovered, moving even a small amount of content can lead to unanticipated security and access problems (see the "Field Report"). If maintenance costs for the legacy repository are large enough, migration can make economic sense, but integration can solve the big problems quickly and let you migrate over time.
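The migration arithmetic can be sketched as a small function. The planning horizon in years is an assumption added here, since maintenance savings only add up over time, and every dollar figure in the example other than Forrester's $300,000 integration average is made up purely to show the calculation.

```python
def net_migration_cost(one_time_costs, new_licenses,
                       old_maintenance_per_year, new_maintenance_per_year,
                       years):
    """One-time migration costs plus licenses, less maintenance savings.

    `years` is an assumed planning horizon, not a figure from the article.
    """
    savings = (old_maintenance_per_year - new_maintenance_per_year) * years
    return one_time_costs + new_licenses - savings


# Hypothetical figures for illustration only.
migration = net_migration_cost(
    one_time_costs=400_000,            # analysis, mapping, movement, services
    new_licenses=150_000,
    old_maintenance_per_year=200_000,
    new_maintenance_per_year=80_000,
    years=3,
)
integration_average = 300_000          # Forrester figure cited above
migration_is_cheaper = migration < integration_average
```

With these invented inputs, migration comes out ahead over three years; shorten the horizon or shrink the maintenance gap and the comparison flips, which is the point of the paragraph above.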

Wachovia provides a good example of how content integration lets enterprises manage the cost and challenges of repository reorganization. In early 2003, the banking and financial services company needed to either migrate or integrate diverse content repositories in several lines of business, due in part to mergers and acquisitions. Each business had funded its own IT initiatives, but rather than replicate point-to-point integration projects, Wachovia used integration middleware. The arrangement was that if Commercial Loans funded development of the integration infrastructure, central IT would pay for its operation, confident that other lines of business would contribute funding as their own repositories needed to be integrated. Retail Brokerage soon joined the project, and other units now plan to follow. With the first integrations completed before the end of the year, Wachovia's individual lines of business were able to integrate and migrate incrementally, affordably and consistently.

Draw Interest

Content integration is now dominated by the big ECM vendors, largely as a result of recent acquisitions. In the past year, IBM acquired Venetica, a supplier of middleware to several ECM vendors, and turned that software into the WebSphere Information Integrator Content Edition. EMC bought askOnce from Xerox, turning it into Documentum ECI Services. Oracle, which in August announced a major upgrade of its Collaboration Suite with ECM-oriented content, record and workflow services, that same month acquired ContextMedia, one of the few remaining independent content-integration vendors.

As we've seen before with electronic records management and team collaboration, when ECM giants snap up boutique technology startups, market awareness and demand for the new technology spike. While you may not have heard much about content integration yet, chances are you'll hear a lot more in the coming year.

Bruce Silver is president of Bruce Silver Associates. Write to him at [email protected].