Business intelligence (BI) and data warehousing professionals: Your comfort zone is shrinking. Dramatic changes lie ahead. Conventional applications are giving way to standards-based Web services. The lines between operational and analytic systems are blurring. And as organizations try to make sense out of what's contained in their myriad data stores, standard data integration and presentation techniques are proving costly and difficult.
The bottom line is that current practices aren't adequate. Business agility depends on the ability to assemble, disassemble and rearrange application components. These actions require a comprehensive understanding of not only data representation ("syntax"), but also data's meaning and its relationships to other data and information — that is, the "semantics." In this article, we'll look at how we're moving from data integration toward higher-level semantic integration.
When you can't effectively integrate information, valuable opportunities are lost — and fraud, regulatory or security threats go unrecognized. The right information in the right hands at the right time can lead to the right action by the right people at the right moment.
Unfortunately, current techniques limit the breadth and depth of integration. Valuable data and content lie buried in technology silos that are difficult to bridge.
For enterprise content integration (ECI), potential solutions include choices that go beyond mere connectivity to support sophisticated indexing, search and virtual repositories. XML, JSR 170 and other standards figure prominently. ECI implementation is still at the early adopter stage; vendors are jockeying for future dominance as organizations place increasing priority on cross-functional business processes and real-time information delivery.
Integration at a higher level — at something approaching natural language — holds great promise, especially as "Semantic Web" standards evolve. Semantic integration is a key strategic direction for both structured data and unstructured content integration because it focuses on meaning: that is, how pieces of information relate to each other. This is significant not only to human use of information but also to computer-based processes, applications and Web services that must respond in real time to customer activity and other business events.
From ETL to Chaos
Over the past 15 years, we've seen a sequence of integration technologies and methodologies emerge, flourish and hang on. First, extract, transform and load (ETL) delivered data integration and movement for data warehouses. Then enterprise application integration (EAI), along with message-oriented middleware, opened the door to business-to-business Web commerce. Now there's a surge of interest in enterprise information integration (EII) among those looking to do real-time operational reporting and other time-sensitive activities. By supporting the delivery of queries to the data sources rather than waiting on ETL and data movement steps to get the data into the warehouse, EII addresses a weakness of conventional data warehousing when it comes to real-time objectives.
While each approach has its merits, this collection of technologies can't be amalgamated to provide the level of information integration most organizations need. Analysts and vendors frequently suggest that ETL, data warehouse, EAI, EII and other integration tools are complementary. In other words, even if you apply them separately, you'll ultimately arrive at a complete solution. They're mistaken.
Each technology crevice between the different tools requires a separate modeling and mapping effort, leaving organizations with multiple models. ETL demands a target database schema. EAI requires agreement on a canonical form among the applications. And with the less mature EII, technical demands vary from vendor to vendor — from a simple set of views to a full model developed in Unified Modeling Language (UML). From a management perspective, each integration solution tends to operate from within a different technology stack and, therefore, carries unique design constraints, tuning characteristics and vendor upgrade cycles.
Finally, integration tools generally don't expose their metamodels or use other means of communicating. Certainly, none integrates data beyond the syntactic level. The tools manage semantics — information that focuses on conveying the meaning of data and information — in ad hoc fashion, if at all. Because each tool manifests itself differently to those who use the information, it's unlikely that a knowledge worker could see a single view that combined the fruits of ETL, EAI and EII. Rather, each tool would more likely be an element of distinctly separate application or information architectures.
Say a field service representative wants to check on the status of a product's shipment. Through a logistics dashboard, the rep would like to see where action is necessary to resolve a bottleneck. An EII tool might generate a federated query to three separate operational systems to refresh the dashboard's real-time display. Now able to see the problem, the rep must turn to a separate BI interface to access the data warehouse and extract performance and pricing records for potential suppliers. Then, to reach out to the contracted supplier, the rep must switch to a third application and use another interface supplied by the EAI and messaging middleware.
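To make the federated query in this scenario concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular EII product works: the three in-memory SQLite databases stand in for the operational systems, and every table, column and function name (orders, picks, shipments, shipment_status) is hypothetical.

```python
import sqlite3

def make_source(ddl, insert_sql, rows):
    """Create an in-memory database standing in for one operational system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn

# Three hypothetical operational systems, each with its own schema and key name.
orders = make_source(
    "CREATE TABLE orders (order_id TEXT, product TEXT)",
    "INSERT INTO orders VALUES (?, ?)",
    [("A100", "pump")],
)
warehouse = make_source(
    "CREATE TABLE picks (order_ref TEXT, picked INTEGER)",
    "INSERT INTO picks VALUES (?, ?)",
    [("A100", 1)],
)
carrier = make_source(
    "CREATE TABLE shipments (ref TEXT, status TEXT)",
    "INSERT INTO shipments VALUES (?, ?)",
    [("A100", "held at customs")],
)

def shipment_status(order_id):
    """Query each source at request time and merge the answers into one record."""
    product = orders.execute(
        "SELECT product FROM orders WHERE order_id = ?", (order_id,)
    ).fetchone()
    picked = warehouse.execute(
        "SELECT picked FROM picks WHERE order_ref = ?", (order_id,)
    ).fetchone()
    status = carrier.execute(
        "SELECT status FROM shipments WHERE ref = ?", (order_id,)
    ).fetchone()
    return {
        "order_id": order_id,
        "product": product[0] if product else None,
        "picked": bool(picked[0]) if picked else None,
        "carrier_status": status[0] if status else None,
    }

print(shipment_status("A100"))
```

Notice that the same order key goes by three different names (order_id, order_ref, ref), and that the knowledge of how they line up lives only inside the function. That is integration at the syntactic level: nothing in the sketch records what a "shipment" means or how it relates to a supplier or a contract.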
The time has come to bring disparate sources of information together to make life easier and business conditions more understandable for employees up and down the line — employees whose expectations are rising with exposure to BI, portals, dashboards and other interfaces. We'll find the successor to the current dysfunctional set of tools in semantic integration, also known as "ontology."
But first, consider some of our cherished concepts of data warehousing: Do they need retooling, or even retirement?
Single Version of the Truth
As it has matured, the dynamism that once characterized data warehousing has solidified into cured concrete. Orthodoxy and best practices reign; the upside, of course, is that the risks that once came with experimentation and learning-by-doing are mostly gone. The big debates, such as relational versus multidimensional OLAP and third-normal form versus star schema, have all but ended.
The unifying goal today is the "single version of the truth." SVT springs from the original motivation for data warehousing: Establish one place for all the data needed for managerial reporting, thereby alleviating the need to query hard-pressed operational systems directly. The data warehouse has let companies keep a longer historical record — which is built through data integration and ETL — and eliminate, or at least hide, multiple, conflicting meanings, syntax and definitions that exist among source systems. The SVT ideal is to have a data warehouse that contains the one true, inviolate set of enterprise data. Unfortunately, this ideal remains just that for most organizations. Typically, the data warehouse merely adds one more source of meaning.
To create an SVT resource, standard operating procedure is to gather representatives from various user constituencies and develop a data vocabulary. However, as the size of this congress increases, the peaks and valleys of the "truth" get flattened by compromise. And as you add constituents, the effort takes longer. Difficult and painful revisions and arcane expansions on the definitions happen continuously. People who are supposed to be served by the SVT data warehouse become disaffected and seek alternatives. Suddenly, the real work is taking place downstream, in data marts, in the persistent data stores of BI and analytic tools and in spreadsheets. The result is an all-too-familiar alchemy: SVT becomes MVT, or many versions of the truth.
Take the effort to define "customer." To the accounts receivable department, a customer is anyone who owes money or has paid a bill. The sales organization sees a customer as any entity likely to place an order this quarter. Marketing defines a customer as anyone who might ever place an order.
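The three definitions can be written down as three separate rules over the same records, which shows how quickly they diverge. The Python sketch below is only an illustration: the Party fields are hypothetical, and using an open quote as a proxy for "likely to order this quarter" and target-market membership as a proxy for "might ever order" are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Party:
    name: str
    balance_due: float             # money currently owed
    bills_paid: int                # number of bills paid to date
    open_quote_this_quarter: bool  # assumed proxy for "likely to order this quarter"
    in_target_market: bool         # assumed proxy for "might ever place an order"

def is_customer_ar(p):
    """Accounts receivable: owes money or has paid a bill."""
    return p.balance_due > 0 or p.bills_paid > 0

def is_customer_sales(p):
    """Sales: likely to place an order this quarter."""
    return p.open_quote_this_quarter

def is_customer_marketing(p):
    """Marketing: might ever place an order."""
    return p.in_target_market

parties = [
    Party("Acme", 1200.0, 3, True, True),
    Party("Birch", 0.0, 0, True, True),
    Party("Cedar", 0.0, 0, False, True),
    Party("Dune", 0.0, 5, False, False),
]

definitions = {
    "accounts receivable": is_customer_ar,
    "sales": is_customer_sales,
    "marketing": is_customer_marketing,
}

# The same four records yield three different customer lists.
customers = {
    dept: [p.name for p in parties if rule(p)]
    for dept, rule in definitions.items()
}
for dept, names in customers.items():
    print(dept, names)
```

With this sample data the three departments agree only on Acme. Each list is correct within its own context, and none of them is wrong.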
More interesting variations on the "truth" happen when there are different points of view about derived values, especially performance metrics. "On-time percentage" could have a hundred different meanings. So could "customer profitability." A data warehouse is useful for tracking raw components that feed such calculations and metrics; however, few methodologies address what to do about multiple derivations in downstream applications.
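A small worked example shows how one metric name can hide several calculations. This Python sketch is illustrative only: the shipment records and the three candidate rules for "on time" are hypothetical, chosen to show that identical raw data can support very different percentages.

```python
from datetime import date, timedelta

# Hypothetical raw components of the kind a data warehouse might track.
shipments = [
    {"promised": date(2024, 3, 10), "shipped": date(2024, 3, 8), "delivered": date(2024, 3, 10)},
    {"promised": date(2024, 3, 10), "shipped": date(2024, 3, 10), "delivered": date(2024, 3, 11)},
    {"promised": date(2024, 3, 10), "shipped": date(2024, 3, 11), "delivered": date(2024, 3, 13)},
    {"promised": date(2024, 3, 15), "shipped": date(2024, 3, 12), "delivered": date(2024, 3, 17)},
]

def on_time_percentage(rows, is_on_time):
    """Share of rows, as a percentage, that satisfy the given on-time rule."""
    return 100.0 * sum(1 for r in rows if is_on_time(r)) / len(rows)

# Three plausible derivations of the same metric name.
rules = {
    "shipped by promise date": lambda r: r["shipped"] <= r["promised"],
    "delivered by promise date": lambda r: r["delivered"] <= r["promised"],
    "delivered within one day of promise": lambda r: r["delivered"]
    <= r["promised"] + timedelta(days=1),
}

results = {name: on_time_percentage(shipments, rule) for name, rule in rules.items()}
for name, value in results.items():
    print(f"{name}: {value:.1f}%")
```

For these four shipments the three rules report 75.0%, 25.0% and 50.0% respectively. Storing the raw dates in one place does nothing to settle which figure a downstream report should call "on-time percentage."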
Today's source systems weren't built by a few programmers in the IT department: ERP, CRM and other transactional systems involved massive software engineering and thousands of person-years of development. While it was once a romantic notion that a single collection of relational data could be the inviolate repository, time has shown that tools and techniques are largely inadequate to the task. Plus, the SVT idea may be flawed anyway; there are and always will be multiple versions and contexts for data as well as derivations and presentations of data. (Editor's Note: SVT is not the same as single source of truth; for a perspective on this issue, see Letter Drop.)
Metadata: Find the "Go-To" Superuser
BI tools typically offer limited or proprietary capabilities for capturing and conveying "metadata" — the data about the data that provides schema, table, index and other definitions and context. Metadata has become like world peace: If you want to win the pageant, you'd better mention it in your speech. Alas, the meaning of metadata is just as elusive. There is metadata that describes ETL operations, metadata that defines source and target data mapping, and metadata that defines the data elements in a natural language. Database management systems have metadata of sorts in their catalogs. And front-end BI tools offer metadata to describe transformations beyond ETL (such as table or attribute to hierarchy) and to perform calculations, format reports and help define security privileges. All these pieces are useful, if not essential, but they don't give you a unified view of the data landscape. To the extent that they inform knowledge workers directly, the metadata is too sparse to be valuable.