The New Economy, like its predecessors, proved unable to overcome business cycles. However, there's no question that economic developments in recent years have dramatically changed the global business landscape. Deregulation and globalization have removed many business barriers. To survive competitively, many organizations have had to reinvent themselves.
Transactional applications deployed have become silos of information, and do not meet the business intelligence demands of analysts and users functioning under different conditions. Collating, integrating, and reporting on data spread across heterogeneous platforms and applications remain major stumbling blocks in information access and delivery.
To address this challenge, many organizations have developed an enterprise data warehouse (EDW). The EDW architecture integrates data spread across transactional systems into a central repository, against which users may perform business analytics. The EDW is then loaded periodically with new data from operational systems. Reports, mostly strategic in nature, came to business users after an acceptable time lag. Overall, the EDW approach has served business needs well.
However, the recent economic slowdown marked by business contraction and deflation has forced companies to work hard at optimizing business processes, improve efficiency, and do everything possible to reduce costs. Most have an urgent need to maximize return on past capital investments, leveraging the most out of infrastructure systems already in place. What is the best way for companies to gain ROI from their data warehouse investments?
First-generation data warehouses focused on reporting; the second generation brought online analytic processing (OLAP) and data mining. Sophisticated, multidimensional OLAP tools came to the market, which vastly eased complex data analysis through features for drilling down and pivoting on data, building and applying models for predictive analysis, and using the Web to publish reports.
The new generation, in response to current economics and business requirements, is focused on doing what is necessary to speed information cycle time, drive out information latency, and thereby enable users at all levels to be more effective. Information on demand is what businesses need. A real-time enterprise has to have up-to-date information for decision-making and for optimizing critical business processes.
The requirement to deliver information on demand for active decision support is fueling the need for real-time (or perhaps more appropriately called "right time") data warehouses, business activity monitoring (BAM), and alerting. These advances enable organizations to more efficiently execute business strategy by making optimal use of data resources. The real-time data warehouse (RTDW) focuses on cutting down latency in the decision-making process by empowering a wide spectrum of users. Information overload is reduced through monitoring business performance in the background and only reporting back exceptions to norms. RTDW also positions information better for use by downstream applications, such as customer relationship management.
How does the architecture of a RTDW differ from that of a traditional data warehouse? I should first say that an RTDW does not need to be built in complete isolation; organizations with traditional data warehouses in production can extend them with help from suitable tools and techniques. Most organizations will evolve their data warehouses toward hybrid architectures, such that shown in Figure 1. How much and how well you can extend the traditional system depends on the scope of "information on demand" requirements and the level of latency that your organization deems acceptable. If careful considerations are not made about this scope, the costs of moving to an RTDW can rise.
Here are some key differences between a RTDW and a traditional data warehouse:
Tactical vs. strategic. A traditional data warehouse is inherently passive in nature; it only supports strategy formulation by selected users. It enables long-term planning and strategizing by looking at the historical trends. A RTDW focuses on improving tactical decision support and the execution of the strategy.
Real time vs. batch. As the name suggests, a RTDW delivers information on demand. It integrates the latest data from transactional systems with historical or contextual data to provide the most up-to-date view of the business. The queries work on lower data volumes, analyzing trends for isolated business cases. By contrast, a traditional data warehouse is batch-oriented; it is used for offline analysis. The most demanding queries in a traditional warehouse work on very large volumes of data to hunt for hidden patterns often across the entire spectrum of a business dimension.
Integrated vs. isolated. A major intention of a RTDW is to integrate data warehousing with business processes and the application systems supporting them. Analytical CRM applications offer a classic example. A traditional data warehouse would perform customer segmentation, scoring and life-time value computations as isolated batch processes and this may be good enough for most segment-based marketing. However, businesses that desire highly interactive, personalized customer conversations need to have integrated, real-time analytics to support their efforts to establish positive customer interaction.
Guaranteed vs. best effort. A traditional data warehouse generally operates offline and delivers only on a best-effort basis. It guarantees neither availability nor performance. This is acceptable, as some delay in delivery of strategic information does not significantly erode the value of the information. In contrast, a RTDW must support active, ongoing decision-making and therefore needs to guarantee both availability and performance. If the active data warehouse consistently fails to deliver, the negative impact on the business could be significant.
For a RTDW, data capture and integration is vital and quite different from traditional systems. The goal of real-time data extraction, transformation, and loading (ETL) is to keep the warehouse refreshed, with minimal delay, as changes happen in the operational systems data. This may sound simple but complex architectural issues need to be considered carefully. The first is how the system identifies what data has been added or changed since the last extract. This is called changed data, or the "delta." Identifying the delta is simple if source transactions record a timestamp. If not, the job becomes more complex and difficult; you need special push-pull mechanisms, such as a message queue, database triggers, or streams.
The second issue is the performance impact of real-time extracts on the source system. Source systems are mostly transactional and mission critical; extracts clearly should not damage their performance. Finally, as the data is continuously loaded into the RTDW, key user perception issues, including how data is aggregated, must be managed carefully.
There are two basic techniques for real-time ETL: simulated real-time feed and trickle feed based on message queues. The choice of technique depends on the architecture, platforms, and performance requirements.
Simulated real time. The ETL objects in a traditional, batch-oriented data warehouse typically run either daily or weekly. To evolve to a RTDW, a simple and cost-effective technique is to increase the frequency of these batch runs to hourly or even more frequently by scheduling the extraction scripts accordingly. This technique is not truly real time, but simulates the effect and may be good enough for cases that don't require up-to-the-minute information. This technique requires minimal changes to existing ETL infrastructure and therefore is easy to implement.
Trickle feed. Trickle-feed techniques allow continuous update of the RTDW as the data in the source system changes. These techniques use a messaging infrastructure setup between the data sources and the real-time warehouse. A data pipe is set up between the source and the target and is perpetually open. This technique is also called "streaming" and is a form of data replication strategy. As Figure 2 shows, the basic elements of this technique are capture, stage, and apply.
Before the system can stream data using message queues from source to target, you must set up and configure the source and target databases. Setup may require use of special gateways if the source and target databases are not the same. With a source-capture process, the RTDW system can use "implicit capture"; that is, automatically capture changes to tables by monitoring and reading the system redo log files used for rollback and recovery. Such changes could either be to the table structure or the data content but for the purposes of a RTDW, primary interest focuses on the latter.
The RTDW records changes as logical change records (LCRs) and queues them up in a persistent storage space: that is, a staging area. This area is typically a separate partition in the database. User applications can explicitly update this queue. Overhead incurred in monitoring and reading the redo logs is relatively low, which means that the impact on source system performance is kept to a minimum.
At the target database, a process takes the LCRs out of the message queue and the changes are applied to selected database objects. In a heterogeneous database environment, the RTDW system can apply LCRs to remote databases. The apply process is initiated frequently, typically once every few minutes, to propagate new data to the RTDW. Trickle-feed mechanisms such as message queuing and data streams also support any data transformation required before loading into the RTDW. Setting transformation rules in the message queues will take care of this step.
Message queues and streams thus provide the necessary capabilities needed for populating the real time data warehouse with ongoing data changes in the source system. They are fast, reliable, and cause minimal impact on the source application systems. However, real-time data integration does require more up-front development and can be complex to configure and manage.
To achieve a RTDW, you will need to extend traditional data warehouse database designs to accommodate continuous data trickle feeds intermixed with live user queries. The majority of the changes are around schema design, active partition management, and data aggregation. The best course of action is to develop an architecture that can handle the storage of real-time data together with the static historical data on a case-by-case basis. There are a number of possibilities. You could choose to store real-time data together with the static data in the same fact table or the system could store it in a separate table. Your decision depends on the querying and alerting needs of the downstream analytic applications.
The RTDW system could store trickle-feed data in a separate partition in the data warehouse, which is typically called an "active," or delta, partition. While updates are occurring, this active partition is either offline or not visible to all user queries. At a certain predefined interval, typically once every few minutes, the system renames the active partition so it may be merged with the data warehouse tables.
Once merged, queries can work against the new data. This process is often called "flipping" the real-time data. The active partition may be physically consumed in the process of data flipping; if this is the case, the system will open a new active partition to accommodate the next load of continuous updates. The exact technique depends on the capabilities available in the data warehouse database.
Active partitions enable queries to work with constant data snapshots. As the data keeps changing in the data warehouse fact tables, the system must also update any aggregates or summary tables that have been created for performance reasons. The strategy shown in Figure 3 offers incremental aggregation.
If necessary, the RTDW system can intentionally delay partition flipping until an end-of-day batch load. With this strategy, alerts, activity monitoring, and selected real-time reporting can work against the active partition, while most of the other user queries continue to work against the static partition in the data warehouse.
RTDWs extend traditional warehouse querying and reporting services and enable a completely new breed of downstream applications that are developed to provide online, real-time decision support, activity monitoring, and alerting. Making such applications succeed requires tight integration of the RTDW with existing information delivery channels in an organization, such as enterprise applications, portals, and multichannel devices. Middleware deployed must support all delivery mechanisms. Alas, many querying and reporting products that require bulk offline data transfer from the warehouse to the product's engine for processing are not suitable in a real-time environment.
Generally, existing enterprise applications can work with the RTDW's information. Such applications are often deployed to a vast number of users, and cover many customer, supplier, and employee touchpoints. Therefore, enterprise applications present an excellent opportunity for a high-impact, ROI-delivering RTDW. A CRM application, for example, could benefit from real-time reports about a customer's propensity to buy a product while he or she is on the phone or visiting a company Web site.
RTDWs could also aid in performance management: for example, where an employee wants to set key performance indicators (KPIs) on his or her portal home page to manage by objective. The RTDW could refresh the KPI data periodically and alert the employee (potentially through a pager or other mobile device) when certain critical conditions exist. Such alerts could be emailed to the employee for less critical situations. In this way, the RTDW enables continuous information dissemination from the data warehouse to a broader user community.
Given current business requirements, a RTDW is a natural evolution from the traditional EDW. Extending the traditional data warehouse with real-time data will help organizations meet the need to reduce information latency, a key factor for successful real-time enterprises.
Promising better integration with enterprise business applications, RTDW can help organizations meet more aggressive service levels and deliver business intelligence to a wider community of knowledge workers, who depend on information to execute business strategies. Finally, by tapping into existing systems effectively, the RTDW will allow many organizations to maximize the potential of their data investments.
Rajesh Gadodia [[email protected]] is a senior solutions manager with Oracle Corp.'s BI & DW Asia Pacific Division.