Right in Time - InformationWeek

InformationWeek is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them.Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.

IoT
IoT
Software // Information Management

Right in Time

To survive today's challenges, businesses must take a long look at their data warehousing investment - and break traditional barriers to delivering strategic and tactical information faster.

Real-Time Integration

For a RTDW, data capture and integration is vital — and quite different from traditional systems. The goal of real-time data extraction, transformation, and loading (ETL) is to keep the warehouse refreshed, with minimal delay, as changes happen in the operational systems data. This may sound simple but complex architectural issues need to be considered carefully. The first is how the system identifies what data has been added or changed since the last extract. This is called changed data, or the "delta." Identifying the delta is simple if source transactions record a timestamp. If not, the job becomes more complex and difficult; you need special push-pull mechanisms, such as a message queue, database triggers, or streams.

The second issue is the performance impact of real-time extracts on the source system. Source systems are mostly transactional and mission critical; extracts clearly should not damage their performance. Finally, as the data is continuously loaded into the RTDW, key user perception issues, including how data is aggregated, must be managed carefully.

There are two basic techniques for real-time ETL: simulated real-time feed and trickle feed based on message queues. The choice of technique depends on the architecture, platforms, and performance requirements.

Simulated real time. The ETL objects in a traditional, batch-oriented data warehouse typically run either daily or weekly. To evolve to a RTDW, a simple and cost-effective technique is to increase the frequency of these batch runs to hourly or even more frequently by scheduling the extraction scripts accordingly. This technique is not truly real time, but simulates the effect and may be good enough for cases that don't require up-to-the-minute information. This technique requires minimal changes to existing ETL infrastructure and therefore is easy to implement.

Trickle feed. Trickle-feed techniques allow continuous update of the RTDW as the data in the source system changes. These techniques use a messaging infrastructure setup between the data sources and the real-time warehouse. A data pipe is set up between the source and the target and is perpetually open. This technique is also called "streaming" and is a form of data replication strategy. As Figure 2 shows, the basic elements of this technique are capture, stage, and apply.


Figure 2 Trickle-feed architecture for real-time data load.

Before the system can stream data using message queues from source to target, you must set up and configure the source and target databases. Setup may require use of special gateways if the source and target databases are not the same. With a source-capture process, the RTDW system can use "implicit capture"; that is, automatically capture changes to tables by monitoring and reading the system redo log files used for rollback and recovery. Such changes could either be to the table structure or the data content — but for the purposes of a RTDW, primary interest focuses on the latter.

The RTDW records changes as logical change records (LCRs) and queues them up in a persistent storage space: that is, a staging area. This area is typically a separate partition in the database. User applications can explicitly update this queue. Overhead incurred in monitoring and reading the redo logs is relatively low, which means that the impact on source system performance is kept to a minimum.

At the target database, a process takes the LCRs out of the message queue and the changes are applied to selected database objects. In a heterogeneous database environment, the RTDW system can apply LCRs to remote databases. The apply process is initiated frequently, typically once every few minutes, to propagate new data to the RTDW. Trickle-feed mechanisms such as message queuing and data streams also support any data transformation required before loading into the RTDW. Setting transformation rules in the message queues will take care of this step.

Message queues and streams thus provide the necessary capabilities needed for populating the real time data warehouse with ongoing data changes in the source system. They are fast, reliable, and cause minimal impact on the source application systems. However, real-time data integration does require more up-front development and can be complex to configure and manage.

We welcome your comments on this topic on our social media channels, or [contact us directly] with questions about the site.
Previous
2 of 3
Next
Comment  | 
Print  | 
More Insights
Slideshows
Top-Paying U.S. Cities for Data Scientists and Data Analysts
Cynthia Harvey, Freelance Journalist, InformationWeek,  11/5/2019
Slideshows
10 Strategic Technology Trends for 2020
Jessica Davis, Senior Editor, Enterprise Apps,  11/1/2019
Commentary
Study Proposes 5 Primary Traits of Innovation Leaders
Joao-Pierre S. Ruth, Senior Writer,  11/8/2019
White Papers
Register for InformationWeek Newsletters
Video
Current Issue
Getting Started With Emerging Technologies
Looking to help your enterprise IT team ease the stress of putting new/emerging technologies such as AI, machine learning and IoT to work for their organizations? There are a few ways to get off on the right foot. In this report we share some expert advice on how to approach some of these seemingly daunting tech challenges.
Slideshows
Flash Poll