The Active Data Warehouse

Stephen Brobst, CTO of NCR's Teradata division, explains why you need an active data warehouse.


How are complications from increasing data granularity, data and query volumes, and frequency of changes impinging on business?

Gartner Group research predicts that by 2012, organizations will be dealing with 30 times more data than they are today. But some of this data is unstructured content, so not all of it gets into the data warehouse.

What I see with customers is that detailed data is getting more detailed. Ten years ago, telcos thought of detailed data as the billing history. Today, if you don't have call detail records in the data warehouse, you're not even competitive. Ten years ago, if you said you were going to put call detail in the data warehouse, people would say you were crazy. Economics, technology and competition are forcing this scalability issue. I predict that in the future, call detail records won't be the detail; with VoIP, it'll be the network packets.

One of our biggest customers is FedEx. With RFID, the dozen or so scans it currently runs on packages could turn into hundreds of scans. Although RFID hasn't made its way into the low-end, high-volume retail stuff yet—because of the economics—it will.

What are some developing RFID applications that involve real-time analysis?

In the shipping example, there's the ability to track individual packages and re-route them in transit. Also, the ability to do operational controls, so if you see jam-ups in certain places, you can see where the packages are bottlenecking and go do something about it.

Is data warehousing inappropriate for making timely decisions based on streaming "real-time" data?

Service levels are evolving very quickly in terms of performance and how up-to-date the data needs to be in the data warehouse. What you need is an "active" data warehouse. You need the current data, but you also need the historical data, because you have to look at current data in the context of history to make good decisions. Some decisions made with streaming data require only local data, but those are the easy ones.

Look at fraud detection. One activity profile used during stream processing to flag a possibly stolen credit card is this: the card is used at a pay-at-the-pump gas station or in another non-human-facing transaction (because the thief is testing to see if the card is good), and then large transactions are made for consumer electronics, jewelry or certain other merchant category codes. As you get more sophisticated, you realize that profile doesn't actually work very well. Taking myself as an example, I only buy gas through pay-at-the-pump and sometimes go on consumer electronics buying binges. If you shut off my credit card, I'd be pretty pissed off.

Most people frequent five or fewer gas stations. So fraud detection should involve not just merchant category codes, but also merchant IDs and the historical data about which merchants an individual uses.
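To make the point concrete, here is a minimal Python sketch of such a rule. It is an illustration only, not Teradata's or any card issuer's actual logic: the `Txn` record, the category labels, the `is_suspicious` function and the 500.0 threshold are all hypothetical. The rule treats a pay-at-the-pump purchase as a possible "test" only when the merchant ID is absent from the cardholder's history, which is why the historical data has to be available at decision time.

```python
from dataclasses import dataclass

# Hypothetical category labels, chosen for illustration only.
UNATTENDED = {"pay_at_pump"}
HIGH_RISK = {"consumer_electronics", "jewelry"}


@dataclass(frozen=True)
class Txn:
    merchant_id: str
    category: str
    amount: float


def is_suspicious(history, recent, large_amount=500.0):
    """Return True if `recent` shows an unattended purchase at a merchant
    the cardholder has never used, followed by a large high-risk purchase.

    `history` is the cardholder's past transactions (the warehouse side);
    `recent` is the current stream of transactions, oldest first.
    """
    known_merchants = {t.merchant_id for t in history}
    probe_seen = False
    for t in recent:
        if t.category in UNATTENDED and t.merchant_id not in known_merchants:
            # Unfamiliar unattended merchant: possibly a thief testing the card.
            probe_seen = True
        elif probe_seen and t.category in HIGH_RISK and t.amount >= large_amount:
            return True
    return False


# A cardholder who always buys gas at the same pump, then buys electronics.
history = [Txn("GAS-001", "pay_at_pump", 40.0)] * 20
usual = [Txn("GAS-001", "pay_at_pump", 42.0),
         Txn("ELEC-77", "consumer_electronics", 900.0)]
odd = [Txn("GAS-999", "pay_at_pump", 1.0),
       Txn("ELEC-77", "consumer_electronics", 900.0)]
```

With category codes alone, both `usual` and `odd` would match the naive profile. Checking the merchant ID against history separates the habitual gas station from the unfamiliar one.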

So when you go to tactical decision support, it's true you're going narrower, because you're looking at fraud for your credit card, not everyone's, but you still need to go deep, historically. The rules for raising fraud alerts will change over time, but those changes will come from evaluating the success of the fraud detection.


On travel: In the past 10 years, only twice have I been in one city for more than five consecutive days. Once was in New York City during the terrorist attacks of Sept. 11, 2001, because I couldn't get out. My last [personal] international trip was a few weeks ago on Lake Titicaca in Bolivia. I walked about 20 miles a day for three days, starting at Copacabana and ending at the temple of the sun god on Isla del Sol. Last weekend, I climbed Longs Peak in the Rocky Mountains.