A new class of streaming data software, responding to what may soon be a common demand, is melding deep analytics with the ability to crunch torrents of data in real time. Analyzing continuous, high-volume data feeds poses a special challenge for applications as varied as automated financial-market trading, security-incident detection and weather forecasting. These applications all use analytically discovered patterns to generate predictions, yet the value of these predictions is degraded by long processing times.
Until recently, if you needed real-time results, you had to settle for simple analyses such as scoring, where you plug up-to-the-second numbers into canned models and — based on the outputs — fire off alerts, make routing choices or make thumbs-up or thumbs-down decisions. Streaming data solutions offer more sophisticated analyses.
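To make "scoring" concrete, here is a minimal sketch in Python. The `Quote` record, the reference price, and the 2 percent threshold are illustrative assumptions rather than details of any vendor's product. The point is that the model is fixed in advance and each new number is simply plugged in.

```python
from dataclasses import dataclass


@dataclass
class Quote:
    """Hypothetical record for one up-to-the-second price quote."""
    symbol: str
    price: float


def score(quote: Quote, reference_price: float) -> float:
    """Canned model (assumed): relative deviation from a fixed reference price."""
    return (quote.price - reference_price) / reference_price


def decide(quote: Quote, reference_price: float, threshold: float = 0.02) -> str:
    """Turn the score into an alert or a thumbs-up decision."""
    s = score(quote, reference_price)
    if s > threshold:
        return "ALERT_HIGH"
    if s < -threshold:
        return "ALERT_LOW"
    return "OK"


if __name__ == "__main__":
    print(decide(Quote("ACME", 103.0), reference_price=100.0))
```

Nothing here looks at history or at other feeds, which is the limitation the streaming products aim to remove.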
Take securities-trading data, which, to simplify, consists of streams of ticker symbols and prices, lot sizes, times of the last trade, and bids and offers. NASDAQ alone hosts trading in 3,300 companies with billions of daily price quotes and trades. Apama, a U.K.-based vendor, is focusing on this market by offering systems that filter, join and analyze market data feeds. These solutions support "algorithmic" securities trading that applies complex, adaptive market models.
StreamBase, a company founded by database pioneer and Ingres/Postgres inventor Michael Stonebraker, is taking a general-purpose approach to securities trading. Stonebraker characterizes real-time, complex processing as "a very different challenge from the stored data challenge solved by relational databases." The stored-data approach involves assimilating operational data into warehouses that host many simultaneous, diverse data analyses or, alternatively, into smaller marts refined for narrower analyses. Regardless of scale, data acquisition involves time-consuming data cleansing, loading and indexing. Analytically structured databases simply can't keep up with the flood of data that a single stream may deliver.
Streaming data arrives when it's ready — irregularly and unpredictably. While point-in-time values matter, the data may contain important patterns that can be discerned only by looking at "time windows" rather than points and only by correlating data from multiple sources. In the securities industry, there's often interest in trends among tracked entities relative to comparators and historic patterns. Traders might want to detect anomalies that hint at risks and opportunities, perhaps fleeting, to either hedge or exploit. Meanwhile, most of the data in a feed might be extraneous and should be filtered out before analysis. Imagine sipping water from a fire hose.
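The filtering and time-window ideas can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any product's implementation: the `(timestamp, symbol, price)` tuple layout, the 60-second window, and the 5 percent tolerance are all hypothetical, and correlating multiple sources is left out for brevity.

```python
from collections import deque


class TimeWindow:
    """Prices seen during the last `span` seconds, for irregular arrivals."""

    def __init__(self, span: float):
        self.span = span
        self.items = deque()  # (timestamp, price) pairs, oldest first
        self.total = 0.0

    def evict(self, now: float) -> None:
        """Drop entries that have aged out of the window."""
        while self.items and self.items[0][0] <= now - self.span:
            _, old_price = self.items.popleft()
            self.total -= old_price

    def add(self, ts: float, price: float) -> None:
        self.items.append((ts, price))
        self.total += price

    def mean(self):
        return self.total / len(self.items) if self.items else None


def detect_anomalies(feed, watched, span=60.0, tolerance=0.05):
    """Yield (ts, symbol, price, baseline) when a price strays from its window mean."""
    windows = {}
    for ts, symbol, price in feed:
        if symbol not in watched:
            continue  # filter extraneous data before any analysis
        window = windows.setdefault(symbol, TimeWindow(span))
        window.evict(ts)
        baseline = window.mean()  # mean of the window before this tick
        if baseline is not None and abs(price - baseline) / baseline > tolerance:
            yield (ts, symbol, price, baseline)
        window.add(ts, price)
```

Each tick is compared with the recent window rather than with a fixed reference, and the window is advanced by the arriving data's own timestamps, so gaps and bursts in the feed are handled the same way.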
Notable streaming-data projects have emerged from industrial and academic research labs. RiverGlass is a security- and financial-risk-oriented commercialization of technology created at the University of Illinois to federate and detect patterns in heterogeneous document and data streams. Hancock from AT&T Labs Research is designed to monitor communications traffic. Intel Research Berkeley is developing software to handle large arrays of environmental, security and tracking "mote" sensor devices that produce data streams. And Coral8, a startup that leverages research done at Stanford University, is similarly targeting sensor-data analysis as well as financial, security-incident and operational intelligence applications that rely on continuous detection, calculation and analysis.
Like StreamBase, Coral8 provides an extended version of standard SQL designed for long-running, "incremental" queries over continuous data streams as well as for querying conventional stored, relational data. Stanford University's Continuous Query Language (CQL) is another variation. These query languages bring familiar SQL constructs, such as subqueries and joins, to data streams and add new stream-oriented operators.
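What "incremental" means can be shown without guessing at any vendor's SQL dialect. The Python sketch below mimics a long-running query that joins a stream of trades to a stored lookup table and keeps a grouped average up to date as each row arrives. The `sector_of` dictionary stands in for a relational table, and all names are hypothetical.

```python
from collections import defaultdict


def continuous_avg_by_sector(stream, sector_of):
    """Emit an updated (sector, average price) after every matching trade.

    Roughly analogous to a grouped AVG over a stream joined to a stored
    table, except that the result is revised per arrival instead of
    being recomputed from scratch.
    """
    count = defaultdict(int)
    total = defaultdict(float)
    for symbol, price in stream:
        sector = sector_of.get(symbol)  # join against the stored data
        if sector is None:
            continue  # inner-join semantics: unmatched rows are dropped
        count[sector] += 1
        total[sector] += price
        yield sector, total[sector] / count[sector]


if __name__ == "__main__":
    sector_of = {"ACME": "tech", "BOLT": "tech", "CORN": "agri"}
    trades = [("ACME", 10.0), ("BOLT", 20.0), ("CORN", 4.0)]
    for update in continuous_avg_by_sector(trades, sector_of):
        print(update)
```

Because the function is a generator, it never needs the whole stream in memory and can run for as long as the feed does, which is the property a conventional one-shot query lacks.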
Streaming-data technology is likely to gain market acceptance much more rapidly than relational database systems did. It targets a critical need for complex, real-time processing, a need that isn't met by RDBMS-reliant approaches, which are sluggish by comparison, or by activity-monitoring systems with shallow analytics. And because the technology can query both streaming and conventional relational data, it eases integration, first in the securities-trading niche but soon across a spectrum of applications that could benefit from data-intensive, real-time operational analytics.
Seth Grimes is a principal of Alta Plana Corp., a Washington, D.C.-based consultancy specializing in large-scale analytic computing systems.