A next-generation computational approach is earning front-line operational relevance for data warehouses, long considered a resource suited only to back-office, strategic data analysis. Emerging in-database analytics exploits the programmability and parallel-processing capabilities of database engines from vendors Teradata, Netezza, Greenplum, and Aster Data Systems. The programmability lets application developers move calculations into the data warehouse, avoiding the data movement that slows response time. Coupled with the performance and scalability advances of database platforms built on shared-nothing, massively parallel processing (MPP) architectures, database-embedded calculations respond to growing demand for high-throughput, operational analytics for needs such as fraud detection, credit scoring, and risk management.
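The idea of moving a calculation to the data can be sketched in a few lines. The example below is only an illustration, not any vendor's API: it uses Python's built-in sqlite3 module as a single-node stand-in for an MPP warehouse, and the table `txn`, the function `risk_score`, and its scoring rule are all hypothetical names invented for the sketch.

```python
import sqlite3

def risk_score(amount, prior_chargebacks):
    """Hypothetical scoring rule, invented purely for illustration."""
    score = 0.0
    if amount > 1000:
        score += 0.5
    score += min(prior_chargebacks, 5) * 0.1
    return round(score, 2)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE txn (id INTEGER PRIMARY KEY, amount REAL, prior_chargebacks INTEGER)"
)
conn.executemany(
    "INSERT INTO txn VALUES (?, ?, ?)",
    [(1, 50.0, 0), (2, 2500.0, 1), (3, 1200.0, 4), (4, 300.0, 6)],
)

# Register the function with the engine so SQL statements can call it by name.
conn.create_function("risk_score", 2, risk_score)

# The scoring and filtering run inside the engine; only flagged rows
# are returned to the caller instead of the whole table.
flagged = conn.execute(
    "SELECT id, risk_score(amount, prior_chargebacks) AS score "
    "FROM txn "
    "WHERE risk_score(amount, prior_chargebacks) >= 0.5 "
    "ORDER BY id"
).fetchall()
print([row[0] for row in flagged])
```

In a parallel warehouse the same pattern would run the function on every node against its local slice of the data, which is where the response-time gain described above comes from.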
Data-warehouse appliance vendor Netezza released its in-database analytics capabilities last May, and in September the company announced five partner-developed applications that rely on in-database computations to accelerate analytics. "Netezza's [on-stream programmability] enabled us to create applications that were not possible before," says Netezza partner Arun Gollapudi, CEO of Systech Solutions. "Our engine for Profit Analytics generates and calls user-defined functions to compute complex functions based on a set of business rules. The resulting data mart build takes four to eight hours as compared to three to four weeks with traditional approaches." Gollapudi adds that even complex and multiple "what-if" scenarios can now be modeled and tested.
Netezza on-stream analytics is the basis for a "modeling server" from partner RateIntegration. The offering enables interactive development of custom, rules-based data transformations, directly executed as Netezza user-defined functions. "One of our telephony carrier-customers provides continuous real-time margin alerts to analysts around the globe, 7x24, based on online analysis of rated call data records from their global network," says Bert Dempsey, vice president for product management at RateIntegration. "Every call in the network is now incorporated into the analysis within 30 to 60 minutes of call completion."
As a result, business-critical pricing and margin analyses, and usage-based micro-segmentation of the subscriber base can be moved online, "instead of being costly, slow, and difficult offline analysis workflows," Dempsey adds.
In-Database Is Nothing New
The in-database initiatives build on capabilities first commercialized in mid-'90s object-relational (OR) database systems from IBM, Illustra/Informix (now IBM), and Oracle. The OR systems let users create custom data types, index methods, and functions — for geospatial, textual, and time-series data, for instance. The vendors provided packaged DBMS extension modules, but user coding, which entailed SQL or C-language programming, never caught on.
The recent releases are designed to be easier to program than earlier OR implementations, and the parallelization greatly speeds code execution. Netezza, for example, has come up with an object-oriented development environment "that allows developers to concentrate on getting the algorithms right," says Netezza Vice President of Marketing Phil Francisco. The vendor furnishes an applications test bed, and it also provides development facilities and access to technology to customer, partner, and academic members of the Netezza Developer Network (NDN), which the company launched in September 2007. As of October 2008, Netezza counted more than 100 NDN members and 250 individuals trained to develop on the Netezza platform.
Teradata was founded in 1979, 21 years before Netezza's launch, and Teradata Chief Development Officer Scott Gnau boasts that the company has provided a framework for C-language database extensions since 1993. He points to high-performance, database-embedded encryption/decryption capabilities implemented by Teradata partner Protegrity as an illustration of the speed that can be realized with database-embedded processing. Protegrity claims in-database performance of more than 6 million decryptions and more than 9 million encryptions per second.
Teradata is partnering with SAS to implement in-database analytics. The companies' partnership, announced a year ago, will bear fruit with the general availability of the Teradata 13 platform in the first half of 2009 and of major parts of the SAS 9.2 release in the first quarter. The releases will include versions of a number of SAS algorithms recoded, some with SQL and others with user-defined functions and types, to take full advantage of Teradata's parallel architecture. Gnau adds that the partnership has already led to the release of in-database scoring of data-mining models exported from SAS.
"The SAS-Teradata partnership is such a cool thing because it bridges a cultural divide between the humans and the Vulcans — the database guys and the analytics types," says Gnau.
SAS says neural networks and linear regressions, as well as a series of base-SAS procedures, are slated for Teradata optimization with SAS 9.2. A shift in data-integration from extract-transform-load (ETL) to extract-load-transform (ELT) is also planned. That change will push computationally intensive data manipulation into the database. Other candidates for Teradata optimization include SAS Risk Management, which would be ported to use Teradata's Financial Services Logical Data Model (FSLDM). Credit/risk analyses for money laundering detection are also ideal candidates for database-embedded analytics, according to SAS.
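The ETL-to-ELT shift can be illustrated with a small sketch. This is not SAS or Teradata code: it again uses Python's sqlite3 module as a stand-in database, and the tables `staging_calls` and `call_charges` and their columns are hypothetical. Raw records are loaded untouched, and the transformation then runs as SQL inside the engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract and load: raw records land in a staging table as-is (all text).
conn.execute(
    "CREATE TABLE staging_calls (caller TEXT, seconds TEXT, rate_per_min TEXT)"
)
raw_rows = [("alice", "120", "0.10"), ("bob", "30", "0.20"), ("alice", "60", "0.10")]
conn.executemany("INSERT INTO staging_calls VALUES (?, ?, ?)", raw_rows)

# Transform: type casting and aggregation run as SQL inside the engine,
# so the raw rows never leave the database.
conn.execute(
    """
    CREATE TABLE call_charges AS
    SELECT caller,
           SUM(CAST(seconds AS INTEGER)) AS total_seconds,
           SUM(CAST(seconds AS REAL) / 60.0 * CAST(rate_per_min AS REAL)) AS total_charge
    FROM staging_calls
    GROUP BY caller
    """
)

charges = conn.execute(
    "SELECT caller, total_seconds, total_charge FROM call_charges ORDER BY caller"
).fetchall()
for caller, total_seconds, total_charge in charges:
    print(caller, total_seconds, round(total_charge, 2))
```

In a classic ETL flow the casting and summing would happen in a separate tool before loading. Here the database does the heavy lifting, which is the point of pushing the transform step to the end.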
Both SAS and Teradata have other development partners. For instance, Teradata's Gnau mentions data-mining vendors KXEN and Visual Numerics in the analytics arena, while SAS Global BI Product Marketing Manager Tammi Kay George says a long-standing SAS relationship with Teradata-competitor Netezza could potentially result in similar embedding of analytics. For now, the SAS-Netezza analytics-DBMS interface is limited to use of SAS/Access, optimized for the Netezza platform, to tap Netezza data sources.
Upstarts Join the Race
Greenplum and Aster Data Systems are newer but fast-growing entrants in the MPP data-warehousing market. Both claim petabyte-scale capabilities, and both implement a MapReduce framework for parallelization. MapReduce is notably employed by Google to enable indexing and search scalability. (The open-source Apache Hadoop project, which implements MapReduce, is the basis for the HBase open-source, distributed, column-store database system.)
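For readers unfamiliar with the model, a MapReduce job can be sketched as three steps: map each record to key-value pairs, group the pairs by key, and reduce each group to a result. The single-process Python sketch below only shows the shape of the computation; it is not Greenplum's or Aster's implementation, and the function names are hypothetical. A real framework would run the map and reduce steps on many nodes at once.

```python
from collections import defaultdict

def map_phase(record):
    """Emit (key, value) pairs for one input record."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group values by key, as a framework does between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)

def map_reduce(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = map_reduce(["fraud alert", "credit alert", "Fraud score"])
print(counts)
```

Because each record is mapped independently and each key is reduced independently, both phases can be spread across nodes without coordination, which is what makes the model attractive for shared-nothing systems.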
Greenplum executives claim that any developer can program in-database analytics for the Greenplum platform, and they estimate that 20 percent of customers are using Greenplum-embedded analytics. Greenplum CTO Luke Lonergan describes implementing advanced analytics tools, including routines from the BLAS and LINPACK linear algebra libraries and MLinReg multilinear regressions from Statpak, as well as an initiative to embed the open-source R statistical programming language in Greenplum. (In fact, the open-source PostgreSQL database system, on which Greenplum is based, has procedural language bindings for R, Python, Perl, and other programming languages that Greenplum can exploit.)
Greenplum claims to have more than 50 customers, including MySpace, eBay, the New York Stock Exchange, and Sun Microsystems. "Our customers are building their own enterprise-analytics applications," says Greenplum Marketing Vice President Paul Salazar, "and we're trying to make it easy for them." In addition to including the MapReduce and "programmable parallel analytics" implementations, Greenplum 3.2, released in September, gained in-database compression and enhanced database-monitoring capabilities.
Aster Data Systems touts the capability of its Aster nCluster 3.0 analytical database, released in October, to support "frontline" functions including credit scoring, behavioral ad-targeting, fraud detection, spam denial, recommendations, and risk modeling. CEO Mayank Bawa says the company has parallelized commonly used algorithms that provide for sequential pattern analysis and other transformations on live data. "That means you can do modeling inside the DB without exporting to SAS or other software," Bawa says. The executive points to linear regression, time-series modeling, and k-means clustering as examples of Aster-provided algorithms.
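To see why an algorithm such as linear regression suits in-database execution, note that a simple least-squares fit needs only a handful of sums over the data. The sketch below is not Aster's implementation; it uses Python's sqlite3 module as a stand-in, with a hypothetical table `obs`, and computes the sums in SQL so that only five numbers leave the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x REAL, y REAL)")
conn.executemany("INSERT INTO obs VALUES (?, ?)", [(1, 3), (2, 5), (3, 7), (4, 9)])

# One pass over the table inside the engine yields every statistic needed.
n, sx, sy, sxx, sxy = conn.execute(
    "SELECT COUNT(*), SUM(x), SUM(y), SUM(x * x), SUM(x * y) FROM obs"
).fetchone()

# Closed-form least-squares estimates for y = slope * x + intercept.
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n
print(slope, intercept)
```

Sums of this kind can be computed separately on each node and then added together, so the fit parallelizes naturally on a shared-nothing platform.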
Bawa says Aster's nCluster implementation of a MapReduce parallelization framework provides a "procedural programming paradigm" that supports SQL-invoked execution of off-the-shelf and user-programmed functions. "We have taken enormous care to make the process easy," says Bawa, adding that in-database code encapsulation ensures DBMS stability.
Netezza may have been the first to ship parallel, database-embedded analytics, but rivals were quick to launch competing technologies. More solutions and expanded partnerships are prominently on the roadmap for Aster, Greenplum, Netezza, and Teradata. The advances will help customers tap data warehouses for front-line, operational analytics, and they will help keep pressure on data-warehousing vendors, including Oracle and Microsoft, that do not yet offer database-system parallelization.