In-Database Analytics: A Passing Lane for Complex Analysis

What once took one company three to four weeks now takes four to eight hours thanks to in-database computation. Here's what Netezza, Teradata, Greenplum and Aster Data Systems are doing to make it happen.
Upstarts Join the Race

Greenplum and Aster Data Systems are newer but fast-growing entrants in the MPP (massively parallel processing) data-warehousing market. Both claim petabyte-scale capabilities, and both implement a MapReduce framework for parallelization, the approach notably employed by Google to scale indexing and search. (Apache Hadoop, an open-source implementation of MapReduce, is the foundation for HBase, an open-source, distributed, column-oriented database system.)
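For readers unfamiliar with the pattern, the MapReduce idea can be sketched in a few lines of Python. This is an illustrative toy, not Greenplum's or Aster's implementation: the function names and the page-view data are hypothetical, and the "partitions" are plain lists standing in for data held on separate nodes of an MPP system.

```python
from collections import defaultdict
from itertools import chain


def map_phase(partition):
    """Emit a (key, 1) pair for every record in one data partition."""
    return [(url, 1) for url in partition]


def shuffle(mapped_outputs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped_outputs):
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's values into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def run_job(partitions):
    # In a real MPP system each map call would run on its own node;
    # here the calls run one after another for simplicity.
    mapped = [map_phase(p) for p in partitions]
    return reduce_phase(shuffle(mapped))


# Hypothetical page-view log split across three partitions.
partitions = [["/home", "/cart", "/home"], ["/cart", "/checkout"], ["/home"]]
counts = run_job(partitions)
print(counts)
```

Because each map call touches only its own partition, the work spreads across nodes without coordination until the grouping step, which is what makes the pattern attractive at petabyte scale.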

Greenplum executives claim that any developer can program in-database analytics for the Greenplum platform, and they estimate that 20 percent of customers are using Greenplum-embedded analytics. Greenplum CTO Luke Lonergan describes implementing advanced analytics tools, including routines from the BLAS and LINPACK linear algebra libraries and the MLinReg multilinear regression from Statpak, as well as an initiative to embed the open-source R statistical programming language in Greenplum. (In fact, the open-source PostgreSQL database system, on which Greenplum is based, has procedural-language bindings for R, Python, Perl, and other programming languages that Greenplum can exploit.)

Greenplum claims to have more than 50 customers, including MySpace, eBay, the New York Stock Exchange, and Sun Microsystems. "Our customers are building their own enterprise-analytics applications," says Greenplum Marketing Vice President Paul Salazar, "and we're trying to make it easy for them." In addition to the MapReduce and "programmable parallel analytics" implementations, Greenplum 3.2, released in September, added in-database compression and enhanced database-monitoring capabilities.

Aster Data Systems touts the capability of its Aster nCluster 3.0 analytical database, released in October, to support "frontline" functions including credit scoring, behavioral ad-targeting, fraud detection, spam denial, recommendations, and risk modeling. CEO Mayank Bawa says the company has parallelized commonly used algorithms that provide for sequential pattern analysis and other transformations on live data. "That means you can do modeling inside the DB without exporting to SAS or other software," Bawa says. The executive points to linear regression, time-series modeling, and k-means clustering as examples of Aster-provided algorithms.
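To see why an algorithm such as linear regression lends itself to this kind of parallelization, consider the sketch below. It is an assumption-laden illustration, not Aster's code: each data segment computes a handful of running sums independently, and only those small summaries are combined to solve for the fitted line. The function names and sample data are hypothetical.

```python
from functools import reduce


def partial_sums(rows):
    """Compute the sufficient statistics for one segment's (x, y) rows."""
    n = sx = sy = sxy = sxx = 0.0
    for x, y in rows:
        n += 1
        sx += x
        sy += y
        sxy += x * y
        sxx += x * x
    return (n, sx, sy, sxy, sxx)


def combine(a, b):
    """Merge two segments' statistics by element-wise addition."""
    return tuple(p + q for p, q in zip(a, b))


def fit(stats):
    """Solve simple least squares from the combined statistics."""
    n, sx, sy, sxy, sxx = stats
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("x values have no variance; slope is undefined")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical rows spread over three segments, all lying on y = 2x + 1.
segments = [[(0, 1), (1, 3)], [(2, 5)], [(3, 7), (4, 9)]]
stats = reduce(combine, (partial_sums(s) for s in segments))
slope, intercept = fit(stats)
print(slope, intercept)
```

Only five numbers per segment travel between nodes, regardless of how many rows each segment holds, which is the property that lets the model be fitted where the data lives rather than after an export.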

Bawa says Aster's nCluster implementation of a MapReduce parallelization framework provides a "procedural programming paradigm" that supports SQL-invoked execution of off-the-shelf and user-programmed functions. "We have taken enormous care to make the process easy," says Bawa, adding that in-database code encapsulation ensures DBMS stability.
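The general idea of SQL-invoked, user-programmed functions can be illustrated with Python's built-in sqlite3 module. SQLite is a single-node stand-in here, not an MPP database, and nothing below reflects nCluster's actual interface; the `risk_score` function, its scoring rule, and the `txn` table are hypothetical.

```python
import sqlite3


def risk_score(amount, prior_chargebacks):
    """Hypothetical scoring rule: larger amounts and past chargebacks raise risk."""
    return min(1.0, amount / 1000.0 * 0.5 + prior_chargebacks * 0.25)


conn = sqlite3.connect(":memory:")

# Register the Python function so SQL statements can call it by name.
conn.create_function("risk_score", 2, risk_score)

conn.execute(
    "CREATE TABLE txn (id INTEGER, amount REAL, prior_chargebacks INTEGER)"
)
conn.executemany(
    "INSERT INTO txn VALUES (?, ?, ?)",
    [(1, 200.0, 0), (2, 1500.0, 1), (3, 400.0, 3)],
)

# The function runs inside the query, next to the data, with no export step.
rows = conn.execute(
    "SELECT id, risk_score(amount, prior_chargebacks) FROM txn ORDER BY id"
).fetchall()
for txn_id, score in rows:
    print(txn_id, round(score, 2))
```

The analyst writes ordinary SQL while the procedural logic stays encapsulated in the registered function, which mirrors the division of labor Bawa describes.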

Netezza may have been the first to ship parallel, database-embedded analytics, but rivals were quick to launch competing technologies. More solutions and expanded partnerships are prominently on the roadmap for Aster, Greenplum, Netezza, and Teradata. The advances will help customers tap data warehouses for frontline, operational analytics, and they will help keep pressure on data-warehousing vendors, including Oracle and Microsoft, that do not yet offer database-system parallelization.