SAS made a slew of announcements at its annual user conference in Las Vegas last week, but none was more important than the news around its High-Performance Analytic (HPA) Server. Of particular importance was the news that HPA will work with Apache Hadoop, the fast-growing big-data processing platform.
HPA isn't currently something that most SAS customers use, much less hope to use in conjunction with Hadoop. But HPA is a cutting-edge product that is crucial to the company's future. Making HPA run on Hadoop is a key step to bringing SAS's vast portfolio of analytic capabilities into the open-source-dominated big data world where data scientists are writing their own algorithms, embracing big-data-focused startups, or adapting open-source code written in the R programming language.
SAS is using an agile development approach with HPA so it can quickly expand upon its capabilities and adapt to the pace of the big data world. "We're working on an open-source timeline," Tapan Patel, a SAS product marketing manager, told InformationWeek. "The Hadoop and R communities are making so many changes, so we have to adapt."
HPA already runs in the relational world on EMC Greenplum and Teradata. By using the massively parallel processing (MPP) power of these platforms for in-database analysis, analysts can save hours if not days over the old approach of moving data sets out of a data warehouse, analyzing them on a dedicated (but often underpowered) analytic server, and then moving the results back into the data warehouse.
Other vendors have adopted the in-database approach, including IBM and Oracle as well as EMC Greenplum and Teradata. And these database suppliers are working with SAS and SAS rivals including Alpine Data Labs, Fuzzy Logix, Revolution Analytics, and others to broaden the analytics they can apply within their databases.
SAS was an early pioneer of in-database work, and with HPA it was already supporting predictive analytics and data mining in partnership with EMC and Teradata. With the latest release announced last week (and now shipping), HPA has added text mining, optimization, and forecasting capabilities.
Text mining makes sense of text-rich information such as insurance claims, warranty claims, customer surveys, or the growing streams of customer comments on social networks. Optimization helps retailers and consumer goods makers, among others, with tasks such as setting prices for the best possible balance of strong-yet-profitable sales. Forecasting is used by insurance companies, for example, to estimate exposure or losses in the event of a hurricane or flood.
Where Hadoop is concerned, the latest release technically already runs on the platform, but it's limited to a SAS-customized version of the open source software based on Apache Hadoop v1.0 (also known as version 0.20.20x). SAS says HPA will run on mainstream distributions of Hadoop from the likes of Cloudera with an upcoming December release of HPA that will be based on Apache Hadoop v2.0 (also known as version 0.23).
Whether you're using SAS's current Hadoop software or plan to embrace the v2.0 release, HPA provides a graphical user interface that lets you tap HDFS, MapReduce, Pig, and Hive to apply SAS analyses to the vast data sets residing on Hadoop. MapReduce is the primary model for processing data on Hadoop. Pig is an open source Apache programming tool and language for writing MapReduce jobs. Hive is a data warehousing infrastructure built on top of Hadoop that supports data summarization, query, and analysis. HPA also supports Pig and MapReduce code generation, visual editing, and syntax checking. Finally, SAS Data Integration Studio data transformations and SAS DataFlux data quality routines have also been adapted to Hadoop.
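For readers new to the model, the MapReduce pattern mentioned above can be sketched in a few lines of plain Python. This is an illustration only: it runs on a single machine, the claim text is made up, the function names are hypothetical, and none of it is SAS or Hadoop code. On a real cluster the map and reduce steps would run in parallel across many nodes, with the framework handling the grouping step in between.

```python
from collections import defaultdict


def map_phase(records):
    """Map step: emit a (word, 1) pair for every word in every record."""
    for record in records:
        for word in record.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group values by key, the job a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical sample data standing in for warranty-claim text
claims = ["engine noise", "engine stall", "brake noise"]
counts = reduce_phase(shuffle(map_phase(claims)))
```

Counting words across text records is the simplest form of the text-mining workload described earlier; tools such as Pig exist so that analysts do not have to write these steps by hand.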
The key question is whether Hadoop practitioners, who may by now be accustomed to open-source and home-grown analytics, will want to bring a commercial product like SAS into what many view as a new computing paradigm.
"We're going open source as a company, so our skill set has had to change over the last three years," says Phil Shelley, vice president and chief technology officer at Sears Holdings. The shift started at the operating system level, with a move toward Linux, but it has since worked its way up the stack to the database and analytics level. "Our statistical people used to just use SAS and other [commercial] products, but now we're teaching them to use R on Hadoop," Shelley says.
Cost will certainly be a software selection factor, as that's a big reason companies are adopting Hadoop; they're trying to retain and make use of all their data, and they're expecting cost savings over conventional relational databases when scaling out over hundreds of terabytes or more. Sears, for example, has more than 2 petabytes of data on hand, and Shelley says that until it implemented Hadoop two years ago, the company was constantly outgrowing databases and still couldn't store everything on one platform.