Pentaho Preps Data On Hadoop, Analyzes On MongoDB

Pentaho 5.1 adds YARN support for predictive analysis and transforms JSON for analysis on MongoDB.

Doug Henschen, Executive Editor, Enterprise Apps

July 8, 2014



Open-source analytics and data-integration software provider Pentaho burnished its big-data credentials with newly released support for Hadoop's YARN management layer and the popular MongoDB NoSQL database.

The new capabilities, which shipped in late June as part of Pentaho 5.1, enhance the company's already strong presence in the world of big data by bolstering data preparation for predictive analysis.

"The big 'aha' in getting a return on analyzing a petabyte of information is being able to predict what the customer is going to do next, whether that's buy something, commit fraud, or churn," said Quentin Gallivan, Pentaho's CEO, in a phone interview with InformationWeek. "Our vision is to befriend the data scientist by building a studio where they can orchestrate and profile their data and then use their tool of choice for prediction."


Support for YARN, the resource-management layer introduced in Hadoop 2.0, is crucial to that vision because it enables Pentaho's analytics studio to operate directly on top of all the data stored in Hadoop while also taking advantage of its distributed processing power. The studio supports data orchestration, data cleansing, and data profiling. With the Data Science Pack included in Pentaho 5.1, that functionality is integrated with Pentaho's Weka data-mining tool and with the popular open-source R language, with support for parallelized processing.

"In predictive analytics, 80% of the effort is getting to clean, structured data that's ready to analyze, so we've done the work to do the data transformation, enrichment, and profiling needed to turn a petabyte of unstructured data on Hadoop into data that's ready for analysis," Gallivan explained.

The Data Science Pack included with 5.1 allows R scripts as well as Weka scoring and forecasting models to be run on Pentaho Data Integration. Future releases will add data-prep support for tools including SAS, MATLAB, and Mahout, said Gallivan.

Pentaho 5.1 also adds support for MongoDB, which has become "a killer, next-generation application database," said Gallivan. Pentaho is running its business intelligence, data-visualization, and OLAP tools on MongoDB's JSON data format.

"We transform the JSON to run effectively in an MDX [OLAP] environment," Gallivan explained. "MongoDB users want the richness of a data-discovery environment with data visualization against native JSON."
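To make the idea of reshaping JSON for OLAP-style analysis concrete, here is a minimal Python sketch. It is not Pentaho's implementation, and the article does not describe how Pentaho performs the transformation. The sample order documents, the field names, and the `flatten` and `rollup` helpers are all hypothetical, chosen only to illustrate how nested documents can be turned into flat rows of dimensions and measures that can then be aggregated.

```python
import json
from collections import defaultdict


def flatten(doc, prefix=""):
    """Flatten nested dicts into one level, joining keys with dots."""
    row = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into sub-documents, extending the dotted prefix.
            row.update(flatten(value, prefix=f"{name}."))
        else:
            row[name] = value
    return row


def rollup(rows, dimension, measure):
    """Sum a numeric measure for each distinct value of a dimension."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row[measure]
    return dict(totals)


# Hypothetical documents, shaped like what a document store might hold.
raw = """[
  {"customer": {"region": "West", "segment": "Retail"}, "order": {"total": 120.0}},
  {"customer": {"region": "West", "segment": "Online"}, "order": {"total": 80.0}},
  {"customer": {"region": "East", "segment": "Retail"}, "order": {"total": 50.0}}
]"""

rows = [flatten(doc) for doc in json.loads(raw)]
sales_by_region = rollup(rows, "customer.region", "order.total")
print(sales_by_region)  # {'West': 200.0, 'East': 50.0}
```

Once the documents are flat, fields such as `customer.region` behave like dimensions and `order.total` like a measure, which is the shape an OLAP query language such as MDX expects to work with.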

Pentaho's open-source software is used by more than 20,000 organizations. Among these, more than 1,500 customers pay for enterprise software and support, and at least 250 have successful big-data deployments, according to Gallivan. As Gallivan detailed in a recent interview, most of those customers fall into one of five deployment scenarios: 360-degree customer view, Internet of Things, data warehouse optimization, big-data refinery, or data security.


About the Author

Doug Henschen

Executive Editor, Enterprise Apps

Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of Transform Magazine, and Executive Editor at DM News. He has covered IT and data-driven marketing for more than 15 years.
