7/8/2014
11:10 AM

Pentaho Preps Data On Hadoop, Analyzes On MongoDB

Pentaho 5.1 adds YARN support for predictive-analysis data prep and transforms JSON for analysis on MongoDB.

Open-source analytics and data-integration software provider Pentaho burnished its big-data credentials with newly released support for Hadoop's YARN management layer and the popular MongoDB NoSQL database.

The new capabilities, which shipped in late June as part of Pentaho 5.1, enhance the company's already strong presence in the world of big data by bolstering data preparation for predictive analysis.

"The big 'aha' in getting a return on analyzing a petabyte of information is being able to predict what the customer is going to do next, whether that's buy something, commit fraud, or churn," said Quentin Gallivan, Pentaho's CEO, in a phone interview with InformationWeek. "Our vision is to befriend the data scientist by building a studio where they can orchestrate and profile their data and then use their tool of choice for prediction."

Support for YARN, the resource-management layer introduced in Hadoop 2.0, is crucial to that vision because it enables Pentaho's analytics studio to operate directly on all the data stored in Hadoop while also taking advantage of Hadoop's distributed processing power. The studio supports data orchestration, data cleansing, and data profiling. With the Data Science Pack included in Pentaho 5.1, that functionality is integrated with Pentaho's Weka data-mining tool and with the popular open-source R language, with support for parallelized processing.

"In predictive analytics, 80% of the effort is getting to clean, structured data that's ready to analyze, so we've done the work to do the data transformation, enrichment, and profiling needed to turn a petabyte of unstructured data on Hadoop into data that's ready for analysis," Gallivan explained.

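To make the profiling step Gallivan describes more concrete, here is a minimal Python sketch. It is not Pentaho's implementation: the `profile` helper, the field names, and the sample records are all hypothetical, and it assumes a non-empty list of flat records. It reports how often each field is filled in and which value types appear, the sort of check that tends to surface dirty data before modeling.

```python
from collections import Counter

def profile(records):
    """Report fill rate and observed value types for each field.

    Assumes `records` is a non-empty list of flat dicts (hypothetical shape).
    """
    fields = sorted({key for record in records for key in record})
    report = {}
    for field in fields:
        values = [record.get(field) for record in records]
        # Treat missing keys, None, and empty strings as "not filled in".
        present = [v for v in values if v not in (None, "")]
        report[field] = {
            "fill_rate": len(present) / len(records),
            "types": dict(Counter(type(v).__name__ for v in present)),
        }
    return report

# Hypothetical raw records with missing values and an inconsistent type.
raw = [
    {"customer_id": 1, "spend": 42.5},
    {"customer_id": 2, "spend": "n/a"},
    {"customer_id": 3, "spend": None},
    {"customer_id": 4},
]
summary = profile(raw)
print(summary["spend"])
```

In this sample, the `spend` field is filled in only half the time and mixes numbers with strings, which is the kind of finding that would drive a cleansing step before any predictive model is trained.
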
The Data Science Pack included with 5.1 allows R scripts as well as Weka scoring and forecasting models to be run on Pentaho Data Integration. Future releases will add data-prep support for tools including SAS, Matlab, and Mahout, said Gallivan.

Pentaho 5.1 also adds support for MongoDB, which has become "a killer, next-generation application database," said Gallivan. Pentaho is running its business intelligence, data-visualization, and OLAP tools on MongoDB's JSON data format.

"We transform the JSON to run effectively in an MDX [OLAP] environment," Gallivan explained. "MongoDB users want the richness of a data-discovery environment with data visualization against native JSON."

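As a rough illustration of what preparing JSON for OLAP-style analysis can involve, the Python sketch below flattens nested documents into flat rows and sums a measure by a dimension. It is a sketch under stated assumptions, not Pentaho's or MongoDB's method: the document shape, the field names, and the `flatten` and `rollup` helpers are all hypothetical.

```python
from collections import defaultdict

def flatten(doc, prefix=""):
    """Flatten a nested dict into one flat dict with dotted keys."""
    row = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=name + "."))
        else:
            row[name] = value
    return row

def rollup(rows, dimension, measure):
    """Sum a numeric measure grouped by one dimension column."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row[measure]
    return dict(totals)

# Hypothetical order documents, shaped like records in a document store.
orders = [
    {"customer": {"region": "EMEA"}, "order": {"total": 120.0}},
    {"customer": {"region": "EMEA"}, "order": {"total": 80.0}},
    {"customer": {"region": "APAC"}, "order": {"total": 50.0}},
]

rows = [flatten(doc) for doc in orders]
sales_by_region = rollup(rows, "customer.region", "order.total")
print(sales_by_region)
```

Dotted column names such as `customer.region` keep the original nesting visible while giving an aggregation step the flat dimensions and measures it expects.
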
Pentaho's open-source software is used by more than 20,000 organizations. Among these, more than 1,500 customers pay for enterprise software and support, and at least 250 have successful big-data deployments, according to Gallivan. As Gallivan detailed in a recent interview, most of those customers fall into one of five deployment scenarios: 360-degree customer view, Internet of Things, data warehouse optimization, big-data refinery, or data security.

Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of ...

Comments
Lorna Garey,
User Rank: Author
7/8/2014 | 3:48:24 PM
Re: Data integration vendors are hot to get in on big data
How will Pentaho monetize this? The number of customers paying for enterprise support doesn't seem all that high.
D. Henschen,
User Rank: Author
7/8/2014 | 3:29:39 PM
Re: Pentaho system, ungainly or powerful?
Sorry, but I guess the headline is potentially misleading. Data prep on Hadoop is in service of predictive analysis (done with tools such as Pentaho Weka, R, or, soon according to Pentaho, SAS or Matlab). The support for MongoDB is a separate thing, only for BI/data-visualization style analysis (not predictive work) on the data managed by MongoDB. The two are not connected other than the fact that they are both capabilities introduced in Pentaho 5.1.
Charlie Babcock,
User Rank: Author
7/8/2014 | 3:24:06 PM
Pentaho system, ungainly or powerful?
To "befriend the data scientist" is no easy task. It's all too easy to be a friend to few, stranger to many. The combination of Hadoop with YARN on top for data prep, with the results plugged into MongoDB, sounds like a powerful system -- as long as the movement between the two of them is smooth.
D. Henschen,
User Rank: Author
7/8/2014 | 1:13:09 PM
Data integration vendors are hot to get in on big data
When Hadoop first emerged, we all heard it would displace ETL. That's at least partially true, for some transformation processing, but now data-integration vendors -- like Informatica, Paxata, and now Pentaho -- are saying their stuff is needed for all sorts of data prep and processing ahead of big-data analysis. It's another case of offering an alternative to clunky MapReduce processing, but I haven't talked to enough customers who have validated how useful these tools can be in big-data-analysis scenarios.

The "80% of the work" line above seems like a relic of relational data warehousing approaches, but I need to hear from more practitioners -- yes, this is a naked plea for comments from practitioners -- before passing this off as an overstatement or marketing ploy.