ClearStory taps HDFS and Apache Spark in the cloud to let business users blend high-scale, variable data and analyze it with in-memory speed.

Doug Henschen, Executive Editor, Enterprise Apps

March 30, 2015

In a retail sales-analysis scenario, business users at a consumer packaged goods company could use ClearStory to blend and analyze disparate data from retailers and third-party sources.


It combines the scalability and variable-data adaptability of Hadoop, the in-memory analysis speed of Apache Spark, and the agility and usability of a cloud-based tool designed for business analysts.

These are the traits that ClearStory Data promises. With a new release of its cloud service announced on Monday, the company said it's delivering greater control over data-blending and analysis, more types of analyses, and better performance, due to behind-the-scenes integration of the latest (version 1.2) data-processing engine from Apache Spark, the distributed, in-memory analytics platform.

"Previously customers would load their data and use our tool to find correlations using our data-harmonization engine, but it was almost like a black box," said Vaibhav Nivargi, ClearStory's co-founder and chief architect, in a phone interview with InformationWeek. "With the new release, we're striking a balance between the simplicity of delivering automated recommendations and giving power users a lot more flexibility and control over how they harmonize data."


When users upload data into the ClearStory service, it's stored on a Hadoop Distributed File System (HDFS). This infrastructure, which is managed entirely by ClearStory, lets customers blend a variety of high-scale data without predefined data modeling or complex ETL work. The data is then blended, and notable overlaps and correlations are exposed after processing in Apache Spark's core in-memory query-optimization engine. Business users work in a ClearStory-developed Storyboard analysis environment rather than using Spark tools such as Spark SQL, MLlib, Spark Streaming, or GraphX.

"Business users who can conceptually understand forecasting, clustering, or segmentation don't want to be burdened with picking algorithms and parameters or creating and serializing models," said Nivargi. "With Storyboards you can do statistical operations, find correlations in data, drill in or out based on attributes in the data set, and you can bring in external data sets and create joins, which we call harmonization."
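To make the idea of harmonization as a join concrete, here is a minimal sketch in plain Python. It is not ClearStory's or Spark's actual API. The data sets, field names, values, and the `harmonize` helper are all hypothetical, invented purely for illustration; the scenario loosely mirrors the retail sales-analysis example above.

```python
from collections import defaultdict

# Hypothetical retailer sales records (illustrative values only).
retailer_sales = [
    {"store_id": "S1", "week": "2015-W10", "units": 120},
    {"store_id": "S2", "week": "2015-W10", "units": 80},
    {"store_id": "S3", "week": "2015-W10", "units": 45},
]

# Hypothetical third-party data keyed by the same store identifier.
store_demographics = [
    {"store_id": "S1", "region": "West"},
    {"store_id": "S2", "region": "East"},
]


def harmonize(left, right, key):
    """Inner-join two lists of dicts on a shared key (a simple hash join)."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    joined = []
    for row in left:
        # Rows with no match on the right side are dropped (inner join).
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined


blended = harmonize(retailer_sales, store_demographics, "store_id")

# Aggregate units by region once the two sources are blended.
units_by_region = defaultdict(int)
for row in blended:
    units_by_region[row["region"]] += row["units"]

print(dict(units_by_region))
```

Store S3 has no match in the third-party data, so an inner join drops it. A real service would presumably have to decide how to surface such unmatched rows to the user, but that choice is an assumption here, not something the source describes.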

Storyboards are more flexible than dashboards, according to Nivargi, because they can be changed, adapted, and augmented with new data by business users, whereas dashboard changes often have to be handled by IT staff or power users.

With its combination of graphical data-exploration and data-analysis capabilities, the ClearStory service seems to have much in common with Databricks Cloud, the Spark-based service (currently in beta) offered by the developer and promoter of Apache Spark. Other products that come to mind include Platfora and Datameer, though these are on-premises tools (with the latter having a software-hosting option).

ClearStory is different from Databricks Cloud because the latter is "something for more sophisticated users, including data scientists, who are comfortable coding in Scala, Spark SQL, or Python," according to Nivargi. And ClearStory doesn't compete with Platfora and Datameer, he said, because those tools are deployed on top of customer-managed Hadoop deployments. ClearStory, in contrast, manages the data infrastructure behind its services in the cloud, and that complexity is not exposed to the customer.

In another differentiator, ClearStory touts data-lineage and data-access controls required by regulated businesses. The new release is said to show the origin of source data and its original structure and shape, even after it's blended into larger data sets exposed and analyzed within ClearStory. Also new in the upgrade is a guided user model designed to enable line-of-business users without deep IT or BI training to access, prepare, blend, and harmonize data.
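One way to picture data lineage surviving a blend is to tag each row with its source and original columns before joining. The sketch below is a generic illustration in plain Python, not ClearStory's implementation; the `tag_lineage` and `blend` helpers, the `_lineage` field, and the source names are all hypothetical.

```python
def tag_lineage(rows, source_name):
    """Attach the source name and original column names to every row."""
    return [
        {**row, "_lineage": [{"source": source_name, "columns": sorted(row)}]}
        for row in rows
    ]


def blend(left, right, key):
    """Join tagged rows on `key`, concatenating their lineage records."""
    index = {row[key]: row for row in right}
    blended = []
    for row in left:
        match = index.get(row[key])
        if match is None:
            continue  # no counterpart in the right-hand source
        merged = {**row, **match}
        # Keep both origins so the blended row can be traced back.
        merged["_lineage"] = row["_lineage"] + match["_lineage"]
        blended.append(merged)
    return blended


# Hypothetical sources with illustrative values.
sales = tag_lineage([{"sku": "A1", "units": 10}], "retailer_feed")
prices = tag_lineage([{"sku": "A1", "price": 2.5}], "pricing_vendor")

result = blend(sales, prices, "sku")
sources = [entry["source"] for entry in result[0]["_lineage"]]
print(sources)
```

Each blended row carries a record of which source contributed it and what that source's columns originally looked like, which is the kind of traceability the article attributes to the new release.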

ClearStory boasts a high-profile list of namable customers including Coca-Cola, Dannon, Del Monte, and Merck.


About the Author(s)

Doug Henschen

Executive Editor, Enterprise Apps

Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of Transform Magazine, and Executive Editor at DM News. He has covered IT and data-driven marketing for more than 15 years.
