Hadoop advocates say analysis of unstructured data yields predictive analytics useful to healthcare, business, and machine maintenance in the Internet of Things.

Charles Babcock, Editor at Large, Cloud

June 10, 2015



Hadoop's role has changed since it first burst upon the scene. Today it's less the batch processing engine for big data in bulk and more the platform on which many refined data skills may be displayed. Using it in all its capacities will prove "transformational to companies," said Rob Bearden, CEO of Hortonworks.

He was backed up by speakers from Microsoft, Forrester Research, and Enterprise Technology Research as the Hadoop Summit got underway Tuesday in San Jose, Calif. The conference runs through Thursday. Hadoop is making it possible to gather many dissimilar types of data and use them together in analytical processes that have been difficult or impossible to do in the past, the speakers said.

For one Hortonworks customer, Hadoop serves as the collection and analysis engine for the self-reported information from a fleet of trucks. The customer, a major trucking firm, set up a system to analyze lead-acid battery performance and alert maintenance when any battery dipped to 15% capacity. Replacing a battery at that point eliminates most of the risk that a dead battery will keep a truck from starting on time or delivering goods when due, said Bearden. Monitoring battery operations saved the company (which Bearden said he couldn't identify) $10 million in the first year the practice was implemented, he said.
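As a rough illustration of the alerting rule Bearden described, the sketch below flags any battery whose most recent reported capacity is at or below 15%. The record fields, the sample readings, and the `batteries_needing_service` helper are hypothetical; the article does not describe the trucking firm's actual data schema or code.

```python
from dataclasses import dataclass

# Threshold taken from the article: alert when a battery dips to 15% capacity.
ALERT_THRESHOLD_PCT = 15.0


@dataclass
class BatteryReading:
    truck_id: str         # hypothetical identifier for the reporting truck
    capacity_pct: float   # self-reported remaining capacity, 0-100


def batteries_needing_service(readings, threshold=ALERT_THRESHOLD_PCT):
    """Return sorted truck IDs whose latest reading is at or below the threshold."""
    latest = {}
    for reading in readings:
        # Readings are assumed to arrive in time order, so later ones overwrite earlier ones.
        latest[reading.truck_id] = reading.capacity_pct
    return sorted(truck for truck, pct in latest.items() if pct <= threshold)


# Made-up sample data, for illustration only.
sample = [
    BatteryReading("truck-001", 62.0),
    BatteryReading("truck-002", 18.0),
    BatteryReading("truck-002", 14.5),
    BatteryReading("truck-003", 15.0),
]
print(batteries_needing_service(sample))
```

In a real deployment this kind of filter would presumably run over a far larger stream of readings stored in the cluster; the in-memory list here only stands in for that data.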

Bearden offered the battery anecdote in the opening keynote to 4,000-plus attendees at the eighth annual gathering of Hadoop users, developers, and enthusiasts. The first, in 2007, was attended by 150 people, he recalled.


Hadoop started out as a distributed file system, HDFS, governed by MapReduce, which determined how data could best be processed by CPUs either on or close to the cluster node where the data was stored. Today, thanks to the Apache Software Foundation's Hadoop-related projects (such as Spark), Hadoop also serves as a processor of real-time streaming data.
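For readers unfamiliar with the MapReduce model, the following is a minimal single-process sketch of its map, shuffle, and reduce steps, using a word count as the example job. It does not use Hadoop's actual API. The function names and the sample partitions are assumptions made for illustration, and each inner list merely stands in for the data held on one cluster node.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single total."""
    return key, sum(values)


def run_job(partitions):
    """Run map over each partition, then shuffle and reduce the results."""
    mapped = []
    for partition in partitions:      # each partition stands in for one node's local data
        for record in partition:
            mapped.extend(map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


# Made-up input: two "nodes", each holding a few lines of text.
partitions = [
    ["hadoop stores data", "data lives on nodes"],
    ["nodes process local data"],
]
counts = run_job(partitions)
print(counts["data"], counts["nodes"])
```

In an actual cluster the map step would run on or near the node holding each partition, which is the data-locality idea described above; here everything runs in one process for clarity.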

The result will be data engines that don't much resemble relational databases or much of anything that's come before. Thomas Davenport, writing in the Wall Street Journal's CIO Report June 3, said:

"This new architecture involves not only Hadoop, but an entire series of new technologies. What they have in common is that many are open source, accommodate a wide variety of data types and structures, run on commodity hardware, [and] are somewhat challenging to manage."

"The shift is playing out in real time," said Bearden, quoting from the Davenport piece. The Hadoop open source software stack will unlock the relevant and valuable customer data needed for an interaction "before there's been a transaction," he said.

Predictive Results

Collecting data on truck batteries and other components is just one aspect of what can be accomplished with the approaching Internet of Things, but it will require data engines capable of handling massive amounts of data. Hadoop and the systems built atop it can identify mechanical parts that are about to fail and alert companies to them. One such system is Apache's HBase, which uses the Hadoop Distributed File System to hunt for and sort small, valuable sets of data within a much larger collection. Proper maintenance of expensive machines -- such as airplanes, locomotives, and wind turbines -- will take on a different meaning, one where the machines are not allowed to fail under most operating conditions, rather than being fixed after they fail, Bearden said.


Using the Hortonworks Data Platform, a version of Hadoop, United Healthcare tapped into a variety of unstructured customer data to predict which patients were most likely to fail to take their medications. A patient with diabetes who doesn't take medications as directed is more likely to end up with complications, leading to a 37% increase in the cost of care, compared to one who takes medicines as directed and avoids complications. The ability to predict some patient behaviors is in its infancy, but it has the prospect of disrupting and transforming how healthcare is delivered, Bearden said.

"This entire paradigm (of predictive results) is in its infancy," he added.

Microsoft's T. K. Rengarajan, corporate VP for the Azure Data Platform, came to the stage to show a map of the radiation found in the area around the Fukushima reactor in Japan after it experienced a meltdown following a tsunami in March 2011. No one knew how to measure the radioactivity until a local agency gave 500 people Geiger counters and told them to upload readings to Microsoft's HDInsight data platform, based on Hortonworks' version of Hadoop. They did so and, for the first time, a map of the varying levels of radiation was drawn for miles around the meltdown.

"What an amazing use of data that could change people's lives," he remarked.

Microsoft's HDInsight includes implementations of Apache Storm, Apache HBase, the Pig language for building Hadoop applications, Apache Hive data warehouse, Apache Sqoop for transferring data from relational databases into Hadoop, Apache Oozie workload scheduler, and Apache Ambari, an operational framework for managing Hadoop clusters.

Thomas DelVecchio, founder and director of Enterprise Technology Research, took the stage to declare: "2015 is the year that Hadoop open source took off. There's no better way to invest your spending priorities."

Michael Gualtieri, Forrester Research analyst, told the attendees: "If you're not an expert in predictive analytics, you need to get there." Eventually, 100% of large enterprises will adopt some form of Hadoop, he predicted.

Actual uptake of Hadoop trails these predictions. Gartner estimated that 11% of large enterprises will invest in Hadoop in the next 12 months and another 7% within 24 months, while 26% have already deployed it, have experimented with it, or have a pilot project underway. Those figures, however, may change as the value of predictive analytics becomes more widely understood.

About the Author(s)

Charles Babcock

Editor at Large, Cloud

Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive Week. He is a graduate of Syracuse University where he obtained a bachelor's degree in journalism. He joined the publication in 2003.
