Commentary | Charles Babcock | 6/23/2015 08:06 AM

Big Data Moves Toward Real-Time Analysis

The data warehouse, as valuable as it is, is history. The most valuable data will be that which is collected and analyzed during the customer interaction, not the review afterward.

It's clear there's a transformation underway in enterprise data handling. That much was evident among the big data aficionados attending the Hadoop Summit in San Jose, Calif., and the Spark Summit in San Francisco earlier this month.

One dimension of this transformation is the scale of the data being accumulated, as valuable "machine data" piles up faster than sawdust in a lumber mill. Another, less frequently discussed, is the movement of data toward near real-time use.

Thomas Davenport, writing June 3 in the Wall Street Journal's online CIO Report, called the shift in data architectures from a single large relational database server to the much larger capacities of distributed systems, such as Hadoop, "transformational." But he left out the element of rapid-fire timing. The digital economy demands not only analysis of prodigious amounts of data, but the ability to process it -- to sort out the nuggets -- in near real-time.

The data warehouse, as valuable as it is, is history. The most valuable data will be that which is collected and analyzed during the customer interaction, not the review afterward. The analysis that counts is not the results of the last three months, or even the last three days, but the last 30 seconds -- probably less.
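
To make the "last 30 seconds" point concrete, here is a minimal, hypothetical Python sketch of the shift: instead of querying last quarter's warehouse tables after the fact, an application keeps a rolling 30-second view of events and answers from it while the customer is still engaged. The event fields and window size are illustrative, not drawn from any particular product.

    import time
    from collections import deque

    WINDOW_SECONDS = 30
    recent = deque()  # (timestamp, purchase_amount) pairs from the last 30 seconds

    def record_event(amount, now=None):
        """Add an event, then drop anything older than the window."""
        now = time.time() if now is None else now
        recent.append((now, amount))
        while recent and recent[0][0] < now - WINDOW_SECONDS:
            recent.popleft()

    def rolling_average():
        """The 'last 30 seconds' answer, available during the interaction itself."""
        return sum(amount for _, amount in recent) / len(recent) if recent else 0.0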

In the digital economy, interactions will occur in near real-time. Data analytics will need to be able to keep up. Hadoop and its early implementers, such as Cloudera and Hortonworks, have risen to prominence based on their mastery of scale. They gobble data at a prodigious rate, one that was inconceivable a few years ago.

"We see 50 billion machines connected to the Internet in five to ten years," said Vince Campisi, CIO of GE Software, at the Hadoop Summit. "We see significant convergence of the physical and digital world." The convergence of the physical operation of wind turbines and jet engines with machine data means the physical object gets a virtual counterpart. Its existence is captured as sensor data and stored in the database. When analytics are applied, its existence there can take on a life of its own, and the system can predict when parts will break down and cause real-life operations to grind to a halt.

(Image: Max Altana/iStockphoto)

But Davenport's outline of the transformation was incomplete. It didn't include the element of immediacy, of near real-time results needed as data is analyzed. It's that immediacy element that IBM was acting on as it issued its ringing endorsement of Apache Spark.

Spark is the new kid on the block, an in-memory system that's not exactly unknown, but is still a stranger in data warehouse circles. IBM said it would pour resources into Spark, an Apache Foundation open source project.

"IBM will offer Apache Spark as a service on Bluemix, commit 3,500 researchers to work on Spark-related projects, donate IBM SystemML to the Spark ecosystem, and offer courses to train 1 million data scientists and engineers to use Spark," wrote InformationWeek's William Terdoslavich after IBM's announcement.

Is it wise to focus so much attention and effort on Spark? The big data field is in ferment. There's RethinkDB, there's the ambitious Redis project, and, for that matter, there's the commercial in-memory SAP HANA. With so many initiatives underway, was it wise for IBM to announce that Spark is "potentially the most significant open source project of the next decade"?

It's always tempting to ask: Significant to whom? Big data users, who need its speed? Or IBM, which was caught flat-footed by the NoSQL wave? Now IBM is clearly looking for fresh options, and in Spark it's found one.

Doug Henschen, formerly with InformationWeek and now part of Constellation Research, had this to say in his blog after the IBM endorsement: 

"IBM execs told analysts at the company’s new Spark Technology Center [in San Francisco that] it’s an all-in bet to integrate nearly everything in the analytics portfolio with Spark. Other tech vendors betting on Spark range from Amazon to Zoomdata …"

In addition, IBM executives explained the salient features of Spark that they liked:

  1. The task of data conversion and loading is handled automatically, allowing the Spark user to concentrate on data analysis, not data movement.
  2. Spark is flexible in its data processing capabilities. It's a platform where the task can be distributed, scheduled, and given proper I/O capacity, while the data gets filtered, reduced, and joined as needed.
  3. Its in-memory feature gives it an outlandish speed advantage over classic Hadoop, which relies on MapReduce, a disk-based system. In short, it excels at performance.
  4. It can host SQL queries, perform machine learning analytics, run Spark Streaming analysis of live data, and support the recently released SparkR interface for R that came out of Berkeley. (A brief PySpark sketch of several of these capabilities follows this list.)
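
The following minimal PySpark sketch illustrates points 2 through 4: data is joined and filtered through the DataFrame API, cached in memory for repeated use, and queried with SQL over the same in-memory tables. The table names, columns, and values are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("spark-features-sketch").getOrCreate()

    # Two small, hypothetical data sets: sensor readings and machine metadata.
    readings = spark.createDataFrame(
        [("m1", 98.6), ("m2", 212.0), ("m1", 99.1)],
        ["machine_id", "temperature"])
    machines = spark.createDataFrame(
        [("m1", "wind turbine"), ("m2", "jet engine")],
        ["machine_id", "machine_type"])

    # Point 3: keep the joined data in memory for repeated, fast queries.
    joined = readings.join(machines, "machine_id").cache()

    # Point 2: filter/reduce/join-style processing through the DataFrame API.
    hot = joined.filter(joined.temperature > 100.0)

    # Point 4: host SQL queries over the same in-memory data.
    joined.createOrReplaceTempView("machine_readings")
    avg_temp = spark.sql(
        "SELECT machine_type, AVG(temperature) AS avg_temp "
        "FROM machine_readings GROUP BY machine_type")

    hot.show()
    avg_temp.show()
    spark.stop()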

IBM said it would run its own analytics software on top of Spark, including SystemML for machine learning, SPSS, and IBM Streams.

Henschen concluded that the combination of analysis capabilities being built on top of Spark, along with its ability to make use of distributed, in-memory computing, was going to give it an edge in the long run. "By blending machine learning and streaming, for example, you could create a real-time risk-management app," he wrote. What’s more, Spark supports development in Scala, Java, Python, and R, which is another reason the community is growing so quickly.
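
Here is one way that "blend machine learning and streaming" idea could look in PySpark -- a sketch, not anyone's production design: a toy risk model is trained on historical transactions, then applied to a live stream so that high-risk activity surfaces within seconds. The feature names, the socket source, and the risk rule are all assumptions made for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, split
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("realtime-risk-sketch").getOrCreate()

    # 1. Train a toy risk model on historical (made-up) transactions.
    history = spark.createDataFrame(
        [(120.0, 1, 0.0), (8500.0, 9, 1.0), (60.0, 2, 0.0), (9900.0, 12, 1.0)],
        ["amount", "attempts", "label"])
    assembler = VectorAssembler(inputCols=["amount", "attempts"], outputCol="features")
    model = LogisticRegression(maxIter=10).fit(assembler.transform(history))

    # 2. Score a live stream of comma-separated transactions ("250.0,3") from a socket.
    lines = (spark.readStream.format("socket")
             .option("host", "localhost").option("port", 9999).load())
    txns = lines.select(
        split(col("value"), ",").getItem(0).cast("double").alias("amount"),
        split(col("value"), ",").getItem(1).cast("int").alias("attempts"))
    scored = model.transform(assembler.transform(txns))

    # 3. Surface only the high-risk transactions, seconds after they arrive.
    query = (scored.filter(col("prediction") == 1.0)
             .writeStream.outputMode("append").format("console").start())
    query.awaitTermination()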

At Spark Summit, Amazon Web Services announced a free Spark service running on Amazon Elastic MapReduce, and IBM announced plans for Spark services on Bluemix (currently in private beta) and SoftLayer. These cloud services will open the floodgates to developers, and IBM's contributions will surely help harden the Spark Core for enterprise adoption.

Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld, and former technology editor of Interactive ...
Comments
shakeeb, User Rank: Ninja
6/30/2015 | 11:21:38 AM
Re: Data
I am on your side. However, shouldn't people focus more on data integrity? In my view, that is the big question.
jagibbons, User Rank: Ninja
6/29/2015 | 12:38:57 PM
Re: Data
Good point, @shamika. I think analysts will find there is more GIGO when they are doing near-real-time analysis. There is a place for real-time analysis to spot instant trends (think healthcare issues such as flu and other outbreaks, air transportation, and data center network resources, to name a few areas). Sometimes, though, the immediate trend is not very long-lived, and we can look at immediate data out of context and make poor decisions. Good decision-making requires both the short-term and the longer-term view of the data.
shamika, User Rank: Ninja
6/29/2015 | 11:42:08 AM
Data
I agree with the first point. This will allow the analyst to focus more on data analysis. However, it is also important to look at the quality of the data; otherwise it falls under the garbage-in, garbage-out concept.