Apache Spark: 3 Promising Use-Cases - InformationWeek
3/27/2015
11:36 AM
James Kobielus

Apache Spark: 3 Promising Use-Cases

Spark is the shiny new thing in big data, but how will it stand out? Here's a look at "fog computing," cloud computing, and streaming data-analysis scenarios.


To survive competitive struggles, every fresh technological innovation must find clear use-cases in the marketplace. There must be some specific itch that the new approach can scratch at least as well as, and hopefully much better than, the alternatives.

As the mania for Apache Spark grows in the big-data analytics arena, we must remember that it's still an unproven technology. The early crop of commercial solutions that implement Spark hasn't yet converged on distinctive use-cases that call for Spark and not, say, Hadoop, NoSQL, or established low-latency analytics technologies. What is Spark's application sweet-spot?

When you ponder Spark's prospects, you must consider several related questions. What exactly are the core deployment models and use-cases for which Spark is best suited in today's crowded big data marketplace? What differentiators does Spark have over rival platforms, whether open source or proprietary, for addressing these requirements? And do these differentiators, taken as a whole, provide sufficient impetus for Spark to find its commercial sweet-spot rapidly and thereby achieve widespread adoption?


With these questions hanging in your mind, here are the principal deployment models in which Spark may prove its value in real-world applications:

Fog

The Internet of Things (IoT) may spell the end of data centers as we've traditionally known them. Data centers' core functions -- processing and storage -- are increasingly being decentralized out to the network's edges. The IoT is also greatly expanding the need for distributed, massively parallel processing of huge amounts of machine and sensor data of all sorts. Not just that, but the analytics required in these "fog computing" scenarios will increasingly emphasize low-latency, massively parallel processing of machine learning and graph analytics algorithms of great complexity.  

As I detail here, fogs are clouds in which the primary processing nodes are network-edge endpoints, such as sensor-laden IoT devices. Fogs distribute the storage, bandwidth, and other cloud resources out to the IoT endpoints, most of which are embedded deeply in the hardware infrastructure of the end applications. These fog requirements feel tailor-made for Spark, which includes an interactive real-time query tool (Shark), a machine-learning library (MLlib), a streaming-analytics engine (Spark Streaming), and a graph-analysis engine (GraphX). As the IoT industry converges, sometimes haltingly, toward a common fog infrastructure, Spark may just fill that niche better than any other open source platform.

Cloud

Spark, building on HDFS, clearly has the ability to shoulder practically any Hadoop cloud deployment model and use-case, not just those associated with the IoT. As a start, Spark can access and process data stored in HDFS, HBase, Cassandra, and any other Hadoop-supported storage system. As a general-purpose cloud platform, Spark boasts performance advantages vis-à-vis Hadoop, most notably its ability to parallelize models in real time across distributed in-memory clusters. And unlike Hadoop's MapReduce, Spark can combine SQL, streaming, and graph analytics within cloud analytics applications. The cloud market seems ripe for Spark, especially in an era where distributed, heterogeneous storage layers, streaming low-latency middleware, and in-memory cloud platforms are in the ascendant.

Stream

Spark may ride its adoption in IoT and cloud environments to become ubiquitous for stream-computing applications of all kinds. Some industry observers question whether Spark truly supports all the key requirements for robust stream processing. One might argue that other open source stream-computing platforms, such as Apache Storm and Apache Samza, have better performance, functionality, or development features than Spark for these use-cases. But one might just as well argue that Spark's advantages as a fog and cloud analytics platform lessen the need for it also to be the slam-dunk choice for stream computing. If Spark can support streaming analytics reasonably well for the majority of use-cases, it might become the standard there as well.

Apache Spark supports SQL, machine-learning, graph, and streaming analysis against a range of data types, and in multiple development languages.


What might prevent Spark from achieving widespread adoption in any or all of these markets is not just the presence of established platforms and tools (e.g., Hadoop) that adequately address 90% of the core use-cases. Over the next two to three years, the key obstacles to widespread Spark adoption may simply be Spark's immaturity, the paucity of field-proven, enterprise-grade Spark platforms, and the lack of a well-developed ecosystem of Spark tools, libraries, and applications.

Considering that many enterprises have now committed to Hadoop and various NoSQL platforms as their strategic big-data platforms, they may be reluctant to commit to Spark until it has truly proved its value in a sufficient number of real-world deployments. Likewise, most organizations with stream-computing requirements have already committed themselves to a commercial solution or perhaps an alternative open source platform.

Spark is the latest shiny new big-data bauble. To make the most of its "next-big-thing" status, Spark promoters will need to generate actual user demand for the technology. Advocates should avoid pitching it at customers who’ve become jaded by the incessant drumbeat for all things big data, especially Hadoop.

Spark will naturally float to its proper level in the big data ocean. Hyping it out of proportion to its competitive differentiation would only inspire a backlash. That would be counterproductive for Spark in the long run, and it would deter potential users from considering the technology long before they’ve had their first serious opportunity to kick the tires.


James Kobielus is IBM's Big Data Evangelist. He is an industry veteran who spearheads IBM's thought leadership activities in big data, data science, enterprise data warehousing, advanced analytics, Hadoop, business intelligence, data management, and next best action ...
Comments
asksqn,
User Rank: Ninja
3/30/2015 | 4:46:30 PM
Oorah for Spark
I'm looking forward to Spark given its above-referenced versatility and the fact that it is open source. I see nothing but very good things in Spark's future.
D. Henschen,
User Rank: Author
3/28/2015 | 7:30:00 AM
Re: Fog?
I had not heard the term "fog" either, but apparently it's a phrase Cisco is using as well. Follow the link in James' column to read more about it. As for precision, is "cloud" any more precise? Or "big data"? Sometimes terms manage to catch on as shorthand for a big collection of things. Fog's not there yet, so it sounds a bit forced.
Thomas Claburn,
User Rank: Author
3/27/2015 | 6:11:41 PM
Fog?
"Fog" computing just rubs me the wrong way. Why not aim for precision with "distributed computing" instead?
D. Henschen,
User Rank: Author
3/27/2015 | 12:48:15 PM
Interesting left-handed compliments here
James doesn't really sound all that enthusiastic about Spark, describing it as unproven and underplaying, in my view, the interest, adoption, and number of proven applications on the platform. Interestingly, IBM sponsored Spark Summit East a couple of weeks ago, but it's probably hoping to sell its own commercial solutions to the attendees of that event. It is officially supporting Spark, but it doesn't have a distribution of or integration with that software as yet.

As I understand it, Spark has more than 500 enterprise adopters, and Spark promoter Databricks has more than 50 beta customers for its Databricks Cloud service based on Spark. Streaming data analysis is just one play for Spark, which makes it a competitor to IBM InfoSphere Streams. How "proven" is Streams, I wonder, and how many customers does it have? Is InfoSphere Streams really getting into the same conversations as Spark and Storm? Big data practitioners seem to have a strong bias toward open-source options, not commercial software. Maybe open source is the real "shiny new thing" that commercial vendors are competing against.

 