IBM is making a major commitment to the future of Apache Spark, with a series of initiatives announced today. IBM will offer Apache Spark as a service on Bluemix; commit 3,500 researchers to work on Spark-related projects; donate IBM SystemML to the Spark ecosystem; and offer courses to train 1 million data scientists and engineers to use Spark.
The commitment to Spark is "right in the heart of what [IBM] has been doing," said Rob Thomas, VP for product development for IBM Analytics, in an interview. That database heritage hearkens back to earlier commitments to Linux, and even further back to IBM's DB2 database product, he said. But it is rare for IBM to make a technological bet such as Spark, he added.
"This is the future of enterprise data," Thomas continued. "Anyone using data will have to leverage Spark."
The key problem Spark resolves is access to data across the enterprise. A typical large corporation will have hundreds, if not thousands, of data sets residing in different databases across its IT system.
A data scientist can certainly craft an algorithm to plumb the depths of any database. But "it takes a data scientist 90 days of work" to craft that algorithm, Thomas said. "Today, if you port it to another system, you are talking about another 90 days of work" to re-craft and adjust that algorithm in order to get it to work. Spark "eliminates that second 90 days," he said. A Spark-based system can seamlessly and transparently access and analyze any database, without additional development and delay.
Another virtue Spark possesses is ease of use. Developers can concentrate on building the solution, instead of building an engine from scratch.
IBM recently sponsored a hackathon during which more than 100 teams crafted new Spark-based apps in about 10 days. One team made a genomic cloud system to analyze DNA samples; another created a search engine to gauge public opinion based on sentiments perceived in text. Thomas pointed to these projects as "proof of concept" to show how quickly a competent team of two or three people can complete a project using Spark.
"The weakest part of Spark is the machine learning piece," Thomas noted. To that end, IBM will make available its SystemML machine learning technology to add learning capability to Spark apps, working with partner Databricks. This is not an algorithm library, but an engine that understands algorithms, Thomas said of SystemML.
While Spark looks promising, nothing will come of it without sufficient numbers of data scientists who actually use it. And data scientists don't grow on trees. IBM wants to educate about 1 million new users through a series of partnerships with AMPLab, DataCamp, MetiStream, Galvanize, and the Big Data University MOOC. The goal here is to make available a "data scientist's work bench" where users who know the R programming language can pick up Spark and its uses very quickly, Thomas said.
Ultimately, it falls to enterprises to make the best use of big data technology such as Spark. "Knowing the problem to solve—that will drive significant business value," Thomas said. CEOs are only beginning to understand how their data can be put to best use. Thomas offered the example of Moneyball, the 2003 book on how the Oakland Athletics sharpened their play of baseball through statistical analysis. "Data can make you think differently," Thomas said. And therein lies the quest for the advantages of insight.