IBM announced in June that it has embarked on a quest to create a million new data scientists. It will be adding about 230 of them during its Datapalooza educational event this week in San Francisco, where prospective data scientists are building their first analytics apps.
Next year, it will take its show on the road to a dozen cities around the world, including Berlin, Prague, and Tokyo.
The prospects who signed up for the three-day Datapalooza convened Nov. 11 at Galvanize, the high-tech collaboration space in the South of Market neighborhood, to attend instructional sessions, listen to data startup entrepreneurs, and use workspaces with access to IBM's newly launched Data Science Workbench and Bluemix cloud services. Bluemix gives them access to Spark, Hadoop, IBM Analytics, and IBM Streams.
Rob Thomas, vice president of product development, IBM Analytics, said the San Francisco event is a test drive for IBM's 2016 Datapalooza events. "We're trying to see what works and what doesn't before going out on the road."
Thomas said Datapalooza attendees were building out DNA analysis systems, public sentiment analysis systems, and other big data apps.
Apache Spark sits at the center of IBM's education for future data scientists.
In June, IBM contributed its SystemML machine-learning engine to the Spark platform so that Spark can be used to analyze incoming streams of machine-generated data. Spark can serve both as a platform for capturing and analyzing the data and as a launch pad for retrieving it from other types of data repositories for analysis.
Unlike Hadoop, which relies on data being written to disk and read back for analysis, Spark can work with data held in random access memory, which speeds up retrieval and use. IBM spokesmen describe Spark as 100 times as fast as Hadoop when working with data in server memory.
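The difference between the two access patterns can be illustrated with a minimal sketch in plain Python. This is not Spark or Hadoop code and it uses neither API; the file name `events.txt` and the helpers `write_sample`, `disk_passes`, `memory_passes`, and `run_demo` are hypothetical names invented for illustration. It only shows that a disk-based job re-reads its input on every pass, while an in-memory job loads the data once and reuses it.

```python
import os
import tempfile


def write_sample(path, n):
    """Write n integers, one per line, to stand in for a dataset stored on disk."""
    with open(path, "w") as f:
        for i in range(n):
            f.write(f"{i}\n")


def disk_passes(path, passes):
    """Disk-based pattern: reopen and re-parse the file on every pass."""
    totals = []
    for _ in range(passes):
        with open(path) as f:
            totals.append(sum(int(line) for line in f))
    return totals


def memory_passes(path, passes):
    """In-memory pattern: read and parse once, then reuse the list held in RAM."""
    with open(path) as f:
        data = [int(line) for line in f]
    return [sum(data) for _ in range(passes)]


def run_demo(n=1000, passes=3):
    """Run both patterns over the same temporary file and return their results."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "events.txt")  # hypothetical file name
        write_sample(path, n)
        return disk_passes(path, passes), memory_passes(path, passes)


if __name__ == "__main__":
    disk_totals, memory_totals = run_demo()
    print(disk_totals == memory_totals)
```

Both functions compute the same totals; the in-memory version simply avoids repeating the file I/O and parsing on each pass. How large the speedup is in a real cluster depends on the workload, so the sketch makes no timing claim.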
Thomas explained that most machine-learning systems are built on a data system that uses one set of algorithms and one data model. When data from a different machine or type of machine event is collected, such a system requires a different model. Spark with SystemML is much more flexible than other data platforms. With it, an existing system can be adjusted to analyze an altered data flow without requiring a whole new system, Thomas said.
Spark is so central to the way IBM sees the future of data management that the company is converting many of its internal systems to run on Spark. So far it has also converted 15 products to Spark, including IBM SPSS statistical analysis, DataWorks data preparation and refinement, and a product pricing software module that helps companies dynamically address complex pricing issues, he said.
"We reduced the number of lines of code needed for DataWorks from 40 million to 5 million" by making use of the distributed data processing available in Spark, Thomas said. Spark also simplifies what prospective data analysts need to know to get started.
"We had some people who stayed here until 2 a.m. last night; they were that engaged," said Thomas.
But will three days of attending classes and programming with other budding analysts really turn out a data scientist? Thomas laughed at the question. "Most people who would like to do analytics have never built an application. Here they'll get the experience of building one," and they will be ready to go on to their next project, he said. That puts them on a more direct path to becoming a data scientist than many other routes, he added.
In addition to the Datapalooza, IBM now operates a Spark Technology Center in San Francisco.