The Strata+Hadoop Show is about to begin at New York's Jacob K. Javits Convention Center. However, IBM's Rob Thomas is above it all.
As Big Blue's vice president for product development in the company's analytics division, he is trying to spot the future before it gets here, buying up the smaller companies whose technologies will flesh out IBM's cloud and big data lineups.
The building blocks are out there, but only if you know where to look.
Picture this market as a Dagwood sandwich. The bottom slice of bread is Hadoop and other databases. Thomas views this as an area that is rapidly becoming a commodity, so no value-add there. The middle slice of bread is Spark, which will "have a central role in data like Linux in operating systems," he observed.
The top slice of the bread belongs to the apps that will run on Spark as it sits over Hadoop.
It is in the layer between Spark and apps that he hopes to find the "meat" in the sandwich, made up of machine learning, artificial intelligence, and advanced analytics. This area cannot be commoditized. It should retain enough value to generate margins for IBM.
But there is a catch. The sandwich needs some bacon, lettuce, and tomato to be truly complete. That part is played by the data scientists skilled enough to develop apps in that machine learning, artificial intelligence, and advanced analytics space.
To illustrate the problem, Thomas drew a triangle and then ran a line below the peak to mark the capstone: the PhDs who make up the true data scientists. At the base, he drew another line, below which the "fake data scientists" reside -- basically the IT people who think they know what they are doing but don't.
In the middle is an "80%" space where experienced IT people can become fluent in data science techniques to make big data operative for their employers.
"I am trying to get IT folks to move up the stack," Thomas said. With that in mind, IBM is launching its "Datapalooza" tour in November, a three-day immersive "camp" that will be the starting point for IT people interested in learning how to do some of the work of data scientists to best effect. The tour will hit 12 cities worldwide.
Datapalooza will serve as a gateway for IBM's "Big Data U," where IT people can get more information online about big data and its applications.
This will also dovetail with an October release of IBM's Jupyter notebook product, available through Bluemix. It will be a "for fee" product, Thomas added.
The ramp-up to bring the IT masses to big data is already underway. This past June, IBM released a "data science" workbench that now has tens of thousands of users.
His goal over the next few years is to offer IT personnel a migration path to that 80% middle area in the data science pyramid he outlined earlier in the talk.
It's a matter of "giving people the right training and giving them the tools."
IT people doing big data will not rival the skill set of a PhD, but a team with enough skilled "data science" practitioners should be able to cover for any individual member's gaps, he explained. Even so, training enough practitioners will take a couple of years.
PhDs, on the other hand, take five years to get through their programs and are never plentiful.
Cloud is developing concurrently with big data and is expected to play a significant role in cultivating big data's potential.
"Cloud is moving faster than you think," Thomas added. With companies rushing to cash in on this technology, only half of current IT staffs will be needed in the next five years, he pointed out.
Fitting All Pieces Together
So how does this fit in with big data?
It is a question of technique. Traditional IT practice is to move data to where it is going to be processed. That approach runs into a networking problem, since bandwidth may not be able to handle petabytes of data.
In the cloud, "you bring the data to one place, then apply a different set of data services to that," Thomas said. "You create a 'fluid data layer' and you process it wherever it is. You are not constantly running ETL work to run it from one repository to the next."
Here, big data practitioners should be able to use the right toolsets to prepare the data for analysis, and apply the right model that identifies which pieces of data are relevant to solving a problem or executing a process.
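The contrast Thomas draws can be sketched in miniature. In the toy example below, the ETL-style function copies every record into a new "warehouse" before querying it, while the "fluid" style applies the same query to records where they already sit. All names here are illustrative, not an IBM or Spark API:

```python
# Sketch: ETL-style copy vs. querying data in place.
# All names are hypothetical; this only illustrates the pattern.

def etl_then_query(source, predicate):
    """Traditional style: extract everything into a new repository,
    then run the query against the copy."""
    warehouse = [dict(rec) for rec in source]  # full copy over the "network"
    return [rec for rec in warehouse if predicate(rec)]

def query_in_place(source, predicate):
    """'Fluid data layer' style: stream records from wherever they
    live and apply the query without materializing a copy."""
    return [rec for rec in source if predicate(rec)]

if __name__ == "__main__":
    sensor_logs = [
        {"device": "a", "temp": 71},
        {"device": "b", "temp": 99},
        {"device": "c", "temp": 68},
    ]
    hot = lambda rec: rec["temp"] > 90
    # Same answer either way; only the data movement differs.
    print(query_in_place(sensor_logs, hot))  # [{'device': 'b', 'temp': 99}]
```

At petabyte scale the copy in the first function is the bottleneck Thomas is describing; the second style is what distributed engines such as Spark make practical.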
That is pretty much the future as he outlined it. Getting there will take time.
Machine learning is one of those building blocks, but right now there are many small companies that are delivering niche products for specific industries or going to market with general-purpose machine learning engines. A lot of work still needs to be done to make Spark the "operating system" for Hadoop, Thomas explained.
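The "general-purpose machine learning engines" Thomas mentions package up routines like the one below at cluster scale. As a purely illustrative sketch in plain Python, here is the simplest such routine, fitting a line y = a*x + b by least squares; real engines distribute this kind of computation across many machines:

```python
# Minimal sketch of the kind of routine a general-purpose machine
# learning engine packages up: fitting y = a*x + b by least squares.
# Plain Python for illustration; production engines run this
# distributed over a cluster.

def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y over variance of x gives the slope.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

if __name__ == "__main__":
    xs = [1, 2, 3, 4]
    ys = [3, 5, 7, 9]          # exactly y = 2x + 1
    print(fit_line(xs, ys))    # (2.0, 1.0)
```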