What's a dream job for a data scientist? It just may be serving as the chief data scientist at a big technology vendor that doesn't care so much about selling analytics or big data solutions. Instead, it cares about empowering the analytics ecosystem and helping organizations apply big data analytics to any number of problems and challenges of business, science, and humanity.
That's where Intel's chief data scientist Bob Rogers is right now. He joined the chip giant in January and spends his time working both internally on Intel's own data science projects and externally as the "big data evangelist" for Intel, spreading the word about big data and helping organizations become successful with their analytics and data science projects.
"Intel sells chips. We don't sell services. We don't sell software. Our overall strategy is to empower the analytics ecosystem," he said. "I get to go around and help customers be successful without needing to sign on the dotted line or anything else. I help customers understand analytics and data problems and move the ball forward."
That's what he does externally. Internally, Rogers works on efforts to improve the IT help desk process by applying semantic engines to text, and on adding unstructured data to the engine that provides best practices to Intel resellers. He also serves as part of the Data Science Center of Excellence, a voluntary internal forum that meets weekly so that data scientists across Intel can help each other tackle big problems in their respective groups.
Rogers brings an eclectic range of training and experience to the role. He was trained as a physicist, and his PhD thesis was based on a computer simulation of what would happen to objects sucked into black holes. Over the course of his career, he has worked on problems including creating simulations to predict the stock market for hedge funds, improving glaucoma diagnostics, and improving doctors' access to all relevant patient information. It was in that last role, as cofounder of Apixio, that he started working with Intel.
Apixio applies big data to problems with electronic health records (EHRs).
"Because each EHR is a silo. What we discovered in our studies is that 65% of the information doctors should know about you is not in structured data," Rogers said. In the example of apparent heart failure, about 30% of cases are not actually heart failure. There are false negatives and false positives, and much of the information needed to sort them out is in email, not EHRs. That means it's unstructured data.
A Different Approach
Now as chief data scientist at Intel, Rogers has a vantage point to see a wide array of challenges and possibilities enabled by big data across many different types of organizations. What is the biggest mistake that organizations make when implementing big data projects?
"One of the biggest problems I see is that enterprises want to build a big data stack, shove all their data in, and hope that insights bubble to the surface," he said. "But at the end of the day that's a good way to end up with an expensive project that doesn't seem to show any value."
Instead, he advocates a more Agile approach.
"Start with a specific challenge and then build the minimum infrastructure needed," he said.
Intel itself has built up its own big data infrastructure internally alongside traditional business intelligence infrastructure, and those two systems talk to each other, according to Rogers.
In an initial project, Intel built a recommendation engine that taps into both structured and unstructured data to offer resellers insights about how to be more successful, Rogers said. The company built this big data stack on Hadoop, using Cloudera. The most important part of the project was adding the unstructured data.
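As a rough illustration of what combining the two kinds of data can look like, the sketch below derives a few simple columns from free-text notes and attaches them to structured records. It is not Intel's recommendation engine: the reseller records, the `TOPICS` list, and the `topic_flags` helper are all hypothetical, and real systems would use far richer text analysis than keyword matching.

```python
import re

# Hypothetical reseller records: one structured field plus a free-text note.
resellers = [
    {"reseller": "A", "units_sold": 120,
     "note": "Customers keep asking about training and support."},
    {"reseller": "B", "units_sold": 115,
     "note": "Pricing questions dominate most calls."},
]

# Hypothetical topics of interest; a real system would learn these from data.
TOPICS = ("training", "support", "pricing")


def topic_flags(note):
    """Turn a free-text note into one boolean column per topic."""
    words = set(re.findall(r"[a-z]+", note.lower()))
    return {f"mentions_{topic}": topic in words for topic in TOPICS}


# Merge the text-derived columns into each structured record.
enriched = [{**record, **topic_flags(record["note"])} for record in resellers]

for record in enriched:
    flagged = [key for key, value in record.items() if value is True]
    print(record["reseller"], record["units_sold"], flagged)
```

On the structured field alone, the two made-up resellers look nearly identical. The text-derived columns are what separate them.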
"The structured data is the same data you've been looking at for years," he said. "But if you add even a small amount of the unstructured data, you get a huge step forward in performing and creating value. That's one of the big areas that I see data science advancing in very rapidly."
One of the other keys to success is crafting the right question. Rogers recently sat on a panel with a handful of other high-level data scientists at New York University, addressing the school's data science graduate students. They wanted to know about skills that are important to be successful in the field.
"There are technical skills -- math and statistics and modeling and computer science," he said. "Then there is understanding the business needs." Realistically, there aren't any people who embody all the skills of the perfect data scientist, he said. That's why it's important to be able to work collaboratively and draw on the skills of a team.
"We are looking for people with a mix of skills. What's really important is the ability to handle ambiguity. … [Y]ou may not have an exact answer that is analytically measurable, and that can put people outside their comfort zone.
"Another aspect is creativity," Rogers said. That means being able to look at a problem from multiple directions. For instance, instead of lumping all car buyers into one group, if you break them apart according to demographics and think about them separately, you will learn much more.
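The car-buyer example can be sketched in a few lines of Python. The purchase records, age groups, and prices below are invented purely for illustration, and `average_price_by_segment` is a hypothetical helper, not something Rogers described.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical purchase records: (age_group, vehicle_price_in_usd).
purchases = [
    ("18-29", 18000),
    ("18-29", 22000),
    ("30-49", 35000),
    ("30-49", 41000),
    ("50+", 52000),
    ("50+", 48000),
]


def average_price_by_segment(records):
    """Return the mean price for each demographic segment."""
    prices = defaultdict(list)
    for segment, price in records:
        prices[segment].append(price)
    return {segment: mean(values) for segment, values in prices.items()}


# One number for everyone hides the differences between groups.
overall_average = mean(price for _, price in purchases)
segment_averages = average_price_by_segment(purchases)

print(f"All buyers: {overall_average:.0f}")
for segment, average in sorted(segment_averages.items()):
    print(f"{segment}: {average:.0f}")
```

In this toy data the single overall average describes none of the three groups well, which is the point of looking at each segment separately.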
"Those attributes are at least as important as the technical skills," Rogers said. "Data is messy."
The best tools and programming languages for fledgling data scientists to learn first are R and Python, according to Rogers. The most exciting developments among languages evolving for big data, he said, are functional languages like Scala, which let you write complex analytics that immediately scale across huge clusters.
Looking ahead to the future of big data and analytics, Rogers said he believes two of the most exciting areas offering big potential yet to be realized are the Internet of Things (IoT) and unstructured data.
On the IoT side, Rogers said he believes intelligence at the edge will have a great impact, for instance in wearables for healthcare applications and in sensors coupled with analytics for smart cities and transportation.
As for unstructured data, while we have gotten good at analyzing text, there is still much work to be done to understand images, video, and audio, Rogers said.
"Intel is working very hard in this area," Rogers said. "That's an exciting area of growth and advancement."