Former LinkedIn chief scientist DJ Patil shares advice on turning large-scale data into useful products.
DJ Patil doesn't boast of wrestling with Big Data. At LinkedIn, he saw his role as "making big data small"--and more important, making data of any size useful.
In a presentation on "Data Jujitsu" at Web 2.0 Expo, a UBM TechWeb/O'Reilly Media event in New York, and a separate interview at the conference, Patil expounded on his vision of data science and the role of the data scientist. DJ Patil is a former chief scientist, chief security officer, and head of the data and analytics teams at LinkedIn. He is currently serving as data scientist in residence at Greylock Partners, a venture capital firm where LinkedIn founder Reid Hoffman is a partner.
Evangelizing the role of the data scientist is also the subject of "Building Data Science Teams," one of a series of chapter-length booklets he has published with O'Reilly Media (presumably leading up to a book), and of the O'Reilly Strata conference series.
While Wall Street quants and other math and data savants have been operating in the world of finance and supply chain optimization for some time, the generation of data professionals who grew up around Internet companies had a particular need to crunch large amounts of data, often unstructured or poorly structured data, and do it very cheaply, Patil said. That is why this feels like a new discipline. The label "data scientist" is something he cooked up in conversations with Jeff Hammerbacher, an early Facebook employee, at a time when they were both wrestling with many of the same problems and trying to recruit people with the same scarce skill sets. Hammerbacher is now chief scientist at Cloudera, an open source data analytics firm.
Even though Silicon Valley technology companies are intensely competitive, they can also be intensely cooperative where it makes sense, Patil said. "It's a little like the era of trading ships, where no matter how good a ship you have, the harbor has to work for everybody," he said. The harbor, in this case, was an assortment of emerging open source technologies like Hadoop and the techniques for working with them. It wasn't so much that no one had ever wrestled with data analysis problems on this scale before--certainly others had done so in the realms of high finance, or predicting the weather, or analyzing the behavior of subatomic particles in an atom smasher at CERN. The issue, Patil said, was that no one "had solved it in a way we could afford."
In choosing a label for the people who would work this magic, "analyst" made the person sound like someone who was rating stocks, while "engineer" didn't fit well with some of the professionals who came from more of an academic or scientific background, and "research scientist" sounded like someone playing with far-future experimental concepts rather than handling the day-to-day chores of crunching user profiles and social graphs.
"We came up with the term 'data scientist' literally to get HR off our backs," Patil said, because once they settled on a name for the role, it was easier to hire for it. More importantly, once data scientist was recognized as a job title within LinkedIn, it became possible to define data science as a distinct and respected product specialization at the company.
The products they created included the now familiar "people you may know" widget, which uses an analysis of the other people in your network to suggest other individuals who might be mutual connections. PYMK was created by Jonathan Goldman, now director of analytics and applications at Aster Data, and other social networks have since developed their own versions of it. Facebook even turns it around to allow your friends to suggest other people you ought to know, even if you don't know them yet.
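To make the idea concrete, here is a minimal sketch of a mutual-connection suggester, written in Python. It is not LinkedIn's implementation, which the article does not describe; the function name `people_you_may_know`, the toy `graph`, and the ranking rule (count shared connections, break ties by name) are all assumptions chosen for illustration.

```python
from collections import Counter

def people_you_may_know(graph, member, limit=3):
    """Rank people the member is not yet connected to by shared connections.

    graph is a hypothetical adjacency map: member name -> set of connections.
    """
    direct = graph.get(member, set())
    mutual_counts = Counter()
    for connection in direct:
        for candidate in graph.get(connection, set()):
            # Skip the member and anyone already connected to them.
            if candidate != member and candidate not in direct:
                mutual_counts[candidate] += 1
    # Most mutual connections first; ties broken alphabetically.
    ranked = sorted(mutual_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]

# Toy network, invented for this example.
graph = {
    "ana": {"bo", "cy"},
    "bo": {"ana", "dee", "eli"},
    "cy": {"ana", "dee"},
    "dee": {"bo", "cy"},
    "eli": {"bo"},
}

print(people_you_may_know(graph, "ana"))
```

In this toy network, "dee" ranks first for "ana" because the two share two connections, while "eli" shares only one. A production system would weigh many more signals than a raw count.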
Yet at the beginning, most people at LinkedIn thought PYMK was a stupid idea. Why would it be needed when LinkedIn already had an address book importer to pull in a member's connections based on their email contacts? It was only after PYMK was exposed to LinkedIn's membership in a small trial, which got a big response, that management recognized its potential, Patil said.
That is what he means by data jujitsu, where jujitsu is the art of using an opponent's leverage and momentum against him. In data jujitsu, you try to use the scope of the problem to create the solution--without investing disproportionate resources at the early experimental stage. That's as opposed to data karate, which would be a direct frontal assault to hack your way through the problem.
"We're trying to flip it in a clever way where we're putting it out there for people to experience," he said.
In another case, Patil's team was working on an early prototype of a system to present recruiter recommendations mined from LinkedIn profiles whenever a new job was posted on the service. The first hint he had that the product had real potential was when one of the salespeople came to complain that the service was down--not surprising, given that the product was still running on the developer's laptop at that stage. Once LinkedIn realized it had a potential new product, it progressed to offering an email that would go out to job posters including a roundup of potential candidates--clearly advertised as an experimental program, with an invitation for customers to tell LinkedIn if the service was useful. Before long, the service rose in importance to where the product team wanted companies posting jobs to see the suggested candidates immediately after posting a listing, which meant investing much greater engineering resources.
By ratcheting up attention on a product in this way, LinkedIn was able to test each stage to ensure that it was worth the next level of investment, based on feedback from real users, Patil said.
One of the major challenges of working with Big Data and sophisticated analytics is finding the right way to display it all--or maybe just understanding that you don't have to display all of it, just what will make sense to the user.
Patil warned of the dangers of "data vomit"--a term interaction designer Hannah Donovan also used in her Web 2.0 Expo workshop on the design issues of data rich websites--where the user interface presents the users with an overwhelming series of choices.
Patil said LinkedIn made this error with an early version of the user interface that allowed users to see who had viewed their profile recently. He showed a screenshot of the original user interface, featuring a full business intelligence-style dashboard of charts and graphs, with all sorts of options to drill down through the results for more detail.
"This much data on the page has the effect of paralyzing the user," Patil said. Much better to present an essential subset of data and let users request more if they want more, he said.
Similarly, data scientists risk using their predictive analytics skills in a way that annoys users, where a recommendation engine gets the wrong idea about their tastes and refuses to admit it is wrong. He cited Pandora as an example of a consumer application doing a better job of making recommendations, but then backing off politely when the user turns thumbs down on a suggested song. "It's not the data overlord telling you how it should be," he said.
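One way to picture a recommender that "backs off politely" is a ranking step that drops anything the user has rejected and demotes similar items. The Python sketch below is an illustration under stated assumptions, not Pandora's or LinkedIn's method: the function `recommend`, the `penalty` factor, the single-tag notion of similarity, and the sample songs are all hypothetical.

```python
def recommend(scores, thumbs_down, tags, penalty=0.5, limit=3):
    """Return top items, honoring the user's negative feedback.

    scores: item -> predicted appeal (hypothetical model output)
    thumbs_down: set of items the user has rejected
    tags: item -> a single descriptive tag, used as a crude similarity signal
    """
    disliked_tags = {tags[item] for item in thumbs_down}
    adjusted = {}
    for item, score in scores.items():
        if item in thumbs_down:
            continue  # never show a rejected item again
        if tags[item] in disliked_tags:
            score *= penalty  # demote items that resemble a rejected one
        adjusted[item] = score
    # Highest adjusted score first; ties broken alphabetically.
    return sorted(adjusted, key=lambda item: (-adjusted[item], item))[:limit]

# Sample data, invented for this example.
scores = {"song_a": 0.9, "song_b": 0.8, "song_c": 0.7, "song_d": 0.6}
tags = {"song_a": "jazz", "song_b": "jazz", "song_c": "folk", "song_d": "rock"}

print(recommend(scores, set(), tags))        # before any feedback
print(recommend(scores, {"song_a"}, tags))   # after a thumbs down on song_a
```

After the thumbs down, the rejected song disappears and the other jazz track falls behind the folk and rock suggestions, so the system concedes the point rather than insisting on its original guess.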
LinkedIn faced some distinct challenges when trying to design its job opening recommendations, Patil said. A typical predictive engine for presenting advertisements to a user might establish its best guess of the individual's income and buying power and throw in a few recommendations on the high and low sides of that estimate. Yet when recommending to someone what their next job should be, you'll likely get a very negative response if you recommend a position that's at the same level or lower, when they want to see themselves climbing the ladder of achievement.
"I guarantee you, if you make 10 recommendations to that person, and one of them is off, they're going to think it's a terrible, terrible product," Patil said. The product has to take into account not only the data and the analysis, but how users will react to that analysis.
These remain hard problems, with years' worth of challenges ahead for those with the mettle to take them on.