Big Data. Big Decisions
InformationWeek
Special Coverage Series


Web 2.0 Expo: LinkedIn's Big Data Lessons Learned

Former LinkedIn chief scientist DJ Patil shares advice on turning large-scale data into useful products.

DJ Patil doesn't boast of wrestling with Big Data. At LinkedIn, he saw his role as "making big data small"--and more important, making data of any size useful.

In a presentation on "Data Jujitsu" at Web 2.0 Expo, a UBM TechWeb/O'Reilly Media event in New York, and in a separate interview at the conference, Patil expounded on his vision of data science and the role of the data scientist. Patil is a former chief scientist, chief security officer, and head of the data and analytics teams at LinkedIn. He is currently data scientist in residence at Greylock Partners, a venture capital firm where LinkedIn founder Reid Hoffman is a partner.

Evangelizing the role of the data scientist is also the subject of "Building Data Science Teams," one of a series of chapter-length booklets he has published with O'Reilly Media (presumably leading up to a book), and of the O'Reilly Strata conference series.

While Wall Street quants and other math and data savants have been operating in the world of finance and supply chain optimization for some time, the generation of data professionals who grew up around Internet companies had a particular need to crunch large amounts of data, often unstructured or poorly structured data, and do it very cheaply, Patil said. That is why this feels like a new discipline. The label "data scientist" is something he cooked up in conversations with Jeff Hammerbacher, an early Facebook employee, at a time when they were both wrestling with many of the same problems and trying to recruit people with the same scarce skill sets. Hammerbacher is now chief scientist at Cloudera, an open source data analytics firm.

Even though Silicon Valley technology companies are intensely competitive, they can also be intensely cooperative where it makes sense, Patil said. "It's a little like the era of trading ships, where no matter how good a ship you have, the harbor has to work for everybody," he said. The harbor, in this case, was an assortment of emerging open source technologies like Hadoop and the techniques for working with them. It wasn't so much that no one had ever wrestled with data analysis problems on this scale before--certainly others had done so in the realms of high finance, or predicting the weather, or analyzing the behavior of subatomic particles in an atom smasher at CERN. The issue, Patil said, was that no one "had solved it in a way we could afford."

When it came to choosing a label for the people who would work this magic, "analyst" made the person sound like someone who was rating stocks, "engineer" didn't fit well with some of the professionals who came from more of an academic or scientific background, and "research scientist" sounded like someone playing with far-future experimental concepts rather than handling the day-to-day chores of crunching user profiles and social graphs.

"We came up with the term 'data scientist' literally to get HR off our backs," Patil said, because once they settled on a name for the role, it was easier to hire for it. More importantly, once data scientist was recognized as a job title within LinkedIn, it became possible to define data science as a distinct and respected product specialization within LinkedIn.

The products they created included the now familiar "people you may know" (PYMK) widget, which analyzes the other people in your network to suggest individuals with whom you are likely to share connections. Created by Jonathan Goldman, now director of analytics and applications at Aster Data, PYMK has since been imitated by other social networks, which have developed their own versions. Facebook even turns it around to allow your friends to suggest other people you ought to know, even if you don't know them yet.
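To make the idea concrete, here is a minimal Python sketch of one simple way such a suggestion could be computed: counting shared connections among friends of friends. This is an illustration only, not LinkedIn's actual algorithm, which the article does not describe. The function name `people_you_may_know`, the dictionary-of-sets graph, and the sample members are all assumptions made up for the example.

```python
from collections import Counter


def people_you_may_know(graph, member, limit=5):
    """Rank people `member` is not yet connected to by shared connections.

    `graph` is a hypothetical adjacency map: name -> set of connected names.
    """
    direct = graph.get(member, set())
    shared = Counter()
    for friend in direct:
        for candidate in graph.get(friend, set()):
            # Skip the member themselves and anyone already connected.
            if candidate != member and candidate not in direct:
                shared[candidate] += 1
    # Most shared connections first; break ties alphabetically.
    ranked = sorted(shared.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]


# Made-up sample network for illustration.
graph = {
    "ann": {"bob", "cara"},
    "bob": {"ann", "dev", "eli"},
    "cara": {"ann", "dev"},
    "dev": {"bob", "cara"},
    "eli": {"bob"},
}

print(people_you_may_know(graph, "ann"))  # [('dev', 2), ('eli', 1)]
```

In this toy network, "dev" ranks first because two of ann's connections already know dev. A production system would presumably weigh many more signals than a raw count.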

Yet at the beginning, most people at LinkedIn thought PYMK was a stupid idea. Why would it be needed when LinkedIn already had an address book importer to pull in a member's connections based on their email contacts? It was only after PYMK was exposed to LinkedIn's membership in a small trial--that got a big response--that management recognized the potential of it, Patil said.

That is what he means by data jujitsu, where jujitsu is the art of using an opponent's leverage and momentum against him. In data jujitsu, you try to use the scope of the problem to create the solution--without investing disproportionate resources at the early experimental stage. That's as opposed to data karate, which would be a direct frontal assault to hack your way through the problem.

"We're trying to flip it in a clever way where we're putting it out there for people to experience," he said.

In another case, Patil's team was working on an early prototype of a system to present recruiter recommendations mined from LinkedIn profiles whenever a new job was posted on the service. The first hint he had that the product had real potential was when one of the salespeople came to complain that the service was down--not surprising, given that the product was still running on the developer's laptop at that stage. Once LinkedIn realized it had a potential new product, it progressed to offering an email that would go out to job posters including a roundup of potential candidates--clearly advertised as an experimental program, with an invitation for customers to tell LinkedIn if the service was useful. Before long, the service rose in importance to where the product team wanted companies posting jobs to see the suggested candidates immediately after posting a listing, which meant investing much greater engineering resources.

By ratcheting up attention on a product in this way, LinkedIn was able to test each stage to ensure that it was worth the next level of investment, based on feedback from real users, Patil said.

One of the major challenges of working with Big Data and sophisticated analytics is finding the right way to display it all--or maybe just understanding that you don't have to display all of it, just what will make sense to the user.

Patil warned of the dangers of "data vomit"--a term interaction designer Hannah Donovan also used in her Web 2.0 Expo workshop on the design issues of data-rich websites--where the user interface presents users with an overwhelming series of choices.

Patil said LinkedIn made this error with an early version of the user interface that allowed users to see who had viewed their profile recently. He showed a screenshot of the original user interface, featuring a full business intelligence-style dashboard of charts and graphs, with all sorts of options to drill down through the results for more detail.

"This much data on the page has the effect of paralyzing the user," Patil said. Much better to present an essential subset of data and let users request more if they want more, he said.

Similarly, data scientists risk using their predictive analytics skills in a way that annoys users, where a recommendation engine gets the wrong idea about their tastes and refuses to admit it is wrong. He cited Pandora as an example of a consumer application doing a better job of making recommendations, but then backing off politely when the user turns thumbs down on a suggested song. "It's not the data overlord telling you how it should be," he said.
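The "backing off politely" behavior can be sketched in a few lines of Python. This is not Pandora's method, which the article does not detail; it is a hypothetical illustration in which a thumbs-down removes the rejected song and discounts other songs in the same genre. The function name `next_suggestions`, the `penalty` parameter, and the sample songs and scores are all assumptions.

```python
def next_suggestions(scores, genres, thumbs_down, penalty=0.5, limit=3):
    """Re-rank songs after negative feedback.

    scores: hypothetical song -> predicted-preference score
    genres: hypothetical song -> genre label
    thumbs_down: set of songs the user has rejected
    """
    rejected_genres = {genres[song] for song in thumbs_down}
    adjusted = {}
    for song, score in scores.items():
        if song in thumbs_down:
            continue  # never repeat a rejected song
        if genres[song] in rejected_genres:
            score *= penalty  # back off from similar material
        adjusted[song] = score
    # Highest adjusted score first; break ties alphabetically.
    return sorted(adjusted, key=lambda s: (-adjusted[s], s))[:limit]


# Made-up sample data for illustration.
scores = {"polka_hit": 0.9, "polka_b_side": 0.8, "jazz_tune": 0.7, "rock_song": 0.6}
genres = {"polka_hit": "polka", "polka_b_side": "polka",
          "jazz_tune": "jazz", "rock_song": "rock"}

print(next_suggestions(scores, genres, thumbs_down={"polka_hit"}))
# ['jazz_tune', 'rock_song', 'polka_b_side']
```

The design point is that the user's feedback overrides the model's prior belief: the second polka track was the engine's next-best guess, but after one thumbs-down it drops to the bottom of the list instead of being pushed again.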

LinkedIn faced some distinct challenges when trying to design its job opening recommendations, Patil said. A typical predictive engine for presenting advertisements to a user might establish its best guess of the individual's income and buying power and throw in a few recommendations on the high and low sides of that estimate. Yet when recommending to someone what their next job should be, you'll likely get a very negative response if you recommend a position that's at the same level or lower, when they want to see themselves climbing the ladder of achievement.
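The constraint described above, that job suggestions should point up the ladder rather than sideways or down, can be expressed as a simple filter applied before ranking. The Python sketch below is a hypothetical illustration, not LinkedIn's implementation: the `Job` class, the integer `level` ranks, the `score` values, and the function `recommend_jobs` are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    title: str
    level: int    # hypothetical seniority rank; higher means more senior
    score: float  # hypothetical relevance score from some upstream model


def recommend_jobs(current_level, candidates, limit=3):
    """Keep only roles above the member's current level, then rank by score."""
    step_up = [job for job in candidates if job.level > current_level]
    return sorted(step_up, key=lambda job: job.score, reverse=True)[:limit]


# Made-up candidate list for illustration.
candidates = [
    Job("Analyst", 1, 0.95),
    Job("Senior Analyst", 2, 0.90),
    Job("Analytics Manager", 3, 0.80),
    Job("Director of Analytics", 4, 0.60),
]

picks = recommend_jobs(current_level=2, candidates=candidates)
print([job.title for job in picks])
# ['Analytics Manager', 'Director of Analytics']
```

Note that the two highest-scoring roles are dropped for a member already at level 2, even though a pure relevance ranking would have put them first.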

"I guarantee you, if you make 10 recommendations to that person, and one of them is off, they're going to think it's a terrible, terrible product," Patil said. The product has to take into account not only the data and the analysis, but how users will react to that analysis.

These remain hard problems, with years' worth of challenges ahead for those with the mettle to take them on.







