LinkedIn Makes Its WhereHows Data Manager Open Source - InformationWeek

11:05 AM
LinkedIn couldn't find a data set tracker to match all the data sets it was creating, so it invented one, WhereHows. The social network then made it open source.

LinkedIn has become a repository of profiles on hundreds of millions of professionals, having captured a massive amount of information and comments. Its data engineers describe using a wide variety of data systems to capture the profiles, 14 million comments, and other essential data in 50,000 data sets.

The wide variety of structured and unstructured data, plus multiple types of data stores, led to problems in finding, retrieving, and reusing data, or even in knowing what data the company had.

"We created many types of data pipelines, transformations, and data sets. When it came time to use it, it was hard to find the right data set," recalls Eric Sun, a LinkedIn staff data engineer, who led the project to create a metadata master of all the data sets.

The WhereHows development team. From left: Jianyong Bai, Zhen Chen, Eric Sun, Zhaonan Sun
(Image: LinkedIn)

Sun's project became known as WhereHows, a name both descriptive of what it does for data sets and a phonetic play on the name for the traditional data warehouse.

After two years of internal development, LinkedIn made WhereHows available as open source code under an Apache license on March 3, said Sun. For organizations with big combinations of structured and unstructured data, it might be worth a look.

WhereHows consists of a central metadata repository that provides access to the metadata via a Web portal or a programmable API. It's tied to a backend server that periodically fetches metadata on the data being collected in LinkedIn's copies of Oracle and MySQL, five Hadoop clusters, a Teradata data warehouse, LinkedIn's own NoSQL system, Espresso, and its Pinot real-time, in-memory data system.
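The arrangement described here, a central repository fed by a backend that periodically pulls metadata from each source system, can be pictured with a short Python sketch. It is an illustration only, not WhereHows code: the `MetadataRepository` class, the `DatasetMetadata` fields, and the two fetcher functions are hypothetical names invented for this example, and an in-memory dictionary stands in for whatever store the real repository uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class DatasetMetadata:
    """Hypothetical metadata record; the field names are assumptions."""
    source: str      # e.g. "mysql", "hadoop", "teradata"
    name: str        # data set name within that source
    columns: tuple   # column or field names


# A fetcher is any callable that returns the current metadata for one source.
Fetcher = Callable[[], Iterable[DatasetMetadata]]


@dataclass
class MetadataRepository:
    """In-memory stand-in for a central metadata repository."""
    fetchers: Dict[str, Fetcher] = field(default_factory=dict)
    records: Dict[str, DatasetMetadata] = field(default_factory=dict)

    def register(self, source: str, fetcher: Fetcher) -> None:
        self.fetchers[source] = fetcher

    def refresh(self) -> int:
        """Run every fetcher once; a scheduler would call this periodically."""
        count = 0
        for fetcher in self.fetchers.values():
            for record in fetcher():
                self.records[f"{record.source}:{record.name}"] = record
                count += 1
        return count

    def search(self, keyword: str) -> List[str]:
        """Return keys of data sets whose name or columns mention the keyword."""
        keyword = keyword.lower()
        return sorted(
            key for key, rec in self.records.items()
            if keyword in rec.name.lower()
            or any(keyword in col.lower() for col in rec.columns)
        )


def fetch_mysql() -> List[DatasetMetadata]:
    # Placeholder data; a real fetcher would query the source's catalog.
    return [DatasetMetadata("mysql", "member_profile", ("member_id", "headline"))]


def fetch_hadoop() -> List[DatasetMetadata]:
    # Placeholder data standing in for a Hadoop cluster's data sets.
    return [DatasetMetadata("hadoop", "page_views", ("member_id", "url", "ts"))]


repo = MetadataRepository()
repo.register("mysql", fetch_mysql)
repo.register("hadoop", fetch_hadoop)
refreshed = repo.refresh()
matches = repo.search("member_id")
```

Keeping each source behind its own fetcher means a new system can be added by registering one more callable, which is one plausible way to cover a mix of stores as varied as the one LinkedIn describes.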

LinkedIn collects data on 400 million profile creators from 200 different countries and territories.

The site gets 100 million unique visitors a month, and it collects data on those visits. WhereHows has to track where the collected data is from, whether it's undergone a transformation, and whether parts of one data set have been added to another.

WhereHows, for example, can tell if a column from a relational database has been added to a data set in Hadoop. That helps LinkedIn data engineers track down data in one set that may be linked to that of another.
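A minimal sketch of this kind of column-level tracking, assuming lineage is stored as "derived from" edges between (data set, column) pairs. The `LineageGraph` class and the data set names are hypothetical and chosen only for illustration; the article does not describe how WhereHows stores lineage internally.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Column = Tuple[str, str]  # (data set, column name)


class LineageGraph:
    """Records which columns were derived from which other columns."""

    def __init__(self) -> None:
        self._upstream: Dict[Column, Set[Column]] = defaultdict(set)

    def record_copy(self, source: Column, target: Column) -> None:
        """Note that `target` was derived from `source` by some job."""
        self._upstream[target].add(source)

    def origins(self, column: Column) -> Set[Column]:
        """Walk upstream edges and return ancestors with no recorded parent."""
        roots: Set[Column] = set()
        seen: Set[Column] = set()
        stack = [column]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            parents = self._upstream.get(current, set())
            if not parents and current != column:
                roots.add(current)
            stack.extend(parents)
        return roots


lineage = LineageGraph()
# A relational column is copied into a Hadoop data set...
lineage.record_copy(("oracle.members", "member_id"),
                    ("hdfs.member_snapshot", "member_id"))
# ...and later reused, under a new name, in a derived report.
lineage.record_copy(("hdfs.member_snapshot", "member_id"),
                    ("hdfs.page_view_report", "viewer_id"))
roots = lineage.origins(("hdfs.page_view_report", "viewer_id"))
```

Here `roots` contains only the original relational column, which is the sort of answer an engineer would want when tracing a Hadoop field back to its source.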

"We can uniformly manage structured and semi-structured data," such as relational tables and schemas along with JSON and XML data, Sun said in an interview with InformationWeek.

Although WhereHows recognizes the schema associated with a data set, it will follow the movement of the data even when it no longer matches the schema to which it was first tied.

"We're trying not to be super-strict on schemas. We're trying to leave ourselves some wiggle room," to better follow the data through its various iterations, said Sun.
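One way to picture that wiggle room is a check that reports how an observed record differs from its registered schema instead of rejecting it. The sketch below is an assumption about the general idea, not a description of WhereHows itself; the `schema_drift` function and the field names are hypothetical.

```python
from typing import Any, Dict, List


def schema_drift(schema_fields: List[str], record: Dict[str, Any]) -> Dict[str, List[str]]:
    """Report fields missing from, or added to, a record relative to its schema."""
    expected = set(schema_fields)
    observed = set(record)
    return {
        "missing": sorted(expected - observed),
        "extra": sorted(observed - expected),
    }


registered = ["member_id", "headline", "country"]
observed_record = {"member_id": 42, "headline": "Data engineer", "industry": "Software"}

# The record is still tracked; the differences are simply noted.
drift = schema_drift(registered, observed_record)
```

Recording the drift rather than failing on it lets a tracker keep following a data set after its shape has changed.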

Through WhereHows, a data management team can "figure out when the data was transformed" as it moves through various stages from raw or "coarse" data first collected down through various stages of refinement and use.

WhereHows will interface with the commercial data integration system Informatica, the open source workflow schedulers Oozie and Azkaban, and UC4's and AppWorx's job schedulers. It will connect to any JDBC-based relational system, Sun noted.

When it comes to Hadoop operations, WhereHows will collect metadata on the operation of MapReduce, Pig, and Hive. It will also collect metadata from Apache Spark operations, or those of Cubert, the OLAP aggregation language.

It's an extremely useful tool for helping plan data migrations, Sun added. But it doesn't transform data or migrate data from one system to another itself. "It's just a chronicling and journaling system, observing and recording everything that's going on with the data," he said.

It'll also tell data engineers who the users of a data set are in an organization so that, if changes are underway, those users can be contacted to check whether they will be affected.

Sun said LinkedIn produced its own metadata capture tool when it couldn't find one on the market that spanned all its needs. The company will continue to make its updates available as open source code, and it welcomes contributions from a community of WhereHows users, if one develops, as LinkedIn anticipates.

WhereHows is downloadable from the LinkedIn GitHub server.

Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive ...
