2/12/2016
12:06 PM

How eBay's Kylin Tool Makes Sense Of Big Data

When eBay needed to gain more value out of masses of Hadoop data, the company created its own open source tool called Kylin. Now Kylin is an Apache Software Foundation project.

Inside the eBay operations "war room" last December, data analysts and data scientists had one big question on their minds as traffic approached its holiday crescendo: What was the hottest-selling item among the 800 million available on the eBay website?

The answer wasn't one that many of them had expected.

"We found that every 12 seconds, we were selling a hoverboard," recalls Debashis Saha, vice president of Commerce Platform and Infrastructure. "It was our hottest-selling item" and one that previously hadn't even shown up on eBay's radar.

With that information in hand, eBay executives could contact suppliers and manufacturers of hoverboards, alert them to the unexpectedly high demand, and urge them to keep their manufacturing going and inventories stocked. It was a way of keeping customers satisfied and safeguarding eBay's own business, one made possible through a fast data analysis system called Kylin.

(Image: Nancy Nehring/iStockphoto)


Kylin is open source software that began as an internal eBay project, as the company cast about for a tool that could help it make sense of all the data flowing into its Hadoop implementations.

By 2012 and 2013, there were already plenty of Hadoop front-end tools enhancing its basic distributed file system and MapReduce functionality.

However, eBay needed to be able to look at data in 10 billion rows from multiple angles, and do it quickly. In addition to its Hadoop-tolerant big data scientists, it had a staff of data analysts accustomed to working with the precision of ANSI-standard SQL queries. They were frustrated by the tools then available.

Apache Hive was an existing data warehouse system that worked with Hadoop. While it had SQL capabilities, it did not yet support ANSI-standard SQL at the time eBay needed it.

Sorting Through Data

"We had started to create a data ocean on Hadoop, but we weren't getting value out of it," recalled Saha in an interview with InformationWeek. Data analysts were exporting data out of Hadoop into OLAP and other SQL query-based systems, so they could find what they wanted, but that added steps to a process that needed to occur faster.

"We needed near real-time decisions on these extremely large data sets. Without them, we couldn’t respond fast enough," recalled Saha.

Furthermore, Saha was troubled by a growing gap between the data analysts who preferred to work with SQL and the data scientists accustomed to Hadoop's limitations.

A small team of developers within his organization set about addressing the problem in late 2013. By October 2014, they were far enough along with the SQL-standard, Hadoop-compatible Kylin project to propose it as an Apache Software Foundation project. A little over a year later, it was out of incubation and a fully fledged, top-level project with 32 core developers.

Ten of them are eBay employees.

Kylin leverages Hadoop's ability to scale out to thousands of nodes on a server cluster and make use of the distributed processing enabled by MapReduce. At the same time, it can field SQL queries from a data visualization system like Tableau and return ANSI-standard results.
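To make the analyst's side of that concrete, here is a minimal sketch of the kind of standard SQL aggregate query a tool such as Tableau might issue. It uses Python's built-in sqlite3 module purely as a stand-in for a SQL endpoint; it is not Kylin, and the table name, column names, and rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a SQL endpoint (not Kylin).
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, invented for this sketch only.
conn.execute("CREATE TABLE items_sold (category TEXT, region TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO items_sold VALUES (?, ?, ?)",
    [
        ("hoverboard", "west", 5),
        ("hoverboard", "east", 4),
        ("drone", "west", 3),
        ("drone", "east", 1),
        ("hoverboard", "west", 2),
    ],
)

# A plain GROUP BY / ORDER BY query: total units per category, best seller first.
best_sellers = conn.execute(
    """
    SELECT category, SUM(units) AS total_units
    FROM items_sold
    GROUP BY category
    ORDER BY total_units DESC
    """
).fetchall()

for category, total_units in best_sellers:
    print(category, total_units)
```

The point of the sketch is only that the analyst writes ordinary SQL; what differs in a system like Kylin is how the answer is produced behind that interface.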

OLAP (online analytical processing) technology is not new. Building data cubes that can be viewed from a variety of angles was a well-established practice before Hadoop was invented. But Kylin enabled cube-building on a massive scale. Before the views can be achieved, hundreds of billions of rows in Hadoop must be indexed. Kylin’s ability to build "smart indexes" on that scale is one of the things that sets it apart, said Saha.
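The general idea behind cube-building can be sketched in a few lines of Python. This is a toy illustration of OLAP pre-aggregation under stated assumptions, not Kylin's implementation: the dimension names, the sample rows, and the helper functions `build_cube` and `query` are all hypothetical, and a real system would distribute this work across a cluster rather than loop in memory.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (region, category, day, units_sold).
ROWS = [
    ("west", "hoverboard", "mon", 3),
    ("west", "hoverboard", "tue", 2),
    ("east", "drone", "mon", 5),
    ("east", "hoverboard", "mon", 1),
    ("west", "drone", "tue", 1),
]
DIMENSIONS = ("region", "category", "day")


def build_cube(rows, dimensions):
    """Pre-aggregate units sold for every subset of the dimensions.

    Each subset of dimensions is one "cuboid": a table of totals keyed by
    the values of just those dimensions.
    """
    cube = {}
    positions = range(len(dimensions))
    for size in range(len(dimensions) + 1):
        for subset in combinations(positions, size):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in subset)
                totals[key] += row[-1]  # last field is units_sold
            cube[tuple(dimensions[i] for i in subset)] = dict(totals)
    return cube


def query(cube, **filters):
    """Answer a total-units question by lookup, without rescanning the rows."""
    dims = tuple(d for d in DIMENSIONS if d in filters)
    key = tuple(filters[d] for d in dims)
    return cube[dims].get(key, 0)


cube = build_cube(ROWS, DIMENSIONS)
print(query(cube))                                       # total across all rows
print(query(cube, category="hoverboard"))                # one angle
print(query(cube, region="west", category="hoverboard")) # a more granular angle
```

The trade-off this illustrates is the usual one for pre-computed cubes: the build step touches every row and grows with the number of dimension combinations, but each later question becomes a dictionary lookup instead of a full scan.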

Debashis Saha, vice president of commerce platform and infrastructure at eBay.


With the indexes already built, Kylin users can then achieve faster views and more useful results from large amounts of Hadoop data. "You can take a more granular level of the data and find results that satisfy these (specific) criteria," he said.

Broncos Win

Among other things, in the run-up to the Super Bowl, eBay data researchers wanted to know which team's paraphernalia was selling best.

Carolina Panthers gear was selling extremely well in their home region, but the Denver Broncos, and Peyton Manning in particular, had a broader appeal across much of the country. That information could guide eBay operations in making sure the right resources were behind the right memorabilia vendors.

A query handled by Kylin can obtain sub-second results from a data cube representing 10 billion rows, yielding information that's timely in terms of Super Bowl sales, Saha said. It completes 90% of its queries in five seconds or less, according to a Dec. 8 eBay blog post.

Kylin isn't the only tool invented at eBay to work with Hadoop.

Its developer teams have also produced Eagle, a data monitoring tool that quickly detects unauthorized access to sensitive data or malicious activity connected to data, as well as Pulsar, a data visualization and reporting framework. Both are also open source.

However, Kylin has won the widest following. It's now used by many other companies, including Baidu, Expedia, JD.com, vip.com, and China Mobile.

"In eBay, we collect every user behavior on any eBay screen. While other OLAP engine struggles with the data volume, Kylin enables milliseconds response," Wilson Pang, eBay's senior director of behavior insights, wrote in the December blog.

"All together, Kylin serves as a critical backend component for eBay's product analytics platform... It's the best OLAP engine on big data so far," Pang wrote.

Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive ...

Comments
danielcawrey
User Rank: Ninja
2/15/2016 | 12:15:18 PM
Re: Kylin components use other parts of Hadoop project
What I really like about big data is the amazing insight that one can cull from data sources. We have the technology these days, it's just companies like eBay have to find tools to process them and gather insight. Hadoop has had a big part in that over the past few years. 
Brian.Dean
User Rank: Ninja
2/15/2016 | 7:58:25 AM
Re: Kylin components use other parts of Hadoop project
Kylin can inform manufacturers about the quantity to produce and the timing to produce. If it has not already been implemented, I have a feeling that Kylin might also be utilized in the future to inform investors about the companies that would be a good investment -- from algorithmic trading to real-time algorithmic investment.
Li Tan
User Rank: Ninja
2/14/2016 | 1:59:37 AM
Re: Kylin components use other parts of Hadoop project
It's a trend for big internet giants to have self-cooked big data tools. The data is a gold mine and it depends how much value you can dig out from it.
Charlie Babcock
User Rank: Author
2/12/2016 | 4:32:56 PM
Kylin components use other parts of Hadoop project
Kylin can read data from Hive, run sorting and pre-calculations against the data via MapReduce and store data as cubes in HBase, using Zookeeper to coordinate jobs, according to some of the project's documentation. It has a Metadata Manager component, a REST Server, an ODBC Driver, a Query Engine and a Storage Engine.

 