When eBay needed to get more value out of masses of Hadoop data, the company created its own open source tool called Kylin. Now Kylin is an Apache Software Foundation project.
Inside the eBay operations "war room" last December, data analysts and data scientists had one big question on their minds as traffic approached its holiday crescendo: What was the hottest selling item among the 800 million available on the eBay website?
The answer wasn't one that many of them had expected.
"We found that every 12 seconds, we were selling a hoverboard," recalls Debashis Saha, vice president of Commerce Platform and Infrastructure. "It was our hottest-selling item" and one that previously hadn't even shown up on eBay's radar.
With that information in hand, eBay executives could contact suppliers and manufacturers of hoverboards, alert them to the unexpectedly high demand, and urge them to keep their manufacturing going and inventories stocked. It was a way of keeping customers satisfied and safeguarding eBay's own business, one made possible through a fast data analysis system called Kylin.
(Image: Nancy Nehring/iStockphoto)
Kylin is open source code that began as a project inside eBay as it cast about for a tool that could help it make sense of all the data flowing into eBay's implementations of Hadoop.
By 2012 and 2013, there were already plenty of Hadoop front-end tools enhancing its basic distributed file system and MapReduce functionality.
However, eBay needed to be able to look at 10 billion rows of data from multiple angles, and do it quickly. In addition to its big data scientists, who tolerated Hadoop's limitations, it had a staff of data analysts accustomed to working with the precision of ANSI-standard SQL queries. They were frustrated by the tools then available.
Apache Hive was an existing data warehouse system that worked with Hadoop. While it had SQL capabilities, it hadn't achieved the status of ANSI-standard operations at the time eBay needed them.
Sorting Through Data
"We had started to create a data ocean on Hadoop, but we weren't getting value out of it," recalled Saha in an interview with InformationWeek. Data analysts were exporting data out of Hadoop into OLAP and other SQL query-based systems, so they could find what they wanted, but that added steps to a process that needed to occur faster.
"We needed near real-time decisions on these extremely large data sets. Without them, we couldn’t respond fast enough," recalled Saha.
Furthermore, Saha was troubled by a growing gap between the data analysts who preferred to work with SQL and the data scientists accustomed to Hadoop limitations.
A small team of developers within his group set about addressing the problem in late 2013. By October 2014, they were far enough along with the SQL-standard, Hadoop-compatible Kylin project to propose it as an Apache Software Foundation project. A little over a year later, it was out of incubation and a fully fledged, top-level project with 32 core developers.
Ten of them are eBay employees.
Kylin leverages Hadoop's ability to scale out to thousands of nodes on a server cluster and make use of the distributed processing enabled by MapReduce. At the same time, it can field SQL queries from a data visualization system like Tableau and return ANSI-standard results.
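To make the analysts' side of this concrete, here is a minimal sketch of the kind of aggregate SQL question described in the article, such as finding the best-selling item. It is an illustration only: Python's built-in sqlite3 module stands in for a SQL endpoint, and the `sales` table, its columns, and the sample rows are hypothetical, not eBay's schema or Kylin's interface.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for a SQL endpoint.
# This is not Kylin; it only shows the shape of the analyst's question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, region TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("hoverboard", "West", 120),  # hypothetical sample rows
        ("hoverboard", "East", 95),
        ("headphones", "West", 80),
        ("drone", "East", 60),
    ],
)

# Total units per item, largest first; the first row is the top seller.
top_item = conn.execute(
    """
    SELECT item, SUM(units) AS total_units
    FROM sales
    GROUP BY item
    ORDER BY total_units DESC
    """
).fetchone()

print(top_item)
```

On a few rows this is instant. The difficulty the article describes is getting the same kind of answer quickly when the table holds billions of rows.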
OLAP (online analytical processing) technology is not new. Building data cubes that can be viewed from a variety of angles was a well-established practice before Hadoop was invented. But Kylin enabled cube-building on a massive scale. Before the views can be achieved, hundreds of billions of rows in Hadoop must be indexed. Kylin’s ability to build "smart indexes" on that scale is one of the things that sets it apart, said Saha.
With the indexes already built, Kylin users can then achieve faster views and more useful results from large amounts of Hadoop data. "You can take a more granular level of the data and find results that satisfy these (specific) criteria," he said.
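The general OLAP idea behind this, pre-aggregating data along every combination of dimensions so that a query becomes a lookup, can be sketched in a few lines. This is a toy illustration of cube building under stated assumptions, not Kylin's implementation: the dimension names, the sample rows, and the `build_cube` and `query` helpers are all hypothetical, and everything runs in memory rather than on a Hadoop cluster.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical dimensions for a tiny sales data set.
DIMENSIONS = ("item", "region", "day")


def build_cube(rows):
    """Pre-aggregate units for every subset of DIMENSIONS (one cuboid each)."""
    cube = {}
    for size in range(len(DIMENSIONS) + 1):
        for dims in combinations(DIMENSIONS, size):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                totals[key] += row["units"]
            cube[dims] = dict(totals)
    return cube


def query(cube, **filters):
    """Answer a sum-of-units question with one lookup in the matching cuboid."""
    dims = tuple(d for d in DIMENSIONS if d in filters)
    key = tuple(filters[d] for d in dims)
    return cube[dims].get(key, 0)


rows = [
    {"item": "hoverboard", "region": "West", "day": "Mon", "units": 5},
    {"item": "hoverboard", "region": "East", "day": "Mon", "units": 3},
    {"item": "jersey", "region": "West", "day": "Tue", "units": 4},
    {"item": "hoverboard", "region": "West", "day": "Tue", "units": 2},
]

cube = build_cube(rows)
hoverboards = query(cube, item="hoverboard")
hoverboards_west = query(cube, item="hoverboard", region="West")
everything = query(cube)
print(hoverboards, hoverboards_west, everything)
```

The trade-off is visible even at this scale: three dimensions produce eight cuboids, and the count doubles with each added dimension. The expensive scan of the raw rows is paid once at build time, and each later query only reads a precomputed total.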
Among other things, eBay data researchers wanted to know which team's paraphernalia was selling best in the run-up to the Super Bowl.
Carolina Panthers gear was selling extremely well in their home region, but the Denver Broncos, and Peyton Manning in particular, had a broader appeal across much of the country. That information could guide eBay operations in making sure the right resources were behind the right memorabilia vendors.
A query handled by Kylin can obtain sub-second results from a data cube representing 10 billion rows, yielding information that's timely in terms of Super Bowl sales, Saha said. It completes 90% of its queries in five seconds or less, according to a Dec. 8 eBay blog post.
Kylin isn't the only tool invented at eBay to work with Hadoop.
Its developer teams have also produced Eagle, a data monitoring tool that quickly detects unauthorized access to sensitive data or malicious activity connected to data, as well as Pulsar, a data visualization and reporting framework. Both are also open source code.
However, Kylin has won the widest following. It's now used by many other companies, including Baidu, Expedia, JD.com, vip.com, and China Mobile.
"In eBay, we collect every user behavior on any eBay screen. While other OLAP engine struggles with the data volume, Kylin enables milliseconds response," Wilson Pang, eBay's senior director of behavior insights, wrote in the December blog.
"All together, Kylin serves as a critical backend component for eBay's product analytics platform... It's the best OLAP engine on big data so far," Pang wrote.
Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive ...