Online consumer coupon and incentives company Ebates was founded in 1998 on the idea of providing consumers a way to earn cash back on online purchases. Today the company lets consumers earn cash back from purchases with over 2,000 online retailers, and it earns a commission when shopping is initiated through an affiliate link. It's a marketer's dream. Just imagine all the insights and opportunity made possible by analyzing all that data.
Yet just about four years ago, the company found itself struggling under the load. It had recently been acquired by Rakuten for $1 billion, but Ebates was still operating an on-premises data warehouse and ETL on a single SQL Server machine. Storage was at about 100 terabytes and the company was processing a couple of terabytes a day, according to Ebates VP of Analytics Mark Stange-Tregear, who spoke with InformationWeek in an interview.
"Things were collapsing all the time," he said. Stange-Tregear joined the company a few months before it was acquired, and he was charged with the mission of getting the company's data and analytics operations back on track.
The question was how. The company initially considered getting a bigger machine.
"But there wasn't a bigger machine," Stange-Tregear said.
Hadoop POC fit the budget
After looking at the few potential solutions on the market at the time, Ebates chose to run its proof of concept on Hadoop, using a cluster of "bargain-basement second-hand servers. It wasn't pretty," Stange-Tregear said. But within just a couple of months, the Hadoop POC was beating the SQL Server ETLs. That was enough to get executive buy-in on migrating the company's data operations to a production Hadoop cluster.
To move it to production, Ebates invested in 16 Dell machines to create the cluster and built out ETL and reporting on this new system.
"We learned a lot about the technology," Stange-Tregear said. "There was a pretty big learning curve for us there because everyone had come out of a traditional relational database background."
The whole process, from executive buy-in in early 2015, to most of the migration by the end of 2015, to decommissioning the SQL Server system by mid-2016, took roughly a year and a half. By the time it was over, Stange-Tregear's team realized they'd gotten more than they'd bargained for. But not in a bad way.
Working with non-structured data
"The initial concept was really just taking the relational database data sets and moving to a cluster so we could have additional power," he said. "But we realized that this really opened up the capability for us to do non-structured data in a much bigger way. We started ingesting click-stream data, behavioral data, user event data, even a little bit of sentiment data, and working that along with the core transactional data sets all on the same platform."
The current stack
As the company has grown, Ebates has moved production again, this time to a 40-node cluster of Cisco machines, with a replica cluster on the Dell hardware, which has been expanded to 25 nodes. All of this operates on-premises. A cost/benefit evaluation found on-premises to be more cost effective than the cloud for Ebates' workloads when accounting for hardware, labor, and electricity costs.

The stack includes Tableau as the enterprise business intelligence tool and AtScale as the data modeling and optimization tool. Other technologies include Python, Scala, R, Hive, Impala, and Spark, with Kafka and Flume handling data flow in and out. The data engineering team that builds the platform components consists of 15 people, and the analytics team of power users (who also write code) consists of 16 people. About one-third of the company's 250 total users access the system for insights.
A data hub, not a warehouse
Stange-Tregear said that he no longer considers the setup to be a data warehouse. Instead, it acts more like a data hub. Ebates uses the system to feed data to all kinds of applications both inside and outside of the company. It also powers most of the company's email program, effectively acts as a CRM tool, and sends push notifications. In addition, the company recently implemented an API layer on top of the hub that front-end applications such as the website, mobile apps, and browser extensions can call to pull back data sets for use within the application experience.
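As a hedged illustration of that kind of API layer (the data set names, lookup keys, and payloads here are invented, not Ebates' actual API), a front-end-facing handler might do little more than serve precomputed data sets out of the hub:

```python
import json

# Hypothetical precomputed data sets the hub might expose; contents invented.
DATASETS = {
    ("recommended_retailers", "user_42"): ["shoestore", "bookshop"],
}

def handle_request(dataset, user_id):
    """Mimic an API-layer endpoint: look up a precomputed data set
    and return a JSON payload plus an HTTP-style status code."""
    rows = DATASETS.get((dataset, user_id))
    if rows is None:
        return json.dumps({"error": "not found"}), 404
    return json.dumps({"dataset": dataset, "rows": rows}), 200

body, status = handle_request("recommended_retailers", "user_42")
print(status, body)
```

The design point is that the heavy analytics run on the cluster ahead of time; the API layer just hands finished results to the website, mobile apps, and browser extensions.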
Beyond that, Ebates' new setup has enabled data modeling of extremely large volumes of data, including unstructured data, which has driven more sophisticated analytics, Stange-Tregear said. One of the simpler examples is an analysis of the best times of day to send communications to customers to maximize the likelihood they will open and click through.
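A minimal sketch of that kind of send-time analysis, assuming a simple send/open log (the schema and the numbers are invented for illustration, not Ebates' data):

```python
from collections import defaultdict

# Invented send/open log: one row per email, with the hour it was sent.
events = [
    {"hour": 9,  "opened": True},
    {"hour": 9,  "opened": False},
    {"hour": 9,  "opened": True},
    {"hour": 18, "opened": False},
    {"hour": 18, "opened": True},
]

def best_send_hour(events):
    """Return the hour of day with the highest email open rate."""
    sent = defaultdict(int)
    opened = defaultdict(int)
    for e in events:
        sent[e["hour"]] += 1
        if e["opened"]:
            opened[e["hour"]] += 1
    return max(sent, key=lambda h: opened[h] / sent[h])

print(best_send_hour(events))
```

At Ebates' scale the aggregation would run over the cluster (and could be cut per customer rather than globally), but the logic reduces to this: group sends by hour, compute open rates, and pick the winner.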
Another example with a predictive analytics twist is a sort of recommendation engine. Ebates can predict which retailers a customer will like based on the customer's previous activities and then recommend that new retail option to the customer.
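One simple way such a prediction could work, sketched here with invented data and a basic co-occurrence score rather than Ebates' actual model: recommend the retailer most often shopped by customers whose histories overlap with the target customer's.

```python
from collections import Counter

# Invented shopping histories; retailer names are made up.
histories = {
    "alice": {"shoestore", "bookshop"},
    "bob":   {"shoestore", "bookshop", "techmart"},
    "carol": {"bookshop", "techmart"},
}

def recommend(user, histories):
    """Score each retailer the user hasn't tried by how often it
    co-occurs with the user's own retailers; return the top one."""
    seen = histories[user]
    scores = Counter()
    for other, retailers in histories.items():
        if other == user:
            continue
        overlap = len(seen & retailers)
        for r in retailers - seen:
            scores[r] += overlap
    return scores.most_common(1)[0][0] if scores else None

print(recommend("alice", histories))
```

A production recommender would weigh far more signals (recency, spend, clickstream behavior), but co-occurrence scoring like this captures the core idea of predicting a retailer from previous activity.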
While everything so far has been on-premises, Stange-Tregear said that the company is now looking to use the cloud in a limited way where it makes financial sense. For instance, the company is considering the cloud for testing very large data science processes that it may not want to run on the core cluster. Ebates may also move some small pre-processed data sets up to the cloud when it knows the only interactions will be simple reads.
"We are looking at the cloud as a niche provider where it makes financial sense," Stange-Tregear said.