Scalability, flexibility, and low cost are the praises we hear repeatedly from Hadoop adopters, and Rapleaf is no exception. The company, which provides companies with data about their online customers, chose Hadoop to replace a MySQL relational database processing workflow nearly four years ago, and it's finding advantages in being able to quickly add new data types to meet changing business needs.
Rapleaf provides data that companies can add to their own customer data in order to do better online personalization and targeting. Like many data providers, Rapleaf trades in demographic information, such as age, income, gender, education, and marital status, as well as psychographic information, such as hobbies, activities, and interests. It partners with email service providers, such as Constant Contact and Exact Target, to help companies doing digital marketing campaigns.
Rapleaf gets its data from many sources, but the total amount it processes is modest--tens of terabytes compared with the hundreds of terabytes and petabytes that some Hadoop users process. In addition, where many Hadoop users churn through ever-changing data, such as clickstreams that constantly reveal people's latest online activities, Rapleaf's deployment reprocesses a fairly stable core of information, as the universe of households and Internet users doesn't change dramatically.
What does change are the attributes Rapleaf must provide from its stockpile of data, as marketers seek new ways to target consumers. That's where Hadoop's ability to tap new data sources and mix data types comes in. If Rapleaf used a processing system based on a traditional relational database, it would have to use a predefined schema or data model. A database about people, for instance, would require a table with specific columns for attributes such as age, gender, and income level. If it wanted to add new data containing attributes that weren't originally included such as Twitter handle or Facebook name, IT would face the time-consuming task of adding new columns to the table. The larger the table, the bigger the problem.
"Just adding one column to a large table within a relational database can easily take hours, days, or worse, and that's totally unacceptable," says Jeremy Lizt, Rapleaf's VP of engineering.
Using Hadoop, Rapleaf doesn't need to create a new column; it simply tweaks what it calls its "people profile," and new attributes can be extracted in the next round of data processing, adding new sources as necessary to derive the additional information required. Thanks to the platform's scalability and use of highly distributed MapReduce processing, that processing can happen within minutes.
In the early days of its deployment, the young Hadoop platform that Rapleaf was using didn't have a lot of tools and industry best practices to draw from. Hadoop has since matured, and Lizt says enterprise support provider Cloudera helped Rapleaf deal with bug fixes.
"Hadoop is a lot more stable today than it was when we started, and it's obviously going to continue to evolve because it just makes so much sense for anybody who needs to do large-scale data processing," Lizt says.
Hadoop Spurs Big Data Revolution