R&D Roots At AOL
AOL has been using Hadoop for more than three years, first in its R&D unit, to make sense of the navigation patterns of the more than 180 million unique site visitors per month across AOL.com, MapQuest, the Huffington Post, and dozens of other sites it owns.
AOL starts by gathering as much information as possible about visitors' activities. That's where Hadoop's low cost and scalability come in. "When you do the math, the cost per node of commodity systems versus commercial systems makes the choice very obvious," says Bao Nguyen, AOL's technical director of R&D for large-scale analytics. "The cost per node is orders of magnitude higher for the commercial systems."
AOL's R&D unit has a 300-node Hadoop deployment of mixed vintage and capacity in Mountain View, Calif. That system can store more than 500 TB of clickstream data on billions of events per day. An event can be someone clicking on an email promotion or banner ad, doing a search, reading an article, visiting a site, or clicking on a particular product on an e-commerce page. Events can also include time stamps added to the history and profile of a particular visitor (known by a particular cookie ID number but not by personally identifiable information).
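The events described above amount to a simple, repeating record shape. A minimal sketch of such a clickstream event might look like the following; the field names and values are illustrative assumptions, not AOL's actual schema:

```python
from dataclasses import dataclass, field
import time

# Hypothetical clickstream event record. Note the visitor is keyed by an
# anonymous cookie ID, not by personally identifiable information.
@dataclass
class ClickEvent:
    cookie_id: str    # anonymous visitor identifier
    event_type: str   # e.g. "ad_click", "search", "article_view"
    path: str         # page, ad, or product involved
    timestamp: float = field(default_factory=time.time)  # added to the visitor's history

event = ClickEvent("c-8f31", "ad_click", "/autos/reviews")
```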
This clickstream data is highly structured, but it's so massive and varied that it would be next to impossible to handle all the extract, transform, and load work that would be required to move it into a conventional relational database. AOL uses Hadoop's MapReduce processes to filter and correlate data, distributing text extraction, correlation, and calculation steps across hundreds of compute nodes.
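The map-then-reduce pattern described above can be simulated locally in a few lines. This sketch maps each raw log line to a (category, 1) pair and then sums counts per category in the reduce step; the tab-separated log format and the category rules are hypothetical stand-ins, not AOL's actual pipeline:

```python
from collections import defaultdict

# Illustrative raw clickstream lines: cookie_id \t event_type \t path
RAW_LOGS = [
    "c-8f31\tarticle_view\t/autos/reviews",
    "c-8f31\tsearch\tused+cars",
    "c-17aa\tarticle_view\t/finance/mortgages",
]

def map_phase(line):
    """Map step: extract a (category, 1) pair from one raw log line."""
    cookie, event, path = line.split("\t")
    if "auto" in path or "car" in path:          # crude category rule
        yield ("automobiles", 1)
    elif "finance" in path or "mortgage" in path:
        yield ("finance", 1)

def reduce_phase(pairs):
    """Reduce step: sum the counts emitted for each category key."""
    totals = defaultdict(int)
    for key, count in pairs:   # stands in for Hadoop's shuffle + reduce
        totals[key] += count
    return dict(totals)

counts = reduce_phase(p for line in RAW_LOGS for p in map_phase(line))
# counts -> {"automobiles": 2, "finance": 1}
```

In a real deployment the same mapper and reducer logic would run in parallel across hundreds of nodes, with Hadoop handling the distribution and the shuffle between the two phases.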
With MapReduce job after MapReduce job, AOL refines massive amounts of raw data into thousands of categories, such as automobiles, news, finance, and sports. Next, it identifies features and attributes of the visitors to each category, determining whether they're car buyers, mortgage prospects, male heads of household, or teenagers, for example.
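The second refinement stage, identifying visitor features from category activity, might look roughly like the following. The thresholds, category counts, and attribute names here are illustrative assumptions, not AOL's real segmentation logic:

```python
from collections import defaultdict

# Hypothetical output of the earlier jobs: per-visitor event counts by category.
visitor_counts = {
    "c-8f31": {"automobiles": 12, "news": 3},
    "c-17aa": {"finance": 9, "news": 1},
}

def derive_attributes(counts, threshold=5):
    """Tag each visitor with coarse audience attributes based on category activity."""
    attrs = defaultdict(set)
    for cookie, cats in counts.items():
        if cats.get("automobiles", 0) >= threshold:
            attrs[cookie].add("car_buyer_prospect")
        if cats.get("finance", 0) >= threshold:
            attrs[cookie].add("mortgage_prospect")
    return dict(attrs)

profiles = derive_attributes(visitor_counts)
# profiles -> {"c-8f31": {"car_buyer_prospect"}, "c-17aa": {"mortgage_prospect"}}
```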
AOL then feeds the final, refined feature sets into proprietary analytic applications (many built on conventional relational platforms) that handle the core business priority: delivering the right ad banners and email campaigns to the right people at the right time.
When online behavior shows that a visitor is interested in cars, Hadoop helps AOL figure that out and deliver a relevant ad. Hadoop is a batch-oriented platform, so it might take a day or two for such indicators to emerge. But profiles have a way of building over time and providing rich, multi-attribute targeting possibilities.
The success of the R&D Hadoop deployment led AOL to deploy an even larger, 700-node production system in April at its Dulles, Va., headquarters. The R&D unit now does more exploratory and ad hoc analyses, while the petabyte-scale production deployment does proven analyses, such as routine customer segmentation and online behavioral analysis. For example, an ad-targeting model running on the production deployment correlates data on the online and offline buying behavior of customers of large retailers that have both physical and online stores. AOL uses this anonymized data to build customer profiles and predictive models that let it aim online advertising at its 180 million unique online visitors per month.