11/7/2011
11:20 AM

Hadoop Spurs Big Data Revolution

Open source data processing platform has won over Web giants for its low cost, scalability, and flexibility. Now Hadoop will make its way into more enterprises.

R&D Roots At AOL

AOL has been using Hadoop for more than three years, first in its R&D unit, to make sense of the navigation patterns of the more than 180 million unique site visitors per month across AOL.com, MapQuest, the Huffington Post, and dozens of other sites it owns.

AOL starts by gathering as much information as possible about visitors' activities. That's where Hadoop's low cost and scalability come in. "When you do the math, the cost per node of commodity systems versus commercial systems makes the choice very obvious," says Bao Nguyen, AOL's technical director of R&D for large-scale analytics. "The cost per node is orders of magnitude higher for the commercial systems."

AOL's R&D unit has a 300-node Hadoop deployment of mixed vintage and capacity in Mountain View, Calif. That system can store more than 500 TB of clickstream data on billions of events per day. An event can be someone clicking on an email promotion or banner ad, doing a search, reading an article, visiting a site, or clicking on a particular product on an e-commerce page. Events can also include time stamps added to the history and profile of a particular visitor (known by a particular cookie ID number but not by personally identifiable information).

This clickstream data is highly structured, but it's so massive and varied that it would be next to impossible to handle all the extract, transform, and load work that would be required to move it into a conventional relational database. AOL uses Hadoop's MapReduce processes to filter and correlate data, distributing text extraction, correlation, and calculation steps across hundreds of compute nodes.
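The filter-and-correlate pattern described above can be illustrated with a minimal in-memory sketch in Python. This is not AOL's code and it does not use Hadoop's API; the event fields (`cookie_id`, `event_type`), the sample records, and the function names are all hypothetical, chosen only to show how a map step emits key-value pairs and a reduce step aggregates them per key.

```python
from collections import defaultdict

# Hypothetical clickstream records; the field names are assumptions for illustration.
EVENTS = [
    {"cookie_id": "c1", "event_type": "search", "ts": 1},
    {"cookie_id": "c1", "event_type": "ad_click", "ts": 2},
    {"cookie_id": "c2", "event_type": "article_read", "ts": 3},
    {"cookie_id": "c1", "event_type": "ad_click", "ts": 4},
    {"cookie_id": "c2", "event_type": "heartbeat", "ts": 5},
]

# Event types worth keeping (an assumed filter, not taken from the article).
KEEP = {"search", "ad_click", "article_read"}


def map_event(event):
    """Map step: drop uninteresting events, emit ((cookie, type), 1) for the rest."""
    if event["event_type"] in KEEP:
        yield (event["cookie_id"], event["event_type"]), 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle phase."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)


def run_job(events):
    """Run map, shuffle, and reduce in a single process."""
    pairs = (pair for event in events for pair in map_event(event))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


counts = run_job(EVENTS)
```

On a real cluster the map and reduce functions would run in parallel across many nodes, with the framework handling the grouping step; here everything runs in one process so the data flow is easy to follow.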

With MapReduce job after MapReduce job, AOL refines massive amounts of raw data into thousands of categories, such as automobiles, news, finance, and sports. Next, it identifies features and attributes of the visitors to each category, determining whether they're car buyers, mortgage prospects, male heads of household, or teenagers, for example.
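Chaining one job into the next, as described above, might look like the following Python sketch. It is an illustration under stated assumptions, not AOL's pipeline: the keyword-to-category lookup, the sample page views, the `min_views` threshold, and both function names are hypothetical. The first pass counts page views per visitor and category, and the second pass turns those counts into per-visitor interest features.

```python
from collections import Counter, defaultdict

# Hypothetical URL keyword to category lookup; a real taxonomy would be far larger.
CATEGORY_KEYWORDS = {"autos": "automobiles", "money": "finance", "scores": "sports"}

# Hypothetical (cookie_id, path) page views.
PAGE_VIEWS = [
    ("c1", "/autos/sedan-review"),
    ("c1", "/autos/dealer-locator"),
    ("c1", "/money/mortgage-rates"),
    ("c2", "/scores/nba"),
    ("c2", "/weather/today"),
]


def categorize(page_views):
    """Job 1: count page views per (visitor, category)."""
    counts = Counter()
    for cookie_id, path in page_views:
        for keyword, category in CATEGORY_KEYWORDS.items():
            if f"/{keyword}/" in path:
                counts[(cookie_id, category)] += 1
    return counts


def derive_features(category_counts, min_views=2):
    """Job 2: keep the categories a visitor viewed at least min_views times."""
    features = defaultdict(set)
    for (cookie_id, category), n in category_counts.items():
        if n >= min_views:
            features[cookie_id].add(category)
    return dict(features)


# The output of the first job is the input of the second.
features = derive_features(categorize(PAGE_VIEWS))
```

With this sample data, only visitor `c1` crosses the assumed threshold, and only for the automobiles category.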

It feeds the final refined feature sets into more proprietary analytic applications (many built out on conventional relational platforms) that get down to the business priorities of delivering the right ad banners and email campaigns to the right people at the right time.

When online behavior shows that a visitor is interested in cars, Hadoop helps AOL figure that out and deliver a relevant ad. Hadoop is a batch-oriented platform, so it might take a day or two for such indicators to emerge. But profiles have a way of building over time and providing rich, multi-attribute targeting possibilities.

The success of the R&D Hadoop deployment led AOL to deploy an even larger, 700-node production system in April at its Dulles, Va., headquarters. The R&D unit now does more exploratory and ad hoc analyses, while the petabyte-scale production deployment does proven analyses, such as routine customer segmentation and online behavioral analysis. For example, an ad-targeting model running on the production deployment correlates data on the online and offline buying behavior of customers of large retailers that have both physical and online stores. AOL uses this anonymized data to build customer profiles and predictive models that let it aim online advertising at its 180 million unique online visitors per month.

Comments
D. Henschen
User Rank: Author
6/15/2012 | 10:10:09 PM
re: Hadoop Spurs Big Data Revolution
That's per core, but these stats have all been surpassed with the latest hardware.
molloy
User Rank: Apprentice
12/6/2011 | 4:08:09 AM
re: Hadoop Spurs Big Data Revolution
Reading through the whole document I see only one mention of Yahoo, and no mention of Yahoo as the originator of Hadoop. It sometimes appears that the Press is intent on highlighting all of Yahoo's weaknesses, and none of its strengths. Perhaps you think this information is already well-known, but the pie-chart showing that 74% have "no current or planned use" would suggest otherwise. For those who wish to read more meaty detail, see http://developer.yahoo.com/had....
IKODUKULA945
User Rank: Apprentice
12/4/2011 | 8:59:29 PM
re: Hadoop Spurs Big Data Revolution
Matspca - we're working on establishing a benchmark for Hadoop. If you'd like to participate, please let me know at indu.kodukula@sungard.com
matspca
User Rank: Apprentice
11/30/2011 | 11:01:45 PM
re: Hadoop Spurs Big Data Revolution
Not everyone believes in the Hype of Hadoop. See http://www.vertica.com/2011/09... The big organizations mentioned here can afford to use non-optimal solutions. I have seen no benchmark showing Hadoop beating, say, Oracle. My own noSQL database beats Hadoop by a large margin using a $330 PC versus the $1 million (or so) used by Hadoop for the same benchmark. See http://www.velocitydb.com/Comp...

I will continue following the Hype of Hadoop and if there really is some substance behind it then I look forward to a .NET version of the distribution mechanism.
RodneyG79
User Rank: Apprentice
11/10/2011 | 8:48:32 PM
re: Hadoop Spurs Big Data Revolution
128 MB of RAM for 16 cores? That has to be a typo.