When some IT pros encounter Big Data, they think of big-name IT vendors. Others think of Google. They reckon a company that does a fantastic job searching the Web must know something about managing lots of data.
It does. Google published a white paper in 2004 on MapReduce, its programming model for processing big data sets, and another on the Google File System; together they inspired a new approach to big data computing. Among the first champions were developers, many of them at Yahoo, who came up with Hadoop.
Now an Apache open source framework, Hadoop includes the Hadoop Distributed File System and a MapReduce engine. Think of MapReduce and Hadoop as alternatives for distributed big data processing that may deliver speed, cost, and flexibility advantages over just using massively parallel processing or column-store database options.
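The model is simpler than it sounds: a map function turns each input record into key-value pairs, the framework sorts and groups the pairs by key, and a reduce function aggregates each group. Here is a minimal local sketch of that pipeline, using the classic word-count example rather than any one vendor's API:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(record):
    """Emit a (word, 1) pair for each word in one line of input."""
    for word in record.lower().split():
        yield (word, 1)

def reduce_phase(key, values):
    """Sum the counts emitted for a single word."""
    return (key, sum(values))

def run_mapreduce(records):
    """Simulate map -> shuffle/sort -> reduce on one machine."""
    # Map: apply map_phase to every input record
    pairs = [kv for rec in records for kv in map_phase(rec)]
    # Shuffle/sort: group pairs by key, as the framework would
    pairs.sort(key=itemgetter(0))
    # Reduce: one call per distinct key
    return [reduce_phase(k, (v for _, v in group))
            for k, group in groupby(pairs, key=itemgetter(0))]

counts = dict(run_mapreduce(["big data big deal", "big data tools"]))
```

Because map calls are independent and reduce calls only need the records sharing one key, a framework like Hadoop can spread both phases across hundreds of machines; that parallelism is where the speed advantage comes from.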
Barnes & Noble chose vendor Aster Data in part because it supports in-database MapReduce, which the bookseller thinks will help its data warehouse scale out and perform better. MapReduce lets researchers see trends more quickly than massively parallel processing alone would, says Marc Parrish, Barnes & Noble's VP of retention and loyalty marketing. With the old system, for example, a report on e-book downloads was getting delivered later and later in the day as e-book sales took off last year; the system was choking on the data. "When you're putting database table joins on joins on joins, it's much more efficient to move that query into a MapReduce environment," Parrish says.
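Parrish's point about stacked joins can be illustrated with a reduce-side join, a common MapReduce pattern. This is a generic sketch with hypothetical table and column names, not Barnes & Noble's or Aster Data's actual implementation: each input row is tagged with its source table and keyed on the join column, the shuffle lands every row sharing a key on one reducer, and the reducer pairs them up in a single pass, with no repeated index lookups.

```python
from collections import defaultdict

def reduce_side_join(orders, customers):
    """Sketch of a reduce-side equi-join on a shared customer_id key.

    Tag rows by source table, group by key (the "shuffle"), then
    cross the two tagged lists within each group (the "reduce").
    """
    grouped = defaultdict(lambda: {"orders": [], "customers": []})
    # "Map" phase: key each record and tag it with its source
    for customer_id, item in orders:
        grouped[customer_id]["orders"].append(item)
    for customer_id, name in customers:
        grouped[customer_id]["customers"].append(name)
    # "Reduce" phase: pair rows from both tables per key
    joined = []
    for customer_id, tables in grouped.items():
        for name in tables["customers"]:
            for item in tables["orders"]:
                joined.append((customer_id, name, item))
    return joined

rows = reduce_side_join(
    orders=[(1, "ebook-A"), (1, "ebook-B"), (2, "ebook-C")],
    customers=[(1, "Alice"), (2, "Bob")],
)
```

Chaining several such joins becomes a series of map and reduce stages that each scan the data once, which is the efficiency Parrish is describing.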
Security software maker McAfee is using Hadoop in part because it can handle functions that just don't work well in relational databases. Text analysis, for example, may involve sparse data in which not all columns appear consistently. McAfee also used Hadoop for some high-scale enterprise data warehouse advantages when it consolidated data warehouses. McAfee previously had data warehouses for each type of threat it studied--spam, malware, firewall attacks. Bringing that data together lets McAfee see correlations and explicit connections between different types of threats and perpetrators, says Sven Krasser, McAfee's senior director of data mining research.
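The sparse-data problem Krasser describes is easy to picture: a relational table needs a column for every token that might appear, while most documents contain only a handful of them. A sketch, with invented feature names standing in for real threat data, of how such records can be kept as key-value maps and aggregated MapReduce-style without ever materializing the absent columns as NULLs:

```python
# Hypothetical sparse records: each document stores only the
# tokens it actually contains, not a wide fixed schema.
docs = [
    {"wire-transfer": 3, "click": 1},     # spam sample
    {"invoice": 2, "exe": 1},             # malware lure
    {"click": 2, "unsubscribe": 1},       # marketing mail
]

def feature_frequency(records):
    """Count how many records contain each feature.

    A map/reduce-style aggregation: every record emits its own
    keys, and absent features simply never appear.
    """
    counts = {}
    for record in records:
        for feature in record:
            counts[feature] = counts.get(feature, 0) + 1
    return counts

freq = feature_frequency(docs)
```

Keyed records like these also group naturally across sources, which is one way consolidating the separate spam, malware, and firewall warehouses can surface correlations between threat types.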
Not Easy To Use
The downside of MapReduce and Hadoop (and many emerging NoSQL platforms) is that they're immature, especially compared with SQL, which is now pushing 40 years old. The tools and interfaces are very version 1.0--at best. McAfee is using Datameer's tool for Hadoop search and is testing its tool for spreadsheet-style reporting and trend analysis, and both are in beta.
Another drawback: Most data warehousing and analytics professionals aren't used to the development environments these platforms require--Java, Python, and Perl, for example--and may lack the technical depth needed.
Digital marketing firm Adknowledge turned to Hadoop several years ago when its first-generation Netezza deployment reached its scalability limits. The company, which uses predictive analytics to optimize online marketing, built an on-premises Hadoop deployment and later tapped Hadoop instances in the cloud, on Amazon EC2.
To consolidate, Adknowledge completed a 100-TB Greenplum data warehouse deployment in February. It chose Greenplum in part because it integrates with Hadoop, but now it's curtailing Hadoop use. Greenplum lets it give data access to a broader group of people; in Hadoop, people "may have to write code to process the data," says Matt Hoggatt, Adknowledge's director of software development.