News | 11/7/2011, 11:20 AM

Hadoop Spurs Big Data Revolution

Open source data processing platform has won over Web giants for its low cost, scalability, and flexibility. Now Hadoop will make its way into more enterprises.

Hadoop Basics

Inspired in large part by a 2004 white paper in which Google described its use of MapReduce techniques, Hadoop is a Java-based software framework for distributed processing of data-intensive transformations and analyses. MapReduce breaks a big data problem into subproblems; distributes them onto tens, hundreds, and even thousands of processing nodes; and then combines the results into a smaller, easy-to-analyze data set.
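To make that split-and-recombine pattern concrete, here is a minimal word-count sketch written against Hadoop's Java MapReduce API (the class names TokenMapper and SumReducer are illustrative, not taken from the article): the map step runs in parallel on each node's slice of the input, and the reduce step folds those partial counts into the smaller, easy-to-analyze result set described above.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: each node emits (word, 1) for every word in its share of the input.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce step: combine the partial counts for each word into a single total.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

In a full job, a small driver class would set the input and output paths and submit these two classes to the cluster, which handles distributing the work across nodes.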

Hadoop includes several important subprojects and related Apache projects. The Hadoop Distributed File System (HDFS) gives the platform massive yet low-cost storage capacity. The Pig data-flow language is used to write parallel processing jobs. The HBase distributed, column-oriented database gives Hadoop a structured-data storage option for large tables. And the Hive distributed data warehouse supports data summarization and ad hoc querying.
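As a small illustration of the storage layer that Pig, HBase, and Hive all build on, the sketch below (the file path and contents are hypothetical) uses the HDFS Java API to write a file into the distributed file system and read it back.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up cluster settings (core-site.xml, hdfs-site.xml) from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file into HDFS (path and record are illustrative).
        Path path = new Path("/user/demo/clickstream/sample.log");
        FSDataOutputStream out = fs.create(path);
        out.writeUTF("2011-11-07 10:20:00 GET /index.html 200");
        out.close();

        // Read it back; the blocks may be served from any node holding a replica.
        FSDataInputStream in = fs.open(path);
        System.out.println(in.readUTF());
        in.close();
        fs.close();
    }
}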

Hadoop gets its well-known scalability from its ability to distribute large-scale data processing jobs across thousands of compute nodes built on low-cost x86 servers. Per-node capacity keeps growing, thanks to Moore's Law and ever-denser memory and disk drives. The latest supporting hardware deployments combine 16 compute cores, 128 MB of RAM, and as much as 12 TB or even 24 TB of hard disk capacity per node. The cost of each node is about $4,000, according to Cloudera, the leading provider of commercial support and enterprise management software for Hadoop deployments. That cost is a fraction of the $10,000 to $12,000 per terabyte for the most competitively priced relational database deployments.
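A rough back-of-the-envelope check on those figures: a $4,000 node with 12 TB of raw disk works out to roughly $330 per raw terabyte, or on the order of $1,000 per usable terabyte once HDFS's default three-way replication is taken into account, still well below the $10,000-to-$12,000 relational range cited above.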

This high-capacity and low-cost combination is compelling enough, but Hadoop's other appeal is its ability to handle mixed data types. It can manage structured data as well as highly variable data sources, such as sensor and server log files and Web clickstreams. It can also manage unstructured, text-centric data sources, such as feeds from Facebook and Twitter. ("Loosely structured" or "free form" are actually more accurate descriptions of this type of data, but "unstructured" is the description that has stuck.)

This ability to handle various types of data is so important that it has spawned the broader NoSQL (not only SQL) movement. Platforms and products such as Cassandra, CouchDB, MongoDB, and Oracle's new NoSQL database address the need for data flexibility in transactional processing. Hadoop, meanwhile, has garnered most of the attention for supporting data analysis.

Relational databases, such as IBM DB2, Oracle, Microsoft SQL Server, and MySQL, can't handle mixed data types and unstructured data, because that data doesn't fit into the columns and rows of a predefined data model (see "Hadoop's Flexibility Wins Over Online Data Provider").

Comments
D. Henschen, User Rank: Author
6/15/2012 | 10:10:09 PM
re: Hadoop Spurs Big Data Revolution
That's per core, but these stats have all been surpassed with the latest hardware.
molloy, User Rank: Apprentice
12/6/2011 | 4:08:09 AM
re: Hadoop Spurs Big Data Revolution
Reading through the whole document, I see only one mention of Yahoo, and no mention of Yahoo as the originator of Hadoop. It sometimes appears that the press is intent on highlighting all of Yahoo's weaknesses and none of its strengths. Perhaps you think this information is already well known, but the pie chart showing that 74% have "no current or planned use" would suggest otherwise. For those who wish to read more meaty detail, see http://developer.yahoo.com/had....
IKODUKULA945, User Rank: Apprentice
12/4/2011 | 8:59:29 PM
re: Hadoop Spurs Big Data Revolution
Matspca - we're working on establishing a benchmark for Hadoop. If you'd like to participate, please let me know at indu.kodukula@sungard.com
matspca, User Rank: Apprentice
11/30/2011 | 11:01:45 PM
re: Hadoop Spurs Big Data Revolution
Not everyone believes in the hype of Hadoop. See http://www.vertica.com/2011/09... The big organizations mentioned here can afford to use non-optimal solutions. I have seen no benchmark showing Hadoop beating, say, Oracle. My own NoSQL database beats Hadoop by a large margin using a $330 PC versus the $1 million (or so) used by Hadoop for the same benchmark. See http://www.velocitydb.com/Comp...

I will continue following the hype of Hadoop, and if there really is some substance behind it, then I look forward to a .NET version of the distribution mechanism.
RodneyG79, User Rank: Apprentice
11/10/2011 | 8:48:32 PM
re: Hadoop Spurs Big Data Revolution
128 MB of RAM for 16 cores? That has to be a typo.