Open source data processing platform has won over Web giants for its low cost, scalability, and flexibility. Now Hadoop will make its way into more enterprises.
Inspired in large part by a 2004 white paper in which Google described its use of MapReduce techniques, Hadoop is a Java-based software framework for distributed processing of data-intensive transformations and analyses. MapReduce breaks a big data problem into subproblems; distributes them onto tens, hundreds, and even thousands of processing nodes; and then combines the results into a smaller, easy-to-analyze data set.
Hadoop includes several important subprojects and related Apache projects. The Hadoop Distributed File System (HDFS) gives the platform massive yet low-cost storage capacity. The Pig data-flow language is used to write parallel processing jobs. The HBase distributed, column-oriented database gives Hadoop a structured-data storage option for large tables. And the Hive distributed data warehouse supports data summarization and ad hoc querying.
Hadoop gets its well-known scalability from its ability to distribute large-scale data processing jobs across thousands of compute nodes built on low-cost x86 servers. Its capacity is constantly increasing, thanks to Moore's Law and ever-rising memory and disk drive capacity. The latest supporting hardware deployments combine 16 compute cores, 128 MB of RAM, and as much as 12 TB or even 24 TB of hard disk capacity per node. The cost of each node is about $4,000, according to Cloudera, the leading provider of commercial support and enterprise management software for Hadoop deployments. That cost is a fraction of the $10,000 to $12,000 per terabyte for the most competitively priced relational database deployments.
This high-capacity and low-cost combination is compelling enough, but Hadoop's other appeal is its ability to handle mixed data types. It can manage structured data as well as highly variable data sources, such as sensor and server log files and Web clickstreams. It can also manage unstructured, text-centric data sources, such as feeds from Facebook and Twitter. ("Loosely structured" or "free form" are actually more accurate descriptions of this type of data, but "unstructured" is the description that has stuck.)
This ability to handle various types of data is so important it has spawned the broader NoSQL (not only SQL) movement. Platforms and products, such as Cassandra, CouchDB, MongoDB, and Oracle's new NoSQL database, address the need for data flexibility in transactional processing. Hadoop has garnered most of the attention for supporting data analysis.
Relational databases, such as IBM DB2, Oracle, Microsoft SQL Server, and MySQL, can't handle mixed data types and unstructured data, because they don't fit into the columns and rows of a predefined data model (see "Hadoop's Flexibility Wins Over Online Data Provider").
The Agile ArchiveWhen it comes to managing data, donít look at backup and archiving systems as burdens and cost centers. A well-designed archive can enhance data protection and restores, ease search and e-discovery efforts, and save money by intelligently moving data from expensive primary storage systems.
2014 Analytics, BI, and Information Management SurveyITís tried for years to simplify data analytics and business intelligence efforts. Have visual analysis tools and Hadoop and NoSQL databases helped? Respondents to our 2014 InformationWeek Analytics, Business Intelligence, and Information Management Survey have a mixed outlook.