Inspired by Google's research on and use of MapReduce and distributed file systems, Hadoop is a Java-based software framework for distributed processing of data-intensive transformations and analyses. The software automatically distributes processing of up to petabytes of data across thousands of nodes on low-cost, commodity hardware. Hadoop is being used for data transformations or first-cut analyses across vast data sets that can be handled more quickly or more cost-effectively (and sometimes both) than in conventional data warehouses. Much smaller result sets from Hadoop can then be brought into mainstream environments for conventional reporting and analysis.
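To make the MapReduce idea concrete, here is a minimal, single-machine sketch in Python. It is not Hadoop code and uses no Hadoop APIs; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample log lines are made up for illustration. It only shows the three steps that a framework like Hadoop would spread across many nodes: map each record to key/value pairs, group the pairs by key, and reduce each group to a small result.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group emitted values by key, the step a framework performs between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on one machine (illustrative only)."""
    mapped = chain.from_iterable(map_phase(record) for record in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input: a few log lines standing in for a very large data set.
    logs = ["error disk full", "warning disk slow", "error network down"]
    print(run_job(logs))
```

The output is a small dictionary of counts, which mirrors the pattern described above: a large input is boiled down to a much smaller result set that a conventional reporting tool could consume.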
Both Talend and Quest have introduced data-integration software that exchanges data between Hadoop and conventional databases and data warehouses. Talend yesterday announced Hadoop integration through the Talend Integration Suite 4.0, which is already available. The suite includes native support for both the Hadoop Distributed File System (HDFS) and the Hive database infrastructure built on top of Hadoop.
"We can put data into Hadoop, we can extract data from Hadoop using HDFS, and in the middle, we can generate Hive queries to run data transformations directly inside the Hive database," said Fabrice Bonan, Talend's co-founder and COO.
Talend's approach makes it possible for customers to combine Hadoop-based integration with traditional Extract-Transform-Load (ETL) or alternative Extract-Load-Transform (ELT) data integration processes. ELT pushes transformation into the database, which plays to the distributed-processing strengths of Hadoop.
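The difference between the two orderings can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Talend's implementation: an in-memory SQLite database stands in for Hive, and the table names, column names and sample rows are all hypothetical. In the ETL function the aggregation runs in the integration layer before loading; in the ELT function the raw rows are loaded first and a SQL statement runs the transformation inside the database engine.

```python
import sqlite3

# Hypothetical raw extract: (customer, amount-as-text) rows.
RAW_ROWS = [("alice", "12.50"), ("bob", "7.25"), ("alice", "3.00")]


def etl(conn, rows):
    """ETL: transform in the integration layer, then load only the result."""
    totals = {}
    for customer, amount in rows:
        totals[customer] = totals.get(customer, 0.0) + float(amount)
    conn.execute("CREATE TABLE totals_etl (customer TEXT, total REAL)")
    conn.executemany("INSERT INTO totals_etl VALUES (?, ?)", sorted(totals.items()))
    return conn.execute(
        "SELECT customer, total FROM totals_etl ORDER BY customer"
    ).fetchall()


def elt(conn, rows):
    """ELT: load raw rows first, then let the database engine transform them."""
    conn.execute("CREATE TABLE raw_orders (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", rows)
    conn.execute(
        "CREATE TABLE totals_elt AS "
        "SELECT customer, SUM(CAST(amount AS REAL)) AS total "
        "FROM raw_orders GROUP BY customer"
    )
    return conn.execute(
        "SELECT customer, total FROM totals_elt ORDER BY customer"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for Hive, for illustration only
    print(etl(conn, RAW_ROWS))
    print(elt(conn, RAW_ROWS))
    conn.close()
```

Both functions produce the same totals. What changes is where the work happens, and with ELT on Hadoop the heavy lifting would fall to the cluster rather than to the integration server.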
Quest Software last week announced two products related to Hadoop. The first, a data-access and data-management tool called Toad for Cloud Databases, will take advantage of Hadoop's ability to process non-relational data such as e-mail, images and XML. A beta product released last week addresses cloud-based databases such as Amazon SimpleDB, Microsoft Azure Table services and Apache HBase; a planned second beta release will support so-called NoSQL databases including Apache Cassandra and Apache Hadoop through Hive.
Quest's second announcement is the planned release, by year-end 2010, of a high-speed Oracle-to-Hadoop data-transfer tool. Code-named Ora-Oop, the utility will provide an interface for high-performance, bidirectional data transfer between Hadoop and Oracle. The tool is being developed in conjunction with Cloudera, which provides enterprise-grade support for the Cloudera Distribution for Hadoop (CDH). Ora-Oop will be supported by both vendors, and Cloudera says the tool will complement Sqoop, its existing open-source SQL-to-Hadoop database import tool included in CDH.
In a blue-chip example of Hadoop support, IBM last month announced consulting services for managing large volumes of data on Hadoop. As InformationWeek reported, IBM calls its package of services and Hadoop-based analytics BigInsights Core.
Given Hadoop's cost and speed advantages, Gartner analyst Don Feinberg believes it will be a fast-growing choice for cutting large-scale-data, complex-data and mixed-data analyses down to size. Since smaller result sets will then be brought into more conventional warehouses and BI environments, connectors will become commonplace.
"I think all the DBMS vendors will eventually create a high-speed interface like Ora-Oop to get data into and out of Hadoop," Feinberg said. "If it costs me a million dollars to add a node to my enterprise data warehouse, why do I want to do that when I can handle the big data with a bunch of cheap, commodity servers and disk drives?"
Alternative analytic database providers including Vertica and Greenplum have already added connectors between Hadoop and their database management systems. With Talend and Quest piling on, the trend is well underway.