The partnership is clearly a response to an alliance struck in May between NetApp rival EMC and MapR Technologies. As part of that deal, EMC entered the Hadoop enterprise support business in direct competition with Cloudera, and it incorporated MapR's software as part of a Greenplum HD Enterprise Edition Hadoop software distribution.
A Java-based platform for distributed data processing, Hadoop has gained interest and adoption in recent years on the strength of its ability to handle the big data encountered by Internet businesses and other organizations managing hundreds of terabytes, if not petabytes, of information. The storage opportunity has naturally attracted storage vendors EMC, which has $17 billion in annual revenue, and NetApp, its smaller rival with $5 billion in annual revenue.
Cloudera is the oldest and largest provider of enterprise support and Hadoop management software, with more than 100 customers, but it's a tiny company compared to NetApp and stands to gain a huge lift through that vendor's sales and distribution organization. As for the NetApp Open Solution for Hadoop, a reference architecture sounds like something you can hash out on a napkin, but the partners say the suggested configurations of software and hardware will speed deployment and have been tested in NetApp's labs to ensure performance.
Further, whereas the commodity servers typically used in Hadoop deployments limit flexibility of compute-capacity-to-storage ratios, Cloudera and NetApp say the Open Solution decouples storage and compute while also providing higher availability and reliability, and improved manageability for enterprise environments.
"In our approach, compute capacity can grow at the rate of the application requirements and storage can grow at the rate of the data requirements, and we think that's a huge benefit as customers start to build out their workloads," said Jeffrey O'Neal, NetApp's senior director of data center solutions.
As an example, where Hadoop nodes on pizza-box-style commodity servers often house eight drives, O'Neal said NetApp's hardware can put up to 14 2-TB drives behind a single compute node, with provisions for hot spares for better reliability. RAID storage is also built in for data protection. Disk drives are configured on trays, and because there are hot spares, failed drives can be swapped out without bringing down a node or removing a server. The architecture also provides NetApp NFS (Network File System) backup protection for the NameNode, a single point of failure in Hadoop deployments because it holds the file system metadata that every other node depends on.
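To put the drive counts above in perspective, here is a rough back-of-the-envelope sketch in Python. It rests on several assumptions that are not from the vendors: the commodity server is given 2-TB drives to match NetApp's, the 500-TB data set is a made-up figure in the "hundreds of terabytes" range, and the function names are hypothetical. It counts raw capacity only and ignores RAID overhead, hot spares, and HDFS replication, all of which would lower the usable figure.

```python
import math


def raw_capacity_tb(drives_per_node: int, drive_size_tb: float) -> float:
    """Raw (unprotected) disk capacity behind one compute node, in TB."""
    return drives_per_node * drive_size_tb


def nodes_needed(dataset_tb: float, capacity_per_node_tb: float) -> int:
    """Smallest whole number of nodes whose raw capacity covers the data set."""
    return math.ceil(dataset_tb / capacity_per_node_tb)


# Assumption: 2-TB drives in both cases; only the drive counts come from the article.
commodity_tb = raw_capacity_tb(drives_per_node=8, drive_size_tb=2.0)
netapp_tb = raw_capacity_tb(drives_per_node=14, drive_size_tb=2.0)

# Hypothetical data set size, chosen only for illustration.
dataset_tb = 500.0

print(f"Commodity node: {commodity_tb} TB raw, {nodes_needed(dataset_tb, commodity_tb)} nodes")
print(f"NetApp-backed node: {netapp_tb} TB raw, {nodes_needed(dataset_tb, netapp_tb)} nodes")
```

Under these assumptions, storage alone would dictate 32 commodity servers but only 18 compute nodes in the NetApp configuration. That gap is the kind of compute-to-storage flexibility the partners are claiming, though real sizing would depend on workload and data-protection overhead.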
Pointing out a key contrast with EMC's Hadoop offering, which is intended to be run on the EMC Greenplum Modular Data Computing Appliance (DCA), O'Neal said that the NetApp Open Solution does not "force you to use a particular database." That's a reference to the fact that the DCA can also run EMC's Greenplum database (but no other databases) for conventional relational data warehousing needs.
"Cloudera has supported connectivity with a huge variety of databases, including everything from Teradata and Netezza to Oracle, MySQL, and Vertica," said Ed Albanese, Cloudera's head of business development.
With IBM having released Hadoop-based BigInsights software and support, and Oracle and Microsoft having announced their intention to add their own Hadoop distributions and support (along with a Big Data Appliance from Oracle), it's clear that this data processing platform is headed for wider use.
Likening the NetApp-Cloudera reference architecture to an appliance configuration, MapR CEO John Schroeder said in a statement that the entry of commercial vendors into the Hadoop market will help make it "a safe choice" as a big data platform. "Most organizations run Hadoop by installing software on commodity hardware where you can purchase terabyte drives for less than $100," he said. "We'll see how the market responds to Hadoop appliance offerings."