Interest in Hadoop is booming, so it should be no surprise that commercial vendors are piling on with products that promise to make the open source big data platform more reliable, more versatile, less expensive (by reducing required hardware investments) or faster.
Enter EMC Isilon and RainStor, both of which say they're plugging gaps in Hadoop to meet enterprise-grade needs. Each vendor brings a new twist to HDFS, Hadoop's distributed file system. EMC Isilon has tied its network-attached storage to HDFS, while RainStor has added a database on top of the file system that promises high compression as well as support for SQL analysis.
With the latest upgrade of its NAS operating system, OneFS 6.5, released this week, EMC's Isilon Systems unit has integrated its NAS architecture with the HDFS protocol. This integration lets customers scale out storage in a distributed fashion, but on an Isilon NAS rather than on the commodity hardware typically used to run HDFS.
Isilon's new Hadoop option will be most attractive to customers that already have the vendor's NAS. The combination lets them use the platform for multiple high-scale storage needs. There's no need to create a separate, commodity-hardware-based storage platform just for Hadoop--though customers will still need a clustered server environment to provide Hadoop's compute capacity (which is available by way of the EMC Data Computing Appliance).
Another benefit of Isilon's NAS is that it provides enterprise-class data protection capabilities, including snapshots, replication, and backup. This eliminates the single point of failure inherent in Hadoop's NameNode, the controlling node of a cluster, which holds the metadata about the files stored on each data node. Isilon's NAS ensures high availability, and snapshots can be used to rebuild the cluster in the unlikely event of a complete failure.
The Isilon NAS can be used with any Apache Hadoop distribution, including those from Cloudera, Hortonworks, and MapR. But part of the appeal of the vendor's new Hadoop support is one-stop shopping and support from EMC, which also offers the Greenplum HD community distribution of Apache Hadoop.
Somewhat confusingly, EMC also offers the recently renamed Greenplum MR distribution (formerly HD Enterprise Edition), which is based on MapR's distribution of Hadoop. MapR does away with the NameNode problem entirely by replacing HDFS with its own proprietary file system, which can be accessed through NFS (the Unix-based Network File System). MapR's proprietary components support high availability and, the vendor maintains, higher scalability and performance than HDFS. EMC bills Greenplum MR as its high-performance distribution, but the name change and new Isilon tie hint that EMC is hedging its bets between Apache and proprietary MapR Hadoop distributions.
Compress For Success
RainStor started working on the big data problem long before the term became fashionable. The eight-year-old company has focused mostly on high-scale archival storage to meet compliance needs. RainStor Big Data Analytics, introduced last month, puts the vendor's database technology on top of HDFS. The promise is high data compression--up to 40x--while supporting both SQL querying and MapReduce processing of that data.
Data compression is a gift that keeps on giving because it reduces storage requirements and cost. RainStor says Hadoop clusters can be 50% to 80% smaller in terms of storage capacity with its technology in place. By the same arithmetic, if you already have a Hadoop cluster, adding RainStor should let you store roughly two to five times the data (depending on the data type) without adding hardware.
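The link between a storage reduction and extra headroom on existing hardware is simple arithmetic, and a short Python sketch makes it concrete. The function names here are illustrative and not part of any RainStor or Hadoop API, and the figures plugged in are only the vendor-quoted percentages from above, not measured results.

```python
def capacity_multiplier(storage_reduction: float) -> float:
    """How many times more data fits on the same hardware if a data set
    needs `storage_reduction` (a fraction between 0 and 1) less space."""
    if not 0 <= storage_reduction < 1:
        raise ValueError("storage_reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - storage_reduction)


def reduction_from_ratio(compression_ratio: float) -> float:
    """Convert a compression ratio such as 40 (for 40x) into the fraction
    of storage saved."""
    if compression_ratio < 1:
        raise ValueError("compression_ratio must be at least 1")
    return 1.0 - 1.0 / compression_ratio


if __name__ == "__main__":
    # Vendor-quoted range: clusters 50% to 80% smaller.
    for reduction in (0.5, 0.8):
        print(f"{reduction:.0%} smaller -> "
              f"{capacity_multiplier(reduction):.1f}x the data")
    # Best-case compression claim: up to 40x.
    print(f"40x compression -> {reduction_from_ratio(40):.1%} less storage")
```

A 50% reduction corresponds to twice the data on the same hardware, and larger reductions give proportionally more; where a given data set actually lands depends on how repetitive it is.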
RainStor's database can query data in its compressed format, eliminating the decompression step and improving performance. The caveat is that the technology is best suited to historical data that doesn't change, not live information that's constantly updated (like a customer database). The compression technology relies on value- and pattern-de-duplication techniques, so it's best suited to data that has repeating values or patterns. Log files, clickstreams and call detail records fit this description (and are also historical records), but video, image and voice data do not.
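To see why value de-duplication can both shrink repetitive data and allow queries without decompression, consider a minimal Python sketch of dictionary encoding. This is a generic textbook technique used for illustration only; it is not RainStor's actual storage format, and the function names and sample values are hypothetical.

```python
def encode_column(values):
    """Store each distinct value once and replace every occurrence with a
    small integer code (value de-duplication via dictionary encoding)."""
    dictionary = {}  # distinct value -> integer code
    codes = []       # one code per original row
    for value in values:
        # A new value gets the next unused code; a repeat reuses its code.
        code = dictionary.setdefault(value, len(dictionary))
        codes.append(code)
    return dictionary, codes


def count_equal(dictionary, codes, target):
    """Count rows equal to `target` by comparing integer codes only,
    without rebuilding the original values."""
    code = dictionary.get(target)
    if code is None:
        return 0  # the value never occurs in the column
    return sum(1 for c in codes if c == code)


if __name__ == "__main__":
    # Hypothetical HTTP status column from a web log: few distinct values.
    statuses = ["200", "200", "404", "200", "500", "404"]
    dictionary, codes = encode_column(statuses)
    print(dictionary)
    print(codes)
    print(count_equal(dictionary, codes, "404"))
```

The more often values repeat, as in logs and clickstreams, the smaller the dictionary is relative to the data. Video or voice data rarely repeats exact values, so an encoding like this saves little, and constantly updated records would force the dictionary and codes to be maintained on every change.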
More To Come
The EMC and RainStor announcements aren't the first of their kind, and they won't be the last. In November, Cloudera announced support for the NetApp Open Solution for Hadoop, a reference storage architecture based on the storage vendor's hardware. Like EMC Isilon's Hadoop offering, Open Solution decouples storage and compute capacity while promising higher availability and reliability than a conventional deployment.
RainStor's ability to run both SQL and MapReduce is appealing because you don't have to bother with moving large data sets between separate environments, a time-consuming task. Other vendors also offer a single point of access to Hadoop; in the case of Hadapt, for instance, it's all about using SQL, MapReduce, and related analytics from one spot. Compression isn't part of that story.
The Internet giants that pioneered Hadoop were used to its rough edges. Enterprises used to stability and high availability won't be so forgiving. It's a safe bet that more vendors and proprietary enhancements will emerge.