That's why VMware is launching an open source project, Serengeti, to come up with changes to Hadoop that enable it to thrive on virtual as well as physical servers. It will offer the resulting code as open source under the Apache 2.0 license, and contribute the code's extensions to the core Hadoop project that's part of the Apache Software Foundation.
Hadoop has quickly become a primary handler of big data. It's open source code initially developed by Doug Cutting, now an architect at Cloudera, a Hadoop vendor. Hadoop combines a distributed file system with MapReduce, a method of assigning compute work to a node in a cluster that is closest to the source of the data.
For the most part, Hadoop requires a dedicated server cluster to do its work, and that's an expensive proposition in most working IT shops. If it could run in a virtualized environment on virtual servers, it could be activated and de-activated more easily, said Fausto Ibarra, senior director of product management at VMware.
In a virtual environment, it would be easier to overcome Hadoop's two single points of failure: the NameNode and the JobTracker. If the NameNode goes down in a Hadoop cluster, the system freezes. If Hadoop were running in a virtualized environment, the loss of the NameNode server would cause a duplicate virtual server to be activated off disk, and the system would resume running. The same applies to the JobTracker node, another essential server in the cluster and another single point of failure on a physical cluster.
But Hadoop can't be put in a virtualized environment without becoming aware that it's working with virtual machines. How aware of VMs is it? "Not very," said Ibarra in an interview.
Hadoop keeps three copies of data so that if one copy is lost, a primary and a backup copy remain. In a virtual setting, as in a physical one, that should mean placing each copy on a different physical server, since all three copies would be lost at the same time if they were stored in virtual machines on the same host. But Hadoop can't distinguish between a physical and a virtual server, so it wouldn't know how to distribute the copies correctly across a cluster.
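The placement constraint described above can be sketched in a few lines of Python. This is a minimal illustration, not Hadoop's or Serengeti's actual placement code: the function name `place_replicas`, the VM and host names, and the idea of handing the scheduler a VM-to-host map are all assumptions made for the example.

```python
from collections import defaultdict


def place_replicas(vm_to_host, replicas=3):
    """Pick one VM per distinct physical host for each replica.

    vm_to_host: dict mapping a VM name to the physical host it runs on
    (hypothetical topology information that a virtualization-aware
    Hadoop would need to be given).
    """
    # Group VMs by the physical host underneath them.
    vms_by_host = defaultdict(list)
    for vm, host in sorted(vm_to_host.items()):
        vms_by_host[host].append(vm)

    if len(vms_by_host) < replicas:
        raise ValueError(
            f"need {replicas} physical hosts, only {len(vms_by_host)} available"
        )

    # Take the first VM from each of the first `replicas` hosts, so that
    # no two copies share a physical machine.
    chosen_hosts = sorted(vms_by_host)[:replicas]
    return [vms_by_host[host][0] for host in chosen_hosts]


# Hypothetical cluster: two VMs share host-1.
topology = {
    "vm-a": "host-1",
    "vm-b": "host-1",
    "vm-c": "host-2",
    "vm-d": "host-3",
}
placement = place_replicas(topology)
print(placement)
```

A placement that saw only the four VMs could put copies on `vm-a` and `vm-b`, which share `host-1` and would be lost together. Grouping by host first rules that out.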
Even if only a small amount of data were lost, it might still be enough to freeze a complex query running on a Hadoop cluster.
While Hadoop may be used in one or two projects at a company today, it could become a fixture of data center computing, available to many users, if it ran in a more predictable and manageable virtualized environment. "We're enabling mainstream use of Hadoop" inside the enterprise, Ibarra said.
A virtualization-aware Hadoop could also be offered as a public cloud service, Ibarra continued.
VMware developers will contribute code to extend Hadoop in the Serengeti project so it can run effectively in virtualized environments. The same extensions will be made available to the Apache Hadoop project, which produces the reference version of Hadoop for many Hadoop users. VMware is also inviting other Hadoop vendors, including Cloudera, GreenPlum, Hortonworks, MapR, and IBM, to participate in Serengeti and make use of its extensions.
Asked why not just contribute "virtualization aware" code to the core Apache project, Ibarra said that project was focused on producing the optimum, bare-bones Hadoop functionality, while VMware was concerned with coming up with a version of Hadoop that would be generally ready to install and run in virtualized environments. "That's why we're doing a new open source project," he said.
The Apache project is likely to pick up the "virtualization aware" extensions, but they can also be made available directly to Cloudera, Hortonworks, GreenPlum, etc. to incorporate into their tools and systems, he said.
Developers building a new Hadoop-based system would benefit if Hadoop ran in a virtualized environment. They frequently need only a small cluster to get a system started, but as it starts dealing with more and more data, they need an easy means of enlarging their Hadoop cluster to keep it going. More virtual servers could be added to a running Hadoop cluster as long as the pool of virtualized resources was large enough.