It was Hadoop, of course, that took four terabytes of scanned archives from the New York Times and converted them to PDFs for display on the Times' Web site. It accomplished the task in less than 24 hours, using 100 machines in the Amazon EC2 cloud. This was one of the incidents that started to give cloud computing a good name back in 2007.
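For a rough sense of scale, the figures above can be turned into a back-of-the-envelope calculation. The sketch below is illustrative only: it assumes decimal terabytes, an even split of the work across all 100 machines, and the full 24 hours (the job took less, so the rates are lower bounds). The function names are made up for this example and do not come from Hadoop.

```python
# Back-of-the-envelope arithmetic for the New York Times conversion job.
# Assumptions (not from the article): decimal units, evenly divided work,
# and the full 24-hour window, which makes the rates lower bounds.

TOTAL_BYTES = 4 * 10**12       # four terabytes
MACHINES = 100                 # EC2 machines used
MAX_SECONDS = 24 * 60 * 60     # "less than 24 hours"


def per_machine_share_gb(total_bytes: int, machines: int) -> float:
    """Gigabytes each machine handles if the data is split evenly."""
    return total_bytes / machines / 10**9


def min_cluster_rate_mb_s(total_bytes: int, seconds: int) -> float:
    """Lowest aggregate rate (MB/s) that finishes within the time limit."""
    return total_bytes / seconds / 10**6


if __name__ == "__main__":
    share = per_machine_share_gb(TOTAL_BYTES, MACHINES)
    cluster_rate = min_cluster_rate_mb_s(TOTAL_BYTES, MAX_SECONDS)
    print(f"Per-machine share: {share:.1f} GB")
    print(f"Minimum cluster rate: {cluster_rate:.1f} MB/s")
    print(f"Minimum per-machine rate: {cluster_rate / MACHINES:.2f} MB/s")
```

Under those assumptions each machine handled about 40 GB, and the cluster as a whole sustained at least roughly 46 MB/s.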
Mike Olson, CEO of Cloudera and former head of the company behind BerkeleyDB, says the launch of Cloudera Enterprise on June 29 was intended to extend Hadoop beyond skilled Java programmers to a broader set of users. Currently, it takes a programmer to feel comfortable with Hadoop's command-line interface. With Cloudera Enterprise, a Hadoop administrator gets graphical tools to "monitor, manage and control access to a Hadoop cluster," including the means to provision new servers for the cluster, accept user identities supplied by Active Directory or LDAP systems, and connect Hadoop to various systems-monitoring tools, Olson said in an interview.
The goal is to smooth the deployment of Hadoop to take on the task of sorting and managing the masses of data being generated on Web sites, on trading exchanges and in scientific research projects. "Managing hundreds of machines in a cluster is always a problem," Olson said, and Hadoop users need all the help they can get to make use of the growing reams of data available to them.
In effect, Cloudera Enterprise is Cloudera's own production-tested distribution of Hadoop, combined with the tools and user interface the company has layered on top. Cloudera has also rolled other open source code used with Hadoop into the package, such as Pig, a programming language for Hadoop, and Hive, the data warehouse system built on Hadoop.
The announcement of Cloudera Enterprise didn't roil the waters all that much. Cloudera was expected to bring out a front-end set of management tools, and it did so at the Hadoop Summit held at Yahoo on June 29. New users of these tools are likely to push Hadoop forward into a larger presence in cloud computing and in very large Web data handling tasks.
EBay is a major user of Hadoop. Anil Madan, the company's director of engineering for analytics platform development, said Cloudera Enterprise is a welcome addition to his daily task of coping with a mountain of data. "These new tools make it easy to perform critical activities including user access, authorization and lifecycle management of end user jobs," he said in the announcement.
Hadoop is available for free download from the Apache Software Foundation. It is still a young project, though not an incubating one: Hadoop has been a top-level Apache project since 2008. A production version of Hadoop is also distributed free by Yahoo, which makes use of the system itself.