Commentary
2/14/2011
02:53 PM
Charles Babcock

Leading Developers Dismiss Charge That Cloud Is 'Vapor'

Hadoop originator Doug Cutting, NASA CIO Chris Kemp, and other cloud pioneers defended cloud computing during a CTO Forum panel and pointed to an emerging generation of enterprise applications.

For a more basic look at the origins of cloud software, listen to Cutting describe how Hadoop began. Hadoop is software meant to simplify the management of large server clusters and to take advantage of their distributed disk, CPU, and memory in a new way.

In response to an InformationWeek question, he said Hadoop came about "as I was working on Nutch, a crawler-based, whole Web search engine," which built a database of pages on the Web, parsers to assess the pages, and links between pages to set page rank scores.

Cutting developed a mechanism to sort all this information, using a little cluster of 4-5 machines, a system that was "a real pain to operate. Then in December 2004 Google published a paper on MapReduce. The techniques of MapReduce were obviously the solution to our problem. They published a way to build a framework using the algorithms we were using," he noted.

MapReduce is a method of splitting a huge data set and distributing the pieces across a large server cluster. Each server analyzes only the portion of the data distributed to it, so the full data set is processed in parallel across the cluster. A master server then collects the partial answers and reduces them to the one answer sought.
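To make the idea concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using word counting as the job. It is an illustration only, not Hadoop's or Google's implementation: the function names (`split_into_chunks`, `map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made up for this example, and the "servers" are simulated by an ordinary loop rather than real machines.

```python
from collections import defaultdict


def split_into_chunks(records, num_workers):
    """Deal records round-robin into one list per simulated server."""
    chunks = [[] for _ in range(num_workers)]
    for i, record in enumerate(records):
        chunks[i % num_workers].append(record)
    return chunks


def map_phase(chunk):
    """Each server sees only its own chunk and emits (word, 1) pairs."""
    pairs = []
    for line in chunk:
        for word in line.lower().split():
            pairs.append((word, 1))
    return pairs


def shuffle(all_pairs):
    """Group the emitted values by key so each key can be reduced once."""
    grouped = defaultdict(list)
    for key, value in all_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, num_workers=3):
    chunks = split_into_chunks(lines, num_workers)
    # On a real cluster these map calls would run in parallel on separate machines.
    mapped = [map_phase(chunk) for chunk in chunks]
    all_pairs = [pair for pairs in mapped for pair in pairs]
    return reduce_phase(shuffle(all_pairs))


if __name__ == "__main__":
    sample = ["the web is big", "the crawl is slow", "big data"]
    print(word_count(sample))
```

The point of the structure is that `map_phase` never needs to see data outside its own chunk, which is what lets the work spread across many servers.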

"Mike Cafarella and I spent a couple of years developing Hadoop on what was a larger server cluster of 20-40 machines. [Hadoop] ran roughly," he recalled. Although a core system had been built, "it became obvious to me that it needed to run on hundreds or thousands of servers and it needed more than the two of us working half time on a shoestring."

In 2006-07, Cutting and Cafarella supervised the production implementation of Hadoop at Yahoo. At that point, Hadoop combined MapReduce with the Hadoop Distributed File System, which could distribute and store large files in 64-megabyte chunks, allowing masses of data to be retrieved, sorted, and manipulated quickly on a large cluster. Hadoop became an Apache Software Foundation open source project, and for several years Cutting led a large team of developers at Yahoo working on a production version.
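The chunking arithmetic is simple enough to sketch. The snippet below is a rough illustration, not HDFS code: it only computes where the 64-megabyte boundaries would fall in a file of a given size. The name `block_ranges` and the 200-megabyte example file are assumptions for this example; placing, replicating, and tracking the chunks across servers is left out entirely.

```python
MEGABYTE = 1024 * 1024
BLOCK_SIZE = 64 * MEGABYTE  # chunk size cited in the article


def block_ranges(file_size, block_size=BLOCK_SIZE):
    """Return a list of (offset, length) pairs, one per chunk of the file."""
    ranges = []
    offset = 0
    while offset < file_size:
        # The final chunk may be shorter than block_size.
        length = min(block_size, file_size - offset)
        ranges.append((offset, length))
        offset += length
    return ranges


if __name__ == "__main__":
    # A hypothetical 200 MB file splits into three full chunks and one 8 MB remainder.
    for offset, length in block_ranges(200 * MEGABYTE):
        print(offset // MEGABYTE, length // MEGABYTE)
```

Because each chunk is an independent byte range, different servers can store and process different chunks of the same file at the same time.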

Although Hadoop was conceived of as a component of a whole Web search engine, "people started jumping on it for all sorts of other things. There were 20-30 developers vying for time at Yahoo to use Hadoop" in their own projects, including learning about a visitor while he was on the Yahoo site and tracking the effectiveness of advertising on the Web.

Cutting is tall, lean, and of serious mien, compared to the freshly scrubbed faces at Facebook. (If you don't know what I mean, go see "The Social Network.") He seems mature compared to the Web 2.0 generation, as if the stress of competition of an earlier era had more routinely lined the faces of programmers. Indeed, for his work on Hadoop, Cutting has become an elder statesman of open source code. He was elected to the Apache Foundation's board of directors in 2009 and elected president in September of 2010.

While he still actively follows Hadoop developments, he no longer contributes code, he said in an interview after the CTO Forum panel. (He still contributes to the open source search engine project, Lucene.) Much of the development around Hadoop today is about making it easier to use and adding external functionality to the core system. "Hadoop is like the Linux kernel in a couple of ways. The vast majority of the work is outside the kernel," he said. More work is going into Hive, a data warehouse that runs on top of Hadoop; the HBase NoSQL system; and the Pig language for programming Hadoop jobs than into the core itself, he said.

Cutting said he wanted a simple, easy-to-remember name for his system and chose Hadoop, which is not an acronym but the name his child used for a stuffed yellow elephant.
