In response to an InformationWeek question, he said Hadoop came about "as I was working on Nutch, a crawler-based, whole Web search engine," which built a database of pages on the Web, parsers to assess the pages, and links between pages to set page rank scores.
Cutting developed a mechanism to sort all this information, using a little cluster of 4-5 machines, a system that was "a real pain to operate. Then in December 2004 Google published a paper on MapReduce. The techniques of MapReduce were obviously the solution to our problem. They published a way to build a framework using the algorithms we were using," he noted.
MapReduce is a method of taking a huge data set and distributing it across a large server cluster. Each server analyzes only the portion of the data distributed to it, so the whole data set is processed in parallel across the cluster. A master server then collects and reduces all the partial answers into the one answer sought.
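As a toy illustration of the pattern described above (not Hadoop's actual API), a word count can be written as a map step run independently on each chunk, a shuffle that groups intermediate results by key, and a reduce step that combines each group into a final answer:

```python
from collections import defaultdict

def map_phase(chunk):
    # Map: each "server" emits a (word, 1) pair for every word in its chunk.
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    # Shuffle: group all intermediate pairs by key (the word).
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: the master combines each key's values into one result.
    return {key: sum(values) for key, values in groups.items()}

# Three chunks standing in for data distributed across three servers.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts["the"])  # 3
```

In a real cluster the map calls run on separate machines and the framework handles the shuffle over the network; the sequential loop here only sketches the data flow.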
"Mike Cafarella and I spent a couple of years developing Hadoop on what was a larger server cluster of 20-40 machines. [Hadoop] ran roughly," he recalled. Although a core system had been built, "it became obvious to me that it needed to run on hundreds or thousands of servers and it needed more than the two of us working half time on a shoe string."
In 2006, Cutting and Cafarella supervised the production implementation of Hadoop at Yahoo. At that point, Hadoop combined MapReduce with the Hadoop Distributed File System, which could distribute and store large files in 64 megabyte chunks, allowing masses of data to be retrieved, sorted, and manipulated quickly on a large cluster. Hadoop became an Apache Software Foundation open source project, and for several years Cutting led a large team of developers at Yahoo working on a production version.
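The arithmetic behind that 64 MB chunking is simple: a file is split into fixed-size blocks that can be stored and processed on different machines. A minimal sketch (64 MB was HDFS's original default block size; the function name is our own):

```python
BLOCK_SIZE = 64 * 1024 * 1024  # HDFS's original default block size, in bytes

def blocks_needed(file_size):
    # Ceiling division: number of fixed-size blocks required to hold the file.
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE

# A 1 GB (10^9 byte) file spreads across 15 blocks,
# each of which can live on a different server in the cluster.
print(blocks_needed(1_000_000_000))  # 15
```

Because each block is an independent unit, a MapReduce job can assign one map task per block and keep every server in the cluster busy at once.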
Although Hadoop was conceived of as a component of a whole Web search engine, "people started jumping on it for all sorts of other things. There were 20-30 developers vying for time at Yahoo to use Hadoop" in their own projects, including learning about a visitor while he was on the Yahoo site and tracking the effectiveness of advertising on the Web.
Cutting is tall, lean, and of serious mien, compared to the freshly scrubbed faces at Facebook. (If you don't know what I mean, go see "The Social Network.") He seems mature compared to the Web 2.0 generation, as if the stress of competition in an earlier era had more routinely lined the faces of programmers. Indeed, for his work on Hadoop, Cutting has become an elder statesman of open source code. He was elected to the Apache Foundation's board of directors in 2009 and elected president in September 2010.
While he still actively follows Hadoop developments, he no longer contributes code, he said in an interview after the CTO Forum panel. (He still contributes to the open source search engine project, Lucene.) Much of the development around Hadoop today is about making it easier to use and adding external functionality onto the core system. "Hadoop is like the Linux kernel in a couple of ways. The vast majority of the work is outside the kernel," he said. Work is progressing on Hive, a data warehouse that runs on top of Hadoop; HBase, a NoSQL database; and Pig, a language for programming Hadoop jobs, more than on the core itself, he said.
Cutting said he wanted a simple, easy-to-remember name for his system and chose Hadoop, which is not an acronym but the name his child gave to a stuffed yellow elephant.