Google probably has even larger clusters running one of the building blocks of Hadoop, called MapReduce, which originated inside of Google. MapReduce knows where the data you're about to analyze is coming from--which disk drives--and it connects that understanding to a map of the processors available, assigning data processing to the processor closest to its point of origin. This allows a lot of data to flow off of hundreds or thousands of disk drives at a time, hit an analysis point very quickly and produce results, which are aggregated into some grand result, such as the answer to a search query.
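To make that flow concrete, here is a minimal Python sketch of the pattern: place each piece of work on the node that already holds the data, run a "map" step on every chunk, then "reduce" the partial answers into one result. It is not Google's or Hadoop's code. The node names, the chunk contents, the word-count task and the helper names (`assign_local`, `map_chunk`, `reduce_counts`, `run_job`) are all assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical layout: which node's disk holds each chunk of the data set.
CHUNK_LOCATION = {
    "chunk-0": "node-a",
    "chunk-1": "node-b",
    "chunk-2": "node-a",
}

# Hypothetical chunk contents (a real system would read these off disk).
CHUNK_DATA = {
    "chunk-0": "cloud search cloud",
    "chunk-1": "search index",
    "chunk-2": "cloud index index",
}


def assign_local(chunk_location, free_nodes):
    """Send each chunk to the node that stores it when that node is free;
    otherwise fall back to the first free node in sorted order."""
    if not free_nodes:
        raise ValueError("no free nodes to run the job on")
    fallback = sorted(free_nodes)[0]
    return {
        chunk: (home if home in free_nodes else fallback)
        for chunk, home in chunk_location.items()
    }


def map_chunk(text):
    """Map step: count the words inside a single chunk."""
    return Counter(text.split())


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into one overall result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_job(chunk_location, chunk_data, free_nodes):
    """Plan the placement, map every chunk, and reduce the partial results."""
    assignment = assign_local(chunk_location, free_nodes)
    partials = [map_chunk(chunk_data[chunk]) for chunk in assignment]
    return assignment, reduce_counts(partials)


if __name__ == "__main__":
    plan, result = run_job(CHUNK_LOCATION, CHUNK_DATA, {"node-a", "node-b"})
    print(plan)
    print(result.most_common())
```

In this toy version the map steps run one after another in a single process. The point of the real thing is that they run at the same time, each on the machine nearest its data.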
The data would have been put on the disks in the same manner, striped across many drives so that a single data set could be retrieved by pulling a 64-megabyte chunk off each of 1,000 disks. In effect, analysis of 64 gigabytes takes only slightly longer than analysis of 64 megabytes. That's cloud computing. It builds out a large cluster of highly similar servers, manages them as a unit, and operates them as a parallel machine, exploiting distributed memory, distributed processing and distributed storage. And it achieves big results using low-cost parts.
The other part of Hadoop is the Hadoop Distributed File System, which enables very large data sets to be stored on many disks and retrieved using parallel methods.
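The arithmetic behind that claim can be sketched in a few lines of Python. The round-robin placement, the function names and the 100 MB-per-second disk speed are my assumptions, not figures from Google or Yahoo, and the estimate ignores network and coordination overhead, which is where the "slightly longer" comes from.

```python
def stripe(total_mb, chunk_mb, num_disks):
    """Spread fixed-size chunks across disks round-robin.

    Returns {disk_index: [chunk_index, ...]}.
    """
    num_chunks = -(-total_mb // chunk_mb)  # ceiling division
    layout = {disk: [] for disk in range(num_disks)}
    for chunk in range(num_chunks):
        layout[chunk % num_disks].append(chunk)
    return layout


def read_seconds(total_mb, chunk_mb, num_disks, mb_per_sec):
    """Estimate the time to read the whole data set when all disks work
    in parallel: the job finishes when the busiest disk finishes.
    Treats every chunk as full-size, so it is an upper bound."""
    layout = stripe(total_mb, chunk_mb, num_disks)
    busiest = max(len(chunks) for chunks in layout.values())
    return busiest * chunk_mb / mb_per_sec


if __name__ == "__main__":
    # Assumed disk speed of 100 MB/s, chosen only to make the numbers concrete.
    print(read_seconds(64_000, 64, 1000, 100))  # 64 GB striped over 1,000 disks
    print(read_seconds(64_000, 64, 1, 100))     # the same 64 GB on a single disk
    print(read_seconds(64, 64, 1000, 100))      # a lone 64 MB chunk
```

Under these assumptions the striped 64 gigabytes and the lone 64-megabyte chunk both come off the disks in well under a second, while the same 64 gigabytes on one drive takes more than 10 minutes.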
I believe, but do not know for sure, that MapReduce-style operations are the secret to the marvelous Google search engine, which achieves so much for each user in about a second's worth of processing.
But search and the related job of indexing the Web are not the only tasks that Hadoop (and MapReduce) is good for. Hadoop can make the Web more personal by analyzing the activity of individual visitors to Yahoo and then serving them the ads best suited to their interests, making such advertising less hit-or-miss. A relational database, on the other hand, is good for more precise tasks that consume structured data. Hadoop is more the baleen whale of databases, taking in masses of unstructured material at a gulp without too much discrimination.
Hadoop and other cloud software have another important characteristic not found in the enterprise world. If it's going to run on a large cluster, then it's going to experience hardware component failures that can't be allowed to bring the whole operation to a grinding halt. So Hadoop doesn't keep one copy of the data but two or three. It recognizes a hardware failure when one occurs and, far from shutting things down, turns to a replicated copy and tells another processor to pick up the workload.
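Here is a minimal Python sketch of that behavior: keep several copies of each block on different nodes, and when a node dies, hand the work to a node holding a surviving copy. The round-robin placement rule and the names `place_replicas`, `pick_node` and `run_with_failover` are hypothetical, chosen for illustration. Hadoop's actual placement and recovery logic is considerably more involved.

```python
def place_replicas(block_ids, nodes, replicas=3):
    """Put each block on `replicas` distinct nodes, round-robin.

    Returns {block_id: [node, ...]}.
    """
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replicas)]
    return placement


def pick_node(block, placement, failed):
    """Return the first node that still holds a live copy of the block."""
    for node in placement[block]:
        if node not in failed:
            return node
    raise RuntimeError(f"all replicas of {block} are lost")


def run_with_failover(block_ids, placement, failed):
    """Decide which node processes each block, skipping failed nodes."""
    return {block: pick_node(block, placement, failed) for block in block_ids}


if __name__ == "__main__":
    nodes = ["n0", "n1", "n2", "n3"]
    blocks = ["b0", "b1", "b2"]
    placement = place_replicas(blocks, nodes, replicas=3)
    print(run_with_failover(blocks, placement, failed=set()))
    # Two nodes die; every block still has a surviving copy elsewhere.
    print(run_with_failover(blocks, placement, failed={"n0", "n1"}))
```

With three copies spread over four nodes, the job in this sketch survives the loss of any two machines. It fails only when every node holding a given block is gone.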
This is how "clouds" on the Internet differ from clusters in the enterprise. They are engineered to tolerate hardware failure in the software. Fault tolerance has been kicked upstairs from the hardware to the software, where it's cheaper to supply if you just throw enough inexpensive hardware at it. In this, cloud software resembles the Internet itself, which was designed to tolerate hardware failure by detecting and routing around it. The Internet keeps running, no matter what. Hadoop is designed to do the same, and future cloud software will share this characteristic.
Who cares? Well, "half the start-ups in the Silicon Valley use Hadoop on Amazon (Amazon's EC2)," said Eric Baldeschwieler, Yahoo's VP of Hadoop software development, Nov. 3 at the Cloud Computing Conference & Expo in Santa Clara. They might teach you more about Hadoop, if they find a way to use it disruptively for competitive advantage against your company.
In writing about Hadoop earlier this week, I cited Yahoo's use of Hadoop and noted that Yahoo makes its tested production version of Hadoop available for free, a boon to mankind and to those who wish to use cloud-style data analysis. It should also be noted that Yahoo invests in Hadoop's continued development and contributes that work to the Apache open source Hadoop code base. Of roughly 20 committers to the incubator project, 11 work at Yahoo. Yahoo offers no public cloud resource for rent by the hour, as Amazon does with EC2 and Google does with Google AppEngine. Instead, Yahoo is concentrating on building services on top of an internal, private cloud that is still being built out.
How is the cloud different from predecessor forms of computing? It is an evolutionary outgrowth of them, with potentially revolutionary results. "Cloud is a promise… a long journey," said Surendra Reddy, VP of Yahoo's Integrated Cloud and Virtualization Group. And that journey has just begun.