So you want to build a Hadoop cluster, load up everything you know about your business, and start delving your customers' deepest wants and needs?
There's a lot of expertise available to help you set up a cluster, get it running, and start pounding out Java code for your MapReduce functions. Unfortunately, however, you're largely on your own when it comes to figuring out the optimal server configuration.
In a sense, that's not surprising. The newness of Hadoop means most focus is on software considerations; we're talking such fundamental notions as storing data where the processing will happen and then writing queries in Java -- or interpreted SQL using Hive or HBase or some other option. It's on understanding abstraction, since Hadoop lives in Java VMs running on top of Linux, often on top of Xen or VMware. It's on learning a whole new management paradigm. All these concepts come before tuning the Hadoop cluster, and that makes for a strategy so removed from the physical layer that there's little consensus on hardware specs. That's a problem. Server selection is important to your outcome, and it's certainly the most expensive part of the project.
Growing pains Hadoop, as a commercial product, is only a few years old. Given that, its uptake is nothing short of amazing. In InformationWeek's 2013 Big Data Survey, we asked about usage rates for 14 tools. While IBM, Microsoft, and Oracle finished ahead of Hadoop, 14% of respondents said they use Hadoop/MapReduce for their big data needs. It landed at least 6 percentage points ahead of Sybase IQ, Teradata EDW, EMC Greenplum DCA, and HP Vertica. That's especially impressive given that, where those products are delivered as preintegrated systems, Hadoop demands a lot of engineering.
Choosing the wrong hardware can make that engineering much more difficult than it needs to be. Before cutting a purchase order, IT teams must address three key architectural questions: What comprises a Hadoop cluster configuration, and how might you adjust based on your expected workloads? Is it worth considering an architecture that's not commodity, based on x86 and Linux? And finally, how does Hadoop 2.0 change things?
Remember, Hadoop traces its roots to Internet search engine companies -- Yahoo for Hadoop, Google for MapReduce. Their primary need is for computationally lightweight and very fast algorithms. And yes, these companies think in terms of many thousands of servers, so "optimization" means something very different for Google than for a typical enterprise. Heck, these guys build data centers near power plants to get a break on electrical costs, and they custom design servers and have them built to spec. Still, enterprise CIOs can learn a few things.
First, for search providers, the bottleneck is typically not the CPU. Rather, they're dealing with a throughput problem -- hit every URL on the web and sort matches for the best response to a particular query. That tells us that the trick is finding a high-throughput system that has a good CPU engine while remaining cost-effective, in both initial price and power consumption.
The problem is that today's state-of-the-art chips typically chew up about 130 watts, for Intel's line -- more in the RISC line. They're typically clocked as fast as possible to achieve maximum per-core performance, and vendors load each chip with 12 to 16 cores. Intel's current top offerings run at about 3.4 GHz with 15 cores per socket, and the ability to address up to 1.5 TB of memory per socket.
Systems like that are not what search engine companies buy. Rather, these chips are designed to work in servers with as many as 32 sockets and compete with proprietary Unix systems like HP Itanium, IBM Power8, and Sun/Oracle Sparc T5. At least at their maximum chip count configurations, they solve different problems than Hadoop clusters.
Now, if you ratchet back about 1 GHz from the clock speed, going from 3.5 to 2.5 GHz, and drop the number of cores from 15 to either eight or six in Intel's Ivy Bridge line, two good things happen. First, the retail price of the chip drops from about $6,000 to $600, and second, power consumption ebbs, from 155 watts to 60 to 70 watts per socket. Systems in this midrange category will use, for example, a quad-channel DDR-3 memory configuration running at 1.6 or 1.866 GHz. Quad-core chips in the Ivy Bridge line are affordable, and that's reflected in the server cost.
A typical server appropriate for an Internet search company is equivalent to a 2U system with eight to 12 3-TB to 4-TB SATA drives, a total memory configuration in the 96 GB range, and either four 1-Gbps or two 10-Gbps Ethernet ports. Ask your incumbent server vendor for such a system and you'll likely find it doesn't support that many drives in a 2U box, and it'll want to include RAID controllers as well as management cards and redundant power supplies.
Don't bite -- Hadoop takes care of failures at the software level, so those RAID controllers, redundant power supplies, and out-of-band management systems are just extra costs. The place to spend money in a Hadoop cluster is for fairly fat memory configurations as well as enterprise-class hard drives, since they're the top source of failure.
If you're used to the drive power demand rates of a few years ago, the idea of stuffing 12 or more drives into a system would seem to dwarf concerns about processor power consumption. That's not true today. Typical 3-TB or 4-TB drives running at 5,400 to 7,200 RPMs will consume less than 10 watts in fully active mode and less than 8 watts when idle. Since reading from disk is the biggest source of latency and bottlenecks, it's worth considering 7,200-RPM drives. Packing nearly 100 GB of memory into the server will take some power, too. But because RAM speed is improved by better fabrication processes, you'll find that faster RAM doesn't necessarily use more power. In fact, power usage for RAM is typically 0.5 to 1 watt per gigabyte. So the worst case for our 96-GB configuration is an additional 96 watts. Networking too will consume power -- a dual 10-Gbps Ethernet card could draw as much as 16 watts, though 10 watts is more common. The typical quad 1-Gbps Ethernet card uses about half that much.
What's in a name (node)? The commodity setup we recommend above applies only to the data nodes in your Hadoop 1.0 system. The name node, in contrast, is a single point of failure. Its loss would be very bad news for your Hadoop cluster. Therefore, it should be treated more like a typical enterprise server and storage system. You might trust Hadoop's redundancy to protect data at the data node, but at the name node, you need that RAID protection, backups, and if possible a high-availability arrangement with a second passive server.
There are many ways to solve the name node problem, including use of an external SAN with typical data protection characteristics, like snapshot capability and regular backups. The system itself doesn't need to be more powerful than the data nodes, but there can be an advantage to faster networking, so even if you use 10-Gbps Ethernet nowhere else, it may be worthwhile on the name node.