Hadoop software is designed to orchestrate massively parallel processing on relatively low-cost servers that pack plenty of storage close to the processing power. All the power, reliability, redundancy, and fault tolerance are built into the software, which distributes the data and processing across tens, hundreds, or even thousands of "nodes" in a clustered server configuration.
Those nodes are "industry standard" x86 servers that cost $2,500 to $15,000 each, depending on CPU, RAM, and disk choices. They're usually middle-of-the-road servers in terms of performance specs. A standard DataNode (a.k.a. Worker node) server, for example, is typically a 2U rack server with two Intel Sandy Bridge or Ivy Bridge CPU sockets and a total of 12 cores. Each CPU is typically fitted with 64 GB to 128 GB of RAM. DataNodes usually have a dozen 2-TB or 3-TB 3.5-inch hard drives in a JBOD (just a bunch of disks) configuration. [Editor's note: The upper end of the price range quoted above was raised to $15,000 (from $5,000) per server to reflect the inclusion of 12 high-capacity drives in addition to (typically) two standard disks per server.]
Companies seeking a bit more performance, for Spark in-memory analysis or Cloudera Impala, for example, might choose slightly higher clock speeds and 256 GB or more of RAM per CPU, while those seeking maximum capacity are choosing 4-TB hard drives.
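To put those drive counts and prices in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes HDFS's default replication factor of 3; the 25% reserve for temporary and intermediate data is an illustrative assumption, not a figure from any vendor, and the `NodeSpec` class and function names are hypothetical. Pairing the $15,000 price with 3-TB drives is likewise an assumption made only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Hypothetical description of one DataNode's data disks and price."""
    drives: int        # number of JBOD data drives
    drive_tb: float    # capacity of each drive in TB
    price_usd: float   # purchase price of the server

    @property
    def raw_tb(self) -> float:
        """Raw disk capacity of a single node."""
        return self.drives * self.drive_tb


def usable_tb(node: NodeSpec, nodes: int, replication: int = 3,
              reserve_fraction: float = 0.25) -> float:
    """Estimate how much unique data a cluster can hold.

    replication=3 is the HDFS default. reserve_fraction is an assumed
    allowance for temporary/intermediate data, not a published figure.
    """
    raw = node.raw_tb * nodes
    return raw * (1 - reserve_fraction) / replication


def cost_per_usable_tb(node: NodeSpec, nodes: int, **kwargs) -> float:
    """Server hardware cost divided by estimated usable capacity."""
    return node.price_usd * nodes / usable_tb(node, nodes, **kwargs)


if __name__ == "__main__":
    # Assumed example: twelve 3-TB drives at the top of the quoted price range.
    node = NodeSpec(drives=12, drive_tb=3.0, price_usd=15_000)
    print(f"raw per node: {node.raw_tb:.0f} TB")
    print(f"usable across 20 nodes: {usable_tb(node, 20):.0f} TB")
    print(f"cost per usable TB: ${cost_per_usable_tb(node, 20):,.0f}")
```

Under these assumptions, a 20-node cluster with 36 TB of raw disk per node holds roughly 180 TB of unique data, which is why drive size matters so much to capacity-focused buyers. Switches, racks, power, and management nodes are left out of the cost figure.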
Management nodes running Hadoop's NameNode (which coordinates data storage) and JobTracker (which coordinates data processing) require less storage but benefit from more reliable power supplies, enterprise-grade disks, RAID redundancy, and a bit more RAM. Connecting the nodes together is a job for redundant 10-Gigabit Ethernet or InfiniBand switches.
It's not uncommon for huge Fortune 100 companies to buy so-called whitebox servers from no-name OEMs for less than $2,500 a crack in high volumes, but it's more typical for the average enterprise to work with Tier 1 vendors such as Cisco, Dell, HP, and IBM. All of these manufacturers now offer servers specifically configured for Hadoop reference architectures for Cloudera, Hortonworks, MapR, and other Hadoop distributions.
Hadoop practitioners often build their own clusters, but appliances have emerged over the last two years offering the convenience of buying everything, including preinstalled software, from a single supplier. EMC's Greenplum division, since spun off as part of Pivotal, was the first to offer a Hadoop appliance, but Oracle, Teradata, IBM, and Microsoft have since followed suit. Appliances may require a minimum half-rack commitment, so they may not be ideal for initial experimentation. Several appliances have fringe benefits, including shared management software and analytic-database or NoSQL-database options.
Not all Hadoop deployments run on middle-of-the-road hardware. Cray and SGI have options to deploy Hadoop on high-performance computing clusters. And with big data being, by definition, a power-intensive pursuit, experiments are underway with low-power servers and next-generation ARM chips that may lure at least some Hadoop users away from the hegemony of x86 servers.
Read on for a look at the server and appliance offerings dominating Hadoop deployments, as well as a few of the fringe offerings bringing new twists to the world of Hadoop hardware.