Big Data Debate: Will HBase Dominate NoSQL?
HBase offers both scalability and the economy of sharing the same infrastructure as Hadoop, but will its flaws hold it back? NoSQL experts square off.
HBase is modeled after Google BigTable and is part of the world's most popular big data processing platform, Apache Hadoop. But will this pedigree guarantee HBase a dominant role in the competitive and fast-growing NoSQL database market?
Michael Hausenblas of MapR argues that Hadoop's popularity and HBase's scalability and consistency ensure success. The growing HBase community will surpass other open source communities, he says, and work out the few technical wrinkles that remain.
Jonathan Ellis of DataStax, the support provider behind open-source Cassandra, argues that HBase's flaws are too numerous, and too deeply rooted in Hadoop's HDFS architecture, to overcome. These flaws will forever limit HBase's applicability to high-velocity workloads, he says.
Read what our two NoSQL experts have to say, and then weigh in with your opinion in the comments section below.
For The Motion
Michael Hausenblas
Chief Data Engineer EMEA, MapR Technologies
Integration With Hadoop Will Drive Adoption
The answer to the question is a crystal-clear "Yes, but…"
In order to appreciate this response, we need to step back a bit and understand the question in context. Both Mike Stonebraker, in 2005, and Martin Fowler, in 2011, made the polyglot persistence argument that "one size does not fit all."
Hence, I'm going to interpret the "dominant" in the question not in the sense of the market-share measures applied to relational databases over the past 10 years, but along the lines of: "Will Apache HBase be used across a wider range of use cases and have a bigger community behind it than other NoSQL databases?"
This is a bold assertion, given that there are more than 100 different NoSQL options to choose from, including MongoDB, Riak, Couchbase, Cassandra and many, many others. But in this big-data era, the trend is away from specialized information silos and toward large-scale processing of varied data, so even a popular solution such as MongoDB will be surpassed by HBase.
Why? MongoDB has well-documented scalability issues, and with the fast-growing adoption of Hadoop, the NoSQL solution that integrates directly with Hadoop has a marked advantage in both scale and popularity. HBase has a huge and diverse community behind it in all respects: users, developers, multiple commercial vendors and cloud availability, the last through Amazon Web Services (AWS), for example.
Historically, HBase and Cassandra have a lot in common. HBase was created in 2007 at Powerset (later acquired by Microsoft), began life as part of Hadoop and then became an Apache top-level project. Cassandra originated at Facebook in 2007, was open sourced, incubated at Apache and is nowadays also a top-level project. Both HBase and Cassandra are wide-column key-value datastores that excel at ingesting and serving huge volumes of data while remaining horizontally scalable, robust and elastic.
There are philosophical differences in the architectures: Cassandra borrows many design elements from Amazon's Dynamo system, has an eventual consistency model and is write-optimized, while HBase is a Google BigTable clone that is read-optimized and strongly consistent. An interesting proof point for the superiority of HBase is the fact that Facebook, the creator of Cassandra, replaced Cassandra with HBase for its own internal use.
From an application developer's point of view, HBase is preferable because it offers strong consistency, which makes life easier. One of the misconceptions about eventual consistency is that it improves write speed: under sustained write traffic, latency suffers anyway, and one ends up paying the "eventual consistency tax" without getting its benefits.
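To make that concrete, here is a minimal sketch using the standard HBase Java client (the table, column family and qualifier names are hypothetical). Because all reads and writes for a row go through the single RegionServer owning that row, the get below is guaranteed to see the preceding put, with no quorum tuning or read repair involved:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ReadYourWrites {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "users");  // hypothetical table

            // The put is acknowledged only after it is durable in the
            // region's write-ahead log.
            Put put = new Put(Bytes.toBytes("row-42"));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("email"),
                    Bytes.toBytes("user@example.com"));
            table.put(put);

            // Strong consistency: this read is served by the same
            // RegionServer and is guaranteed to see the write above.
            Result r = table.get(new Get(Bytes.toBytes("row-42")));
            System.out.println(Bytes.toString(
                    r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("email"))));
            table.close();
        }
    }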
There are some technical limitations in almost all NoSQL solutions, such as compactions that undermine consistently low latency, an inability to shard automatically, reliability issues and long recovery times after node outages. Here at MapR, we've created a "next version" of enterprise HBase that includes instant recovery, seamless sharding and high availability, and that gets rid of compactions. We brought it to general availability as M7 in May 2013, and it's available in the cloud via AWS Elastic MapReduce.
Last but not least, HBase has -- through its legacy as a Hadoop subproject -- deep, solid integration with the entire Hadoop ecosystem, including Apache Hive and Apache Pig.
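As one illustration of that integration, the sketch below wires an HBase table directly into a Hadoop MapReduce job via the client's TableMapReduceUtil; the table name and the trivial row-counting mapper are hypothetical placeholders, and Hive and Pig plug in through analogous storage handlers:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class HBaseRowCount {
        // Mapper whose input records are HBase rows, read region by region.
        static class RowCounter extends TableMapper<Text, IntWritable> {
            @Override
            protected void map(ImmutableBytesWritable key, Result row, Context ctx)
                    throws java.io.IOException, InterruptedException {
                ctx.write(new Text("rows"), new IntWritable(1));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "hbase-row-count");
            job.setJarByClass(HBaseRowCount.class);

            Scan scan = new Scan();
            scan.setCaching(500);        // more rows per RPC for scan throughput
            scan.setCacheBlocks(false);  // don't churn the block cache from MR

            // One call hands the whole table to Hadoop as the job's input.
            TableMapReduceUtil.initTableMapperJob("events", scan,
                    RowCounter.class, Text.class, IntWritable.class, job);
            job.setNumReduceTasks(0);                        // map-only sketch
            job.setOutputFormatClass(NullOutputFormat.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }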
To summarize: HBase will be the dominant NoSQL platform for use cases that require fast, small updates and lookups at scale. Recent innovations have also delivered architectural advantages that eliminate compactions and provide truly decentralized coordination.
Michael Hausenblas is chief data engineer, EMEA, at MapR Technologies. His background is in large-scale data integration research and development, advocacy and standardization.
Against The Motion
Jonathan Ellis
Co-founder & CTO, DataStax
HBase Is Plagued By Too Many Flaws
NoSQL includes several specialties such as graph databases and document stores where HBase does not compete, but even within its category of partitioned row store, HBase lags behind the leaders. The technical shortcomings driving HBase's lackluster adoption fall into two major categories: engineering problems that can be addressed given enough time and manpower, and architectural flaws that are inherent to the design and cannot be fixed.
Engineering Problems
-- Operations are complex and failure-prone. Deploying HBase means configuring, at a minimum, a ZooKeeper ensemble, a primary HMaster, a secondary HMaster, RegionServers, an active NameNode, a standby NameNode, the HDFS quorum journal manager and DataNodes. Installation can be automated, but if the system is too difficult to install without help, how are you going to troubleshoot it when something goes wrong during, for instance, a RegionServer failover or a lower-level NameNode failure? HBase requires substantial expertise just to know what to monitor, and God help you if you need regular backups.
-- RegionServer failover takes 10 to 15 minutes. HBase partitions rows into regions, each managed by a RegionServer. The RegionServer is a single point of failure for its regions; when it goes down, a new one must be selected, and the write-ahead logs must be replayed, before reads or writes can be served again.
-- Developing against HBase is painful. HBase's API is clunky and Java-centric. Non-Java clients are relegated to the second-class Thrift or REST gateways. Contrast that with the Cassandra Query Language, which offers developers a familiar, productive experience in every language, as the sketch after this list illustrates.
-- The HBase community is fragmented. The Apache mainline is widely understood to be unstable, so Cloudera, Hortonworks and advanced users maintain their own patch trees on top of it. Leadership is divided, and there is no clear road map. By contrast, the open-source Cassandra community includes committers from DataStax, Netflix, Spotify, Blue Mountain Capital and others, all working together without cliques or forks.
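To illustrate the developer-experience gap called out above, here is the same single-row write sketched both ways in Java; the table names, keyspace and contact point are hypothetical, the HBase steps are summarized in comments, and the CQL statement goes through the DataStax Java driver:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class CqlVersusHBase {
        public static void main(String[] args) {
            // HBase equivalent (see the earlier client sketch): build a Put,
            // hand-convert the row key, column family, qualifier and value
            // to byte[], then table.put(put) -- several lines of byte
            // juggling for one logical write, and only comfortable from Java.

            // CQL: one readable statement, identical from any language.
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")       // hypothetical node
                    .build();
            Session session = cluster.connect("demo");  // hypothetical keyspace
            session.execute(
                    "INSERT INTO users (id, email) VALUES (42, 'a@example.com')");
            cluster.close();
        }
    }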
Overall, the engineering gap between HBase and other NoSQL platforms has only widened in the time I've been observing the NoSQL ecosystem. When I first evaluated them, I would have put HBase six months behind Cassandra in engineering progress; today, that gap has grown to about two years.
Architectural Flaws
-- Master-oriented design makes HBase operationally inflexible. Routing all reads and writes for a region through a single RegionServer means that active/active asynchronous replication across multiple datacenters is not possible for HBase, nor can you separate workloads across different replicas in a cluster. By contrast, Cassandra's peer-to-peer replication allows seamless integration of Hadoop, Solr and Cassandra with no ETL, while letting you opt in to lightweight transactions in the rare cases where you need linearizability (see the sketch after this list).
-- Failover means downtime. Even one minute of downtime is simply not acceptable in many applications, and this is an intrinsic problem with HBase's design: each RegionServer is a single point of failure. A fully distributed design, by contrast, means that when one replica goes down, there is no need for special-case histrionics to recover; the system keeps functioning normally on the remaining replicas and can catch the failed one up later.
-- HDFS is primarily designed for streaming access to large files. HBase is built on a distributed file system optimized for batch analytics, and that is directly responsible for HBase's poor performance, particularly for reads and particularly on solid-state disks. Just as relational databases haven't been able to optimize b-tree engines designed 30 years ago for pre-big-data workloads, HDFS won't be able to undo the trade-offs it made for what remains its primary purpose, or close the gap on critical functionality:
-- Mixing solid-state and hard disks in a single cluster and pinning tables to workload-appropriate media.
-- Snapshots, incremental backups, and point-in-time recovery.
-- Compaction throttling to avoid spikes in application response time.
-- Dynamically routing requests to the best-performing replicas.
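To make the lightweight-transaction point above concrete, here is a hedged sketch using the DataStax Java driver (the keyspace, table and contact point are hypothetical): an ordinary CQL write becomes a linearizable compare-and-set by adding an IF clause, which triggers a Paxos round among the replicas so that exactly one of two racing writers wins.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class LightweightTransaction {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1").build();  // hypothetical node
            Session session = cluster.connect("demo");      // hypothetical keyspace

            // IF NOT EXISTS upgrades this eventually consistent write to a
            // linearizable compare-and-set; the special [applied] column
            // reports whether this client won the race.
            Row row = session.execute(
                    "INSERT INTO users (id, email) " +
                    "VALUES (42, 'a@example.com') IF NOT EXISTS").one();
            System.out.println(row.getBool("[applied]")
                    ? "registered" : "id already taken");
            cluster.close();
        }
    }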
The same design that makes HBase's foundation, HDFS, a good fit for batch analytics will ensure that it remains inherently unsuited for the high-velocity, random-access workloads that characterize the NoSQL market.
Jonathan Ellis is chief technology officer and co-founder at DataStax, where he sets the technical direction and leads Apache Cassandra as project chair.