Big Data Debate: Will HBase Dominate NoSQL? - InformationWeek

8/5/2013
04:28 PM


HBase offers both scalability and the economy of sharing the same infrastructure as Hadoop, but will its flaws hold it back? NoSQL experts square off.

HBase is modeled after Google BigTable and is part of the world's most popular big data processing platform, Apache Hadoop. But will this pedigree guarantee HBase a dominant role in the competitive and fast-growing NoSQL database market?

Michael Hausenblas of MapR argues that Hadoop's popularity and HBase's scalability and consistency ensure success. The growing HBase community will surpass other open-source movements and will overcome a few technical wrinkles that have yet to be worked out.

Jonathan Ellis of DataStax, the support provider behind open-source Cassandra, argues that HBase flaws are too numerous and intrinsic to Hadoop's HDFS architecture to overcome. These flaws will forever limit HBase's applicability to high-velocity workloads, he says.

Read what our two NoSQL experts have to say, and then weigh in with your opinion in the comments section below.

For The Motion

Michael Hausenblas
Chief Data Engineer EMEA, MapR Technologies

Integration With Hadoop Will Drive Adoption

The answer to the question is a crystal-clear "Yes, but…"

To appreciate this response, we need to step back a bit and understand the question in context. Both Martin Fowler, in 2011, and Mike Stonebraker, in 2005, took up the polyglot persistence argument that "one size does not fit all."

Hence, I'm going to interpret the "dominant" in the question not in the sense of the market-share measures applied to relational databases over the past 10 years, but along the line of, "Will Apache HBase be used across a wider range of use cases and have a bigger community behind it than other NoSQL databases?"

This is a bold assertion given that there are more than 100 different NoSQL options to choose from, including MongoDB, Riak, Couchbase, Cassandra and many, many others. But in this big-data era, the trend is away from specialized information silos to large-scale processing of varied data, so even a popular solution such as MongoDB will be surpassed by HBase.

Why? MongoDB has well-documented scalability issues, and with the fast-growing adoption of Hadoop, the NoSQL solution that integrates directly with Hadoop has a marked advantage in scale and popularity. HBase has a huge and diverse community under its belt in all respects: users, developers, multiple commercial vendors and availability in the cloud, the last through Amazon Web Services (AWS), for example.

Historically, HBase and Cassandra have a lot in common. HBase was created in 2007 at Powerset (later acquired by Microsoft); it was initially part of Hadoop and then became a top-level project. Cassandra originated at Facebook in 2007, was open sourced and then incubated at Apache, and is nowadays also a top-level project. Both HBase and Cassandra are wide-column key-value datastores that excel at ingesting and serving huge volumes of data while being horizontally scalable, robust and elastic.

There are philosophical differences in the architectures: Cassandra borrows many design elements from Amazon's Dynamo system, has an eventual consistency model and is write-optimized, while HBase is a Google BigTable clone with read optimization and strong consistency. An interesting proof point for the superiority of HBase is the fact that Facebook, the creator of Cassandra, replaced Cassandra with HBase for its internal use.

From an application developer's point of view, HBase is preferable as it offers strong consistency, making life easier. One of the misconceptions about eventual consistency is that it improves write speed: given sustained write traffic, latency is affected and one ends up paying the "eventual consistency tax" without getting its benefits.
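For readers less familiar with the distinction, the difference between eventual and strong consistency in a replicated store is often explained with simple quorum arithmetic. The following is a minimal sketch of that general rule, not code from HBase or Cassandra; the function name and the parameters n, r and w are illustrative assumptions. If the number of replicas read plus the number of replicas that must acknowledge a write exceeds the replica count, every read quorum shares at least one replica with every write quorum.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """Return True when any read quorum must share a replica with any write quorum.

    n: replicas per row, r: replicas consulted per read, w: replicas that
    must acknowledge a write. All three names are illustrative assumptions.
    """
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must each be between 1 and n")
    return r + w > n


if __name__ == "__main__":
    # Hypothetical settings for a three-replica deployment.
    for r, w in [(1, 1), (2, 2), (1, 3)]:
        kind = "overlapping" if quorums_overlap(3, r, w) else "non-overlapping"
        print(f"n=3 r={r} w={w}: {kind} quorums")
```

With non-overlapping quorums a read may land only on replicas that have not yet received the latest write, which is the situation the term "eventual consistency" describes.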

There are some technical limitations with almost all NoSQL solutions, like compactions affecting consistent low latency, inability to shard automatically, reliability issues and long recovery times for node outages. Here at MapR, we've created a "next version" of enterprise HBase that includes instant recovery, seamless sharding and high availability, and that gets rid of compactions. We brought it into GA under the label M7 in May 2013 and it's available in the cloud via AWS Elastic MapReduce.

Last but not least, HBase has -- through its legacy as a Hadoop contribution project -- a strong and solid integration into the entire Hadoop ecosystem, including Apache Hive and Apache Pig.

Summarizing, HBase will be the dominant NoSQL platform for use cases where fast and small-size updates and look-ups at scale are required. Recent innovations have also provided architectural advantages to eliminate compactions and provide truly decentralized co-ordination.

Michael Hausenblas is chief data engineer, EMEA, at MapR Technologies. His background is in large-scale data integration research and development, advocacy and standardization.

Against The Motion

Jonathan Ellis
Co-founder & CTO, DataStax

HBase Is Plagued By Too Many Flaws

NoSQL includes several specialties such as graph databases and document stores where HBase does not compete, but even within its category of partitioned row store, HBase lags behind the leaders. The technical shortcomings driving HBase's lackluster adoption fall into two major categories: engineering problems that can be addressed given enough time and manpower, and architectural flaws that are inherent to the design and cannot be fixed.

Engineering Problems

-- Operations are complex and failure prone. Deploying HBase involves configuring at a minimum a ZooKeeper ensemble, primary HMaster, secondary HMaster, RegionServers, active NameNode, standby NameNode, HDFS quorum journal manager and DataNodes. Installation can be automated, but if it's too difficult to install without help, how are you going to troubleshoot it when something goes wrong during, for instance, RegionServer failover or a lower-level NameNode failure? HBase requires substantial expertise to even know what to monitor, and God help you if you need regular backups.

-- RegionServer failover takes 10 to 15 minutes. HBase partitions rows into regions, each managed by a RegionServer. The RegionServer is a single point of failure for its region; when it goes down, a new one must be selected and write-ahead logs must be replayed before writes or reads can be served again.

-- Developing against HBase is painful. HBase's API is clunky and Java centric. Non-Java clients are relegated to the second-class Thrift or REST gateways. Contrast that with the Cassandra Query Language, which offers developers a familiar, productive experience in all languages.

-- The HBase community is fragmented. The Apache mainline is widely understood to be unstable. Cloudera, Hortonworks, and advanced users maintain their own patch trees on top. Leadership is divided and there is no clear roadmap. Conversely, the open-source Cassandra community includes committers from DataStax, Netflix, Spotify, Blue Mountain Capital, and others working together without cliques or forks.
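The RegionServer failover point above rests on one mechanism: edits are logged before they are applied, and a replacement server has to replay that log before it can serve the region. The toy model below is a sketch under stated assumptions, not HBase code; the class ToyRegion, its methods and the demo function are hypothetical names, and the log is an in-memory list standing in for a durable write-ahead log.

```python
class ToyRegion:
    """In-memory stand-in for one region: a write-ahead log plus a memstore."""

    def __init__(self, wal=None):
        # The log is assumed to survive a crash; the memstore is not.
        self.wal = wal if wal is not None else []
        self.memstore = {}
        self.online = False

    def recover(self):
        """Replay every logged edit in order, then start serving."""
        for row, value in self.wal:
            self.memstore[row] = value
        self.online = True
        return len(self.wal)

    def put(self, row, value):
        if not self.online:
            raise RuntimeError("region offline")
        self.wal.append((row, value))  # log first, then apply
        self.memstore[row] = value

    def get(self, row):
        if not self.online:
            raise RuntimeError("region offline")
        return self.memstore.get(row)


def demo_failover():
    primary = ToyRegion()
    primary.recover()  # empty log, comes online immediately
    primary.put("row1", "a")
    primary.put("row1", "b")
    primary.put("row2", "c")

    # Simulated crash: only the log is handed to the replacement.
    replacement = ToyRegion(wal=list(primary.wal))
    try:
        replacement.get("row1")
        served_before_replay = True
    except RuntimeError:
        served_before_replay = False

    replayed = replacement.recover()
    return served_before_replay, replayed, replacement.get("row1")


if __name__ == "__main__":
    print(demo_failover())
```

In this sketch the region is unavailable for exactly as long as replay takes, so recovery time grows with the amount of unflushed log, which is the window of unavailability the failover bullet refers to.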

Overall, the engineering gap between HBase and other NoSQL platforms has increased since I've been observing the NoSQL ecosystem. When I first evaluated them, I would have put HBase six months behind Cassandra in engineering progress, but today that gap has widened to about two years.

Architectural Flaws

-- Master-oriented design makes HBase operationally inflexible. Routing all reads and writes through the RegionServer master means that active/active asynchronous replication across multiple datacenters is not possible for HBase, nor can you perform workload separation across different replicas in a cluster. By contrast, Cassandra's peer-to-peer replication allows seamless integration of Hadoop, Solr and Cassandra with no ETL while allowing you to opt in to lightweight transactions in the rare cases when you need linearizability.

-- Failover means downtime. Even one minute of downtime is simply not acceptable in many applications, and this is an intrinsic problem with HBase's design; each RegionServer is a single point of failure. A fully distributed design instead means that when one replica goes down, there is no need for special-case histrionics to recover; the system keeps functioning normally with the other replicas and can catch up the failed one later.

-- HDFS is primarily designed for streaming access to large files. HBase is built on a distributed file system optimized for batch analytics. This is directly responsible for HBase's poor performance, particularly for reads, and particularly on solid-state disks. Just as relational databases haven't been able to optimize B-tree engines designed 30 years ago for pre-big-data workloads, HDFS won't be able to undo the tradeoffs it made for what is still its primary purpose and close the gap on critical functionality:

-- Mixing solid state and hard disks in a single cluster and pinning tables to workload-appropriate media.

-- Snapshots, incremental backups, and point-in-time recovery.

-- Compaction throttling to avoid spikes in application response time.

-- Dynamically routing requests to the best-performing replicas.

The same design that makes HBase's foundation, HDFS, a good fit for batch analytics will ensure that it remains inherently unsuited for the high-velocity, random-access workloads that characterize the NoSQL market.
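The "failover means downtime" argument above contrasts a single owner per region with a design in which any live replica can serve a request and a failed replica is caught up afterward. The sketch below illustrates that idea only; it is not Cassandra's implementation, and ToyReplicaSet, its methods and the replica names are hypothetical.

```python
class ToyReplicaSet:
    """Toy peer-to-peer replica group that keeps serving while a replica is down."""

    def __init__(self, names):
        self.data = {name: {} for name in names}
        self.up = set(names)
        # Writes a down replica missed, kept so it can be caught up later.
        self.missed = {name: [] for name in names}

    def fail(self, name):
        self.up.discard(name)

    def write(self, key, value):
        """Apply the write on live replicas; return how many acknowledged it."""
        for name in self.data:
            if name in self.up:
                self.data[name][key] = value
            else:
                self.missed[name].append((key, value))
        return len(self.up)

    def read(self, key):
        if not self.up:
            raise RuntimeError("no replica available")
        return self.data[min(self.up)].get(key)

    def recover(self, name):
        """Deliver the missed writes in order, then rejoin the group."""
        for key, value in self.missed[name]:
            self.data[name][key] = value
        self.missed[name].clear()
        self.up.add(name)


def demo_replicas():
    cluster = ToyReplicaSet(["a", "b", "c"])
    cluster.write("k", "v1")
    cluster.fail("b")
    acks = cluster.write("k", "v2")  # still succeeds with two replicas
    seen = cluster.read("k")
    cluster.recover("b")
    return acks, seen, cluster.data["b"]["k"]


if __name__ == "__main__":
    print(demo_replicas())
```

The toy ignores conflict resolution and assumes every live replica receives every write, so it shows only the availability half of the tradeoff, not the consistency questions raised earlier in the debate.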

Jonathan Ellis is chief technology officer and co-founder at DataStax, where he sets the technical direction and leads Apache Cassandra as project chair.

Comments
D. Henschen,
User Rank: Author
8/6/2013 | 8:22:42 PM
re: Big Data Debate: Will HBase Dominate NoSQL?
"Dominant" doesn't discount the opportunity for diversity, though, I'll admit, it's a somewhat simplistic construction meant to spark debate. The question was NOT posed as an either/or. DataStax chose (for obvious reasons) to focus on HBase vs. Cassandra. I do think many people have big expectations for HBase because of its tie with Hadoop. Perhaps a bigger role will emerge if some of the flaws DataStax points to can be addressed.
D. Henschen,
User Rank: Author
8/6/2013 | 6:22:59 PM
re: Big Data Debate: Will HBase Dominate NoSQL?
This debate sparks many questions for me. Can open source purists, for example, detail how HBase will overcome the flaws that DataStax cites -- some of which MapR has addressed, albeit with a proprietary, commercial approach? DataStax can certainly say it integrates Cassandra with Hadoop (providing the same shared infrastructure advantages of the combo of HBase and Hadoop), but why do I hear little to nothing about customers relying on DataStax for their Hadoop deployments? Can you name names of customers that actually do it all (Cassandra, Hadoop and Solr) with DataStax' software? The focus is clearly on Cassandra.

Hortonworks and Cloudera, what's your take, as you clearly have a big stake in HBase success?