8/5/2013
04:28 PM

Big Data Debate: Will HBase Dominate NoSQL?

HBase offers both scalability and the economy of sharing the same infrastructure as Hadoop, but will its flaws hold it back? NoSQL experts square off.

HBase is modeled after Google BigTable and is part of the world's most popular big data processing platform, Apache Hadoop. But will this pedigree guarantee HBase a dominant role in the competitive and fast-growing NoSQL database market?

Michael Hausenblas of MapR argues that Hadoop's popularity and HBase's scalability and consistency ensure success. The growing HBase community will surpass other open-source movements and will overcome a few technical wrinkles that have yet to be worked out.

Jonathan Ellis of DataStax, the support provider behind open-source Cassandra, argues that HBase flaws are too numerous and intrinsic to Hadoop's HDFS architecture to overcome. These flaws will forever limit HBase's applicability to high-velocity workloads, he says.

Read what our two NoSQL experts have to say, and then weigh in with your opinion in the comments section below.

For The Motion

Michael Hausenblas
Chief Data Engineer EMEA, MapR Technologies

Integration With Hadoop Will Drive Adoption

The answer to the question is a crystal-clear "Yes, but…"

To appreciate this response, we need to step back a bit and understand the question in context. Both Martin Fowler, in 2011, and Mike Stonebraker, in 2005, took up the polyglot persistence argument that "one size does not fit all."

Hence, I'm going to interpret the "dominant" in the question not in the sense of the market-share measures applied to relational databases over the past 10 years, but along the lines of, "Will Apache HBase be used across a wider range of use cases and have a bigger community behind it than other NoSQL databases?"

This is a bold assertion given that there are more than 100 different NoSQL options to choose from, including MongoDB, Riak, Couchbase, Cassandra and many, many others. But in this big-data era, the trend is away from specialized information silos to large-scale processing of varied data, so even a popular solution such as MongoDB will be surpassed by HBase.

Why? MongoDB has well-documented scalability issues, and with the fast-growing adoption of Hadoop, the NoSQL solution that integrates directly with Hadoop has a marked advantage in scale and popularity. HBase has a huge and diverse community under its belt in all respects: users, developers, multiple commercial vendors and availability in the cloud, the last through Amazon Web Services (AWS), for example.

Historically, HBase and Cassandra have a lot in common. HBase was created in 2007 at Powerset (later acquired by Microsoft), was initially part of Hadoop and then became a top-level Apache project. Cassandra originated at Facebook in 2007, was open sourced and then incubated at Apache, and is nowadays also a top-level project. Both HBase and Cassandra are wide-column key-value datastores that excel at ingesting and serving huge volumes of data while being horizontally scalable, robust and elastic.

There are philosophical differences in the architectures: Cassandra borrows many design elements from Amazon's Dynamo system, has an eventual consistency model and is write-optimized, while HBase is a Google BigTable clone with read optimization and strong consistency. An interesting proof point for the superiority of HBase is the fact that Facebook, the creator of Cassandra, replaced Cassandra with HBase for their internal use.

From an application developer's point of view, HBase is preferable as it offers strong consistency, making life easier. One of the misconceptions about eventual consistency is that it improves write speed: under sustained write traffic, latency is affected and one ends up paying the "eventual consistency tax" without getting its benefits.
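To make the consistency trade-off concrete, here is a minimal sketch of the replica-overlap rule commonly used to reason about Dynamo-style stores: with N replicas, a read that contacts R replicas is guaranteed to touch at least one replica holding the latest acknowledged write (acknowledged by W replicas) only when R + W > N. The function name and the example settings are illustrative assumptions, not taken from either project's code.

```python
def read_sees_latest_write(n_replicas: int, write_acks: int, read_replicas: int) -> bool:
    """Return True when any read set must overlap any acknowledged write set.

    Overlap is guaranteed exactly when read_replicas + write_acks > n_replicas.
    """
    for name, value in (("write_acks", write_acks), ("read_replicas", read_replicas)):
        if not 1 <= value <= n_replicas:
            raise ValueError(f"{name} must be between 1 and {n_replicas}")
    return read_replicas + write_acks > n_replicas


if __name__ == "__main__":
    # Hypothetical settings for a 3-replica cluster.
    for w, r in [(1, 1), (2, 2), (3, 1)]:
        print(f"N=3 W={w} R={r} -> overlap guaranteed: {read_sees_latest_write(3, w, r)}")
```

With W=1 and R=1, writes are acknowledged after the fewest replicas respond, but a read may miss the newest value; raising W or R buys guaranteed overlap at the cost of waiting on more replicas.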

There are some technical limitations with almost all NoSQL solutions, like compactions affecting consistent low latency, inability to shard automatically, reliability issues and long recovery times for node outages. Here at MapR, we've created a "next version" of enterprise HBase that includes instant recovery, seamless sharding and high availability, and that gets rid of compactions. We brought it into GA under the label M7 in May 2013 and it's available in the cloud via AWS Elastic MapReduce.

Last but not least, HBase has -- through its legacy as a Hadoop contribution project -- a strong and solid integration into the entire Hadoop ecosystem, including Apache Hive and Apache Pig.

Summarizing, HBase will be the dominant NoSQL platform for use cases where fast and small-size updates and look-ups at scale are required. Recent innovations have also provided architectural advantages to eliminate compactions and provide truly decentralized co-ordination.

Michael Hausenblas is chief data engineer, EMEA, at MapR Technologies. His background is in large-scale data integration research and development, advocacy and standardization.

Against The Motion

Jonathan Ellis
Co-founder & CTO, DataStax

HBase Is Plagued By Too Many Flaws

NoSQL includes several specialties such as graph databases and document stores where HBase does not compete, but even within its category of partitioned row store, HBase lags behind the leaders. The technical shortcomings driving HBase's lackluster adoption fall into two major categories: engineering problems that can be addressed given enough time and manpower, and architectural flaws that are inherent to the design and cannot be fixed.

Engineering Problems

-- Operations are complex and failure prone. Deploying HBase involves configuring, at a minimum, a ZooKeeper ensemble, primary HMaster, secondary HMaster, RegionServers, active NameNode, standby NameNode, HDFS quorum journal manager and DataNodes. Installation can be automated, but if it's too difficult to install without help, how are you going to troubleshoot it when something goes wrong during, for instance, RegionServer failover or a lower-level NameNode failure? HBase requires substantial expertise to even know what to monitor, and God help you if you need regular backups.

-- RegionServer failover takes 10 to 15 minutes. HBase partitions rows into regions, each managed by a RegionServer. The RegionServer is a single point of failure for its region; when it goes down, a new one must be selected and write-ahead logs must be replayed before writes or reads can be served again.

-- Developing against HBase is painful. HBase's API is clunky and Java centric. Non-Java clients are relegated to the second-class Thrift or REST gateways. Contrast that with the Cassandra Query Language, which offers developers a familiar, productive experience in all languages.

-- The HBase community is fragmented. The Apache mainline is widely understood to be unstable. Cloudera, Hortonworks, and advanced users maintain their own patch trees on top. Leadership is divided and there is no clear roadmap. Conversely, the open-source Cassandra community includes committers from DataStax, Netflix, Spotify, Blue Mountain Capital, and others working together without cliques or forks.
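The failover point in the list above can be illustrated with a toy model. This is a simplified sketch, not HBase code: the `Region` and `Table` classes, their methods and the split key are hypothetical, and real HBase keeps its write-ahead log on HDFS and reassigns regions through the HMaster. The sketch only shows why rows in a failed region are unavailable until the log has been replayed, while rows in other regions keep being served.

```python
import bisect


class Region:
    """One contiguous range of row keys, served by a single (toy) server."""

    def __init__(self, start_key: str):
        self.start_key = start_key
        self.wal = []        # write-ahead log: survives a crash in this model
        self.memstore = {}   # in-memory state: lost on crash
        self.online = True

    def put(self, key, value):
        if not self.online:
            raise RuntimeError("region offline")
        self.wal.append((key, value))   # log first, then apply
        self.memstore[key] = value

    def get(self, key):
        if not self.online:
            raise RuntimeError("region offline")
        return self.memstore.get(key)

    def crash(self):
        self.online = False
        self.memstore = {}

    def recover(self):
        # Replay the log in order before serving any request again.
        for key, value in self.wal:
            self.memstore[key] = value
        self.online = True


class Table:
    """Routes each row key to the region whose range contains it."""

    def __init__(self, split_keys):
        self.starts = [""] + sorted(split_keys)
        self.regions = [Region(s) for s in self.starts]

    def region_for(self, key: str) -> Region:
        idx = bisect.bisect_right(self.starts, key) - 1
        return self.regions[idx]
```

For example, with `Table(["m"])`, a crash of the region holding "apple" makes that key unreadable until `recover()` finishes, while "zebra" in the second region stays readable throughout.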

Overall, the engineering gap between HBase and other NoSQL platforms has increased since I've been observing the NoSQL ecosystem. When I first evaluated them, I would have put HBase six months behind Cassandra in engineering progress, but today that lead has widened to about two years.

Architectural Flaws

-- Master-oriented design makes HBase operationally inflexible. Routing all reads and writes through the RegionServer master means that active/active asynchronous replication across multiple datacenters is not possible for HBase, nor can you perform workload separation across different replicas in a cluster. By contrast, Cassandra's peer-to-peer replication allows seamless integration of Hadoop, Solr and Cassandra with no ETL while allowing you to opt in to lightweight transactions in the rare cases when you need linearizability.

-- Failover means downtime. Even one minute of downtime is simply not acceptable in many applications, and this is an intrinsic problem with HBase's design; each RegionServer is a single point of failure. A fully distributed design instead means that when one replica goes down, there is no need for special-case histrionics to recover; the system keeps functioning normally with the other replicas and can catch up the failed one later.

-- HDFS is primarily designed for streaming access to large files. HBase is built on a distributed file system optimized for batch analytics. This is directly responsible for HBase's poor performance, particularly for reads, and particularly on solid-state disks. Just as relational databases haven't been able to optimize B-tree engines designed 30 years ago for pre-big-data workloads, HDFS won't be able to undo the tradeoffs it made for what is still its primary purpose and close the gap on critical functionality:

-- Mixing solid state and hard disks in a single cluster and pinning tables to workload-appropriate media.

-- Snapshots, incremental backups, and point-in-time recovery.

-- Compaction throttling to avoid spikes in application response time.

-- Dynamically routing requests to the best-performing replicas.

The same design that makes HBase's foundation, HDFS, a good fit for batch analytics will ensure that it remains inherently unsuited for the high velocity, random access workloads that characterize the NoSQL market.
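The "fully distributed" alternative described under the failover point above can be sketched in the same toy style. This is an illustration of the general idea of serving from the surviving replicas and replaying missed writes later (often called hinted handoff), not Cassandra's implementation: the class and method names are hypothetical, and the sketch ignores timestamps, conflict resolution and consistency levels.

```python
class Replica:
    def __init__(self, name: str):
        self.name = name
        self.data = {}
        self.up = True


class ReplicaGroup:
    """Peers that all hold the same keys; no single replica is special."""

    def __init__(self, names):
        self.replicas = [Replica(n) for n in names]
        self.hints = {n: [] for n in names}   # writes missed while a replica was down

    def write(self, key, value) -> int:
        acks = 0
        for r in self.replicas:
            if r.up:
                r.data[key] = value
                acks += 1
            else:
                self.hints[r.name].append((key, value))
        if acks == 0:
            raise RuntimeError("no live replicas")
        return acks

    def read(self, key):
        # Any live replica can answer; here we simply take the first one.
        for r in self.replicas:
            if r.up:
                return r.data.get(key)
        raise RuntimeError("no live replicas")

    def bring_back(self, name: str):
        # Catch the returning replica up before it serves requests again.
        replica = next(r for r in self.replicas if r.name == name)
        for key, value in self.hints[name]:
            replica.data[key] = value
        self.hints[name].clear()
        replica.up = True
```

In this model, taking one of three replicas down leaves reads and writes available on the other two, and `bring_back` replays the missed writes to the returning replica.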

Jonathan Ellis is chief technology officer and co-founder at DataStax, where he sets the technical direction and leads Apache Cassandra as project chair.

Comments
vrodionov,
User Rank: Apprentice
8/21/2013 | 6:48:49 PM
re: Big Data Debate: Will HBase Dominate NoSQL?
Cassandra's "flexible data placement (a.k.a. SSD support)" is not that good. You put the whole column family on SSD; eventually the CF will exceed the SSD size, and then what? It is not hot-data-set caching per se.
mhausenblas,
User Rank: Apprentice
8/13/2013 | 6:03:47 PM
re: Big Data Debate: Will HBase Dominate NoSQL?
Valid point, yes. The argument was along the line: FB created Cassandra in the first place, then replaced it with something else (which happened to be HBase). Not the strongest argument, I admit, more an indicator.

However, as I said in the first paragraph: it's all relative, really. One size doesn't fit all in the data storage and processing world (aka polyglot persistence). In this context I'd like to encourage everyone who hasn't done so already to read Stonebraker's excellent piece (from 2005!): http://citeseerx.ist.psu.edu/v...
EricL755,
User Rank: Apprentice
8/12/2013 | 10:48:03 PM
re: Big Data Debate: Will HBase Dominate NoSQL?
I am not sure how this holds up as a proof point: "An interesting proof point for the superiority of HBase is the fact that Facebook, the creator of Cassandra, replaced Cassandra with HBase for their internal use." Why does Facebook choosing it mean that it's superior?

This is using the argument from authority logic. In other words, if most of what Facebook engineering does is right and they choose HBase, then it must be right. There is certainly no question as to whether or not Facebook is full of brilliant engineers. But there are plenty of other companies that do amazing things with technology who have made the decision to go with Cassandra. You can't say that HBase is a good choice simply because Facebook uses it.
vrodionov,
User Rank: Apprentice
8/12/2013 | 6:39:06 PM
re: Big Data Debate: Will HBase Dominate NoSQL?
"Compaction throttling to avoid spikes in application response time.- M7 does not have any Compactions - Done"

No compactions? Does M7 overwrite data in place?

The major issue with M3/5/7 is that it does not provide easy migration/upgrade from existing Hadoop/HBase to MapR's distribution. At least, this was the case in late 2011. Besides this, it's proprietary technology.
nbandugula,
User Rank: Apprentice
8/9/2013 | 9:04:54 PM
re: Big Data Debate: Will HBase Dominate NoSQL?
To complete the story on MapR's innovation that Michael referred to, here are some things we have done with MapR M7 to make HBase applications enterprise-grade:

Following Ellis' lead:

a. Master-oriented design makes HBase operationally inflexible. - M7 does not have a region server architecture - Done

b. Failure means Downtime - M7 does not have single points of failure and recovers in less than a minute - Done

c. HDFS is designed for streaming access to large files - M7 does not rely on HDFS - Done

-- Mixing solid state and hard disks in a single cluster and pinning tables to workload-appropriate media. - M7 works with disparate hardware including SSDs - Done

-- Snapshots, incremental backups, and point-in-time recovery. - M7 provides all of these features - Done

-- Compaction throttling to avoid spikes in application response time. - M7 does not have any Compactions - Done

-- Dynamically routing requests to the best-performing replicas. - M7 delivers this functionality as well - Done

Plus M7 is a complete distribution for Apache Hadoop that supports more than a dozen Apache projects and a wide variety of 3rd party tools including for SQL query.
RSCHUMACHER400,
User Rank: Apprentice
8/7/2013 | 7:41:55 PM
re: Big Data Debate: Will HBase Dominate NoSQL?
Hi Doug - please see our customers page for details (http://www.datastax.com/custom...), but in brief, we do have customers that use more than just Cassandra (C*). On our customers page you'll find examples like MarkedUp (all 3), eBay (C* and Hadoop), Datafiniti (C* and Solr), HealthCare Anytime (all 3), Constant Contact (C* and Hadoop), SimpleReach (C* and Hadoop), Boxever (C* and Hadoop), and Skillpages (all 3).
vrodionov,
User Rank: Apprentice
8/7/2013 | 12:11:13 AM
re: Big Data Debate: Will HBase Dominate NoSQL?
Mr. Ellis, everyone here understands that your analyses and opinions, as well as all the test results you refer to, are highly biased in favor of Cassandra. I lmao (yeah, I know some basic slang) when I read the PDF you linked to here. 90 ms read latency? Did the authors read data from another data center? In the case of HBase? When all data fits in the block cache or OS page cache, the read latency is less than 1 ms (actually, it's 0.4-0.5 ms on average). We (the company I work for) have been routinely running different workloads on HBase in dev, staging and production for more than three years already, and the stability, performance and feature set of HBase are getting better with every new version. For me (and for many others), the major advantages of HBase are:

1. Tight integration into the Hadoop/HDFS stack. I think it's the major one, and this will eventually bring HBase to the top of the NoSQL crowd.

2. Extensibility. Coprocessors are a very good feature for anyone trying to implement something more complex than a simple K-V lookup.

3. Can I say that HBase is more SQL - friendly? Phoenix, Hive?

HBase (properly tuned and configured) is unbeatable in write-heavy workloads. We can get far more than 1M writes per second from a 20-node cluster (not from 200, as Mr. Netflix guy). Yes, the cluster and clients are tuned and use all recommended performance tips. Complex? Maybe. But eventually everything will become available out of the box, without any additional tuning.

You are so proud of Cassandra's random-read "domination" (due mostly to the row cache in Cassandra and the lack thereof in HBase), but I would like to point out that Cassandra's caches (both key and row) are half-baked and the implementation is far from optimal (you still keep keys in the Java heap?). Sorry, I am not following the latest advancements in Cassandra development now. Moreover, the lack of a good block cache in Cassandra makes it less suitable for short scan operations (one of the reasons Facebook decided in favor of HBase). For me, personally, it's a deal breaker, because so many real customer workloads fall into the "short scan operation" category. Another deal breaker is the lack of real Hadoop integration.

Random read performance in HBase (I do not think it's really worse than Cassandra's) can be increased by introducing a RowCache into HBase, and when that happens, I think we will have an indisputable winner, Mr. Ellis. It's doable and it is going to happen pretty soon.
D. Henschen,
User Rank: Author
8/6/2013 | 8:22:42 PM
re: Big Data Debate: Will HBase Dominate NoSQL?
"Dominant" doesn't discount the opportunity for diversity, though, I'll admit, it's a somewhat simplistic construction meant to spark debate. The question was NOT posed as an either/or. DataStax chose (for obvious reasons) to focus on HBase vs. Cassandra. I do think many people do have big expectations for HBase because of its tie with Hadoop. Perhaps a bigger role will emerge if some of the flaws DataStax points to can be addressed.
EB Quinn,
User Rank: Apprentice
8/6/2013 | 6:28:01 PM
re: Big Data Debate: Will HBase Dominate NoSQL?
A bit of a silly premise, and definitely not an either/or scenario: HBase will clearly be used when Hadoop is used - end of story. Cassandra isn't going to displace HBase, but will co-exist to handle other, related use cases more elegantly. Plus, MongoDB will be used as a more modern-era alternative to MySQL, HANA will be used to fly through SAP analytics, and MarkLogic excels at content-oriented apps. And there are several dedicated cloud databases too. The NoSQL (Not Only SQL) movement gains strength from diversity, and has pushed Oracle, IBM and Microsoft to offer up columnar options, for example. But at this point NONE of the NoSQL databases could be considered dominant, and despite the growing popularity of Hadoop, there is no way HBase is going to extend into a more general-purpose DB: it lacks the architectural chops (pointed out nicely by Mr. Ellis), and it lacks the expertise base with the chops.

When the day comes when there are more production Hadoop implementations than the combination of SAS, ODW, Teradata, IBM's many options, SAP BW and HANA, MicroStrategy, Tableau, etc., etc., etc., well, maybe we can talk dominant down one DNA strain of the industry. That will take quite a while.
D. Henschen,
User Rank: Author
8/6/2013 | 6:22:59 PM
re: Big Data Debate: Will HBase Dominate NoSQL?
This debate sparks many questions for me. Can open source purists, for example, detail how HBase will overcome the flaws that DataStax cites -- some of which MapR has addressed, albeit with a proprietary, commercial approach? DataStax can certainly say it integrates Cassandra with Hadoop (providing the same shared infrastructure advantages of the combo of HBase and Hadoop), but why do I hear little to nothing about customers relying on DataStax for their Hadoop deployments? Can you name names of customers that actually do it all (Cassandra, Hadoop and Solr) with DataStax' software? The focus is clearly on Cassandra.

Hortonworks and Cloudera, what's your take, as you clearly have a big stake in HBase success?