Commentary
10/23/2014
11:06 AM
Ashish Thusoo

Hadoop: 5 Undeniable Truths

Yes, you still need a traditional data warehouse after beginning work with Hadoop. Here are four more fundamental points to know about the big data platform.



"Is the Data Warehouse Dead?" screamed IBM Data Magazine in 2013, warning of the doom awaiting the relational database.

"The Death of MapReduce," proclaimed a blog post on the Paper Trail earlier this year, when Google said it had adopted DataFlow and stopped using MapReduce.

Everyone knows sensationalist headlines can be distracting or inaccurate. The real problem with such overblown headlines is this: Superficial debates are holding back the true potential of Hadoop and big data, and slowing the evolution of traditional databases.


At Qubole we often receive calls from potential customers who are confused about Hadoop's capabilities. They believe that Hadoop is the savior for their newest analytics project and that it can replace all the functions of their existing data warehouses. Confusion at such a basic level leads to a poor customer experience, wasted dollars, and headaches -- all of which could be avoided with better education.

So let's set the record straight. Here are 5 simple truths about Hadoop:

1. You still need a traditional data warehouse. Traditional data warehouses allow for high-fidelity data and subsequent analysis, which are ultimately fundamental to businesses. Data warehouses make powerful use of structured, relational data, whereas Hadoop excels at managing unstructured, semi-structured or log data that classic data warehouses can't handle well. The two make an attractive odd couple.

2. Hadoop isn't great at real-time analytics. Hadoop is a great fit for staging vast amounts of raw data and extracting summaries, which can then be loaded into traditional enterprise data warehouses for low-latency analytics. Real-time analytics, while making great advances on Hadoop with tools such as Presto and Apache Spark, is still best served by traditional databases.

3. A Hadoop-only strategy is dangerous. Why would you need a Hadoop solution to process 10 GB of highly structured data? Yet we run into customers wanting to use Hadoop for exactly those small-scale needs. Sacrificing traditional data warehouses and relying solely on Hadoop is a dangerous move. A traditional database is still a necessity for managing day-to-day business operations, and the majority of businesses simply don't have the resources or expertise needed to run a Hadoop cluster for every data query. Given how critical it is for a Hadoop initiative to prove initial return quickly, attempting to use the platform in ways it is not intended to be used will create disillusionment toward Hadoop and its true capabilities.

4. Hadoop is difficult to use. Praise for Hadoop and the promise of big data have created a magical haze around the technology that can mask its complexity. An investment in Hadoop requires an investment in a cluster management team in addition to infrastructure. Many Hadoop users come to us or another managed service provider because they didn't have the capacity to manage their Hadoop clusters or scale up to meet their customers' demands. With a limited budget, the question came down to hiring new talent and investing in additional clusters, or denying requests.

5. You don't need the whole ecosystem. If your organization is actively involved in the open source community, you don't need to use the entire zoo of Hadoop tools. However, those on the business end frequently misunderstand the purpose of the tools and don't know that their business probably requires the use of only one or two engines. For example, we would advise those looking for a SQL-on-Hadoop tool for data exploration to turn to Presto or a similar engine rather than Hive, since Hive doesn't offer interactive speeds.
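
To make the division of labor in point 2 concrete, here is a minimal sketch of the pattern: a batch step reduces raw log lines to a small summary, and that summary is loaded into a relational store that answers lookups quickly. Everything in it is an illustrative assumption rather than a description of any real deployment. The log format, the `daily_hits` table, and the function names are hypothetical, plain Python stands in for the Hadoop batch job, and an in-memory SQLite database stands in for the enterprise data warehouse.

```python
import sqlite3
from collections import Counter

# Hypothetical raw log lines: the kind of loosely structured data staged in bulk.
RAW_LOGS = [
    "2014-10-21 GET /home 200",
    "2014-10-21 GET /pricing 200",
    "2014-10-21 GET /home 500",
    "2014-10-22 GET /home 200",
    "malformed line",
]


def summarize(lines):
    """Batch step (stand-in for a Hadoop job): count hits per (day, path)."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            # Raw data is messy; skip rows that do not match the assumed format.
            continue
        day, _method, path, _status = parts
        counts[(day, path)] += 1
    return counts


def load_summary(conn, counts):
    """Load step: write the small summary into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_hits ("
        "day TEXT, path TEXT, hits INTEGER, PRIMARY KEY (day, path))"
    )
    rows = [(day, path, hits) for (day, path), hits in sorted(counts.items())]
    conn.executemany("INSERT OR REPLACE INTO daily_hits VALUES (?, ?, ?)", rows)
    conn.commit()


def hits_for(conn, path):
    """Serving step: a keyed lookup against the summary table."""
    cursor = conn.execute(
        "SELECT day, hits FROM daily_hits WHERE path = ? ORDER BY day", (path,)
    )
    return cursor.fetchall()


# SQLite in memory stands in for the warehouse in this sketch.
conn = sqlite3.connect(":memory:")
load_summary(conn, summarize(RAW_LOGS))
print(hits_for(conn, "/home"))  # [('2014-10-21', 2), ('2014-10-22', 1)]
```

The expensive pass over raw data happens once, in batch, while the questions people ask interactively touch only the compact summary table.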

The Hadoop community must make greater efforts to educate users and correct misinformation about Hadoop's capabilities. It's ironic that in the information age, finding accurate and comprehensive information is so difficult and that some of the most helpful conversations continue to take place offline or in forums like Quora. With such a large informational hole to fill, why shouldn't the industry that's rapidly changing the way businesses think about data also change the way we market that technology? To start, let's stop marginalizing other technologies and start playing nicely with others in the field.


Ashish Thusoo is CEO and co-founder of the big-data startup Qubole. Before Qubole, he ran Facebook's data infrastructure team. He is also the co-creator of Apache Hive and served as the project's founding vice president at the Apache Software Foundation.