Hadoop: 5 Undeniable Truths - InformationWeek

Commentary
10/23/2014
11:06 AM
Ashish Thusoo

Hadoop: 5 Undeniable Truths

Yes, you still need a traditional data warehouse after beginning work with Hadoop. Here are four more fundamental points to know about the big data platform.

"Is the Data Warehouse Dead?" screamed IBM Data Magazine in 2013, warning of the doom awaiting the relational database.

"The Death of MapReduce," proclaimed a blog post on the Paper Trail earlier this year, when Google said it had adopted DataFlow and stopped using MapReduce.

Everyone knows sensationalist headlines can be distracting or inaccurate. The real problem with such overblown headlines is this: Superficial debates are holding back the true potential of Hadoop and big data, and slowing the evolution of traditional databases.

At Qubole we often receive calls from potential customers who are confused about Hadoop's capabilities. They believe that Hadoop is the savior for their newest analytics project and that it can replace all the functions of their existing data warehouses. Confusion at such a basic level leads to a poor customer experience, wasted dollars, and headaches -- all of which could be avoided with better education.

So let's set the record straight. Here are 5 simple truths about Hadoop:

1. You still need a traditional data warehouse. Traditional data warehouses allow for high-fidelity data and subsequent analysis, which are ultimately fundamental to businesses. Data warehouses make powerful use of structured, relational data, whereas Hadoop excels at managing unstructured, semi-structured or log data that classic data warehouses can't handle well. The two make an attractive odd couple.

2. Hadoop isn't great at real-time analytics. Hadoop is a great fit for staging vast amounts of raw data in order to extract summaries that can then be loaded into traditional enterprise data warehouses to conduct low-latency analytics. Real-time analytics, while making great advances on Hadoop with tools such as Presto and Apache Spark, is still best served by traditional databases.

3. A Hadoop-only strategy is dangerous. Why would you need a Hadoop solution to process 10 GB of highly structured data? Yet we run into customers wanting to use Hadoop for exactly those small-scale needs. Sacrificing traditional data warehouses and relying solely on Hadoop is a dangerous move. A traditional database is still a necessity for managing day-to-day business operations, and the majority of businesses simply don't have the resources or expertise needed to run a Hadoop cluster for every data query. Given how critical it is for a Hadoop initiative to prove initial return quickly, attempting to use the platform in ways it is not intended to be used will create disillusionment toward Hadoop and its true capabilities.

4. Hadoop is difficult to use. Praise for Hadoop and the promise of big data have created a magical haze around the technology that can mask its complexity. An investment in Hadoop requires an investment in a cluster management team in addition to infrastructure. Many Hadoop users come to us or another managed service provider because they didn't have the capacity to manage their Hadoop clusters or scale up to meet their customers' demands. With a limited budget, the question came down to hiring new talent and investing in additional clusters or denying requests.

5. You don't need the whole ecosystem. If your organization is actively involved in the open source community, you likely already know that you don't need to use the entire zoo of Hadoop tools. However, those on the business end frequently misunderstand the purpose of the tools and don't know that their business probably requires the use of only one or two engines. For example, we would advise those looking for a SQL-on-Hadoop tool for data exploration to turn to Presto or another SQL-on-Hadoop option rather than Hive, since Hive doesn't offer interactive speeds.
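
To make the division of labor in point 2 concrete, here is a minimal sketch of the "summarize in batch, then load into the warehouse" pattern. Everything in it is an assumption for illustration: the log format, the `page_hits` table, and the function names are hypothetical, plain Python stands in for a Hadoop batch job, and an in-memory SQLite database stands in for the traditional data warehouse.

```python
import sqlite3
from collections import Counter

# Hypothetical raw web-server log lines, standing in for data staged in Hadoop.
RAW_LOGS = [
    "2014-10-23 /home 200",
    "2014-10-23 /pricing 200",
    "2014-10-23 /home 500",
    "2014-10-24 /home 200",
]


def summarize(lines):
    """Batch step: reduce raw log lines to hit counts per (day, page)."""
    counts = Counter()
    for line in lines:
        day, page, _status = line.split()
        counts[(day, page)] += 1
    return counts


def load_into_warehouse(conn, counts):
    """Load step: write the small summary table into the relational store."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_hits (day TEXT, page TEXT, hits INTEGER)"
    )
    conn.executemany(
        "INSERT INTO page_hits VALUES (?, ?, ?)",
        [(day, page, hits) for (day, page), hits in counts.items()],
    )
    conn.commit()


def hits_for_page(conn, page):
    """Query step: a fast lookup against the summary, not the raw logs."""
    row = conn.execute(
        "SELECT SUM(hits) FROM page_hits WHERE page = ?", (page,)
    ).fetchone()
    return row[0] or 0  # SUM over no rows yields NULL, so fall back to 0


conn = sqlite3.connect(":memory:")  # SQLite stands in for the data warehouse
load_into_warehouse(conn, summarize(RAW_LOGS))
print(hits_for_page(conn, "/home"))  # prints 3
```

The heavy scan over raw data happens once, in the batch step; the interactive query touches only the compact summary, which is the kind of work a relational store does well.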

The Hadoop community must make greater efforts to educate users and correct misinformation about Hadoop's capabilities. It's ironic that in the information age, finding accurate and comprehensive information is so difficult and that some of the most helpful conversations continue to take place offline or in forums like Quora. With such a large informational hole to fill, why shouldn't the industry that's rapidly changing the way businesses think about data also change the way we market that technology? To start, let's stop marginalizing other technologies and start playing nicely with others in the field.

Ashish Thusoo is CEO and co-founder of the big-data startup Qubole. Before Qubole, he ran Facebook's data infrastructure team. He is also the co-creator of Apache Hive and served as the project's founding vice president at the Apache Software Foundation.
Comments
D. Henschen,
User Rank: Author
10/23/2014 | 1:17:21 PM
Why Presto?
In the pantheon of SQL-on-Hadoop offerings, I don't hear about Presto too much. Why not recommend Spark SQL, Impala, Drill, a true relational database (HAWQ/Greenplum or Vertica) on Hadoop, or, for that matter, the greatly improved Hive in Hadoop 2.0 that more people are using than any other SQL-on-Hadoop option out there? What does Presto have going for it other than the fact that it's offered by Qubole?
Charlie Babcock,
User Rank: Author
10/23/2014 | 4:36:50 PM
Resolved: sensational headlines should be ended
Judging by his intro, Ashish wishes to eliminate confusion, wasted dollars and sensational headlines. Agreed. And good luck on the latter.
D. Henschen,
User Rank: Author
10/24/2014 | 5:00:25 PM
Re: Why Presto?
"Facebook uses it" is not much of an answer as to why it has technical superiority over other SQL-on-Hadoop offerings. What's the query latency vs. Hive and what's the breadth and depth of SQL or SQL-like support? Readers want to know about performance, versatility and familiarity to SQL developers.