Commentary
10/30/2013
11:37 AM
Doug Henschen

Facebook Exec: Databases, Hadoop Belong Together

Facebook's Ken Rudin says relational databases and Hadoop are complementary. Hortonworks and Teradata execs add guidelines on when to choose which.

It's not either-or, it's both. This was a big theme at this week's Strata/Hadoop World conference in New York as executives from Facebook, Hortonworks and Teradata all discussed how relational database management systems (RDBMS) and Hadoop fit together.

Facebook's Ken Rudin, head of analytics, knows the needs of ordinary enterprises all too well, having worked at both Salesforce.com and Siebel before founding SaaS-based BI vendor LucidEra. (After LucidEra was sold, Rudin ventured into big data analytics, first at Zynga and now at Facebook.) So it wasn't shocking to hear Rudin extol the virtues of relational databases, even though Facebook is a big Hadoop user.

"We keep the low-level, most granular, detailed data in Hadoop at Facebook, but move the transformed and aggregated data into relational databases because slicing and dicing is faster and easier," Rudin said during a keynote presentation.

In a follow-up interview, Rudin told me that Hadoop is essentially the discovery platform where data scientist types have the flexibility to explore data without the constraints of a predefined data model. There they can make novel discoveries about trends, behavior patterns and relationships.


"We'll find out what's important through ad hoc analysis on Hadoop, and once that's known, we'll aggregate the data across the dimensions that we need and put that into a relational environment," Rudin explained. "With a relational system I can get answers in seconds instead of tens of minutes."
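The pattern Rudin describes, rolling granular data up along known dimensions and then querying the result relationally, can be sketched roughly as follows. This is only an illustration, not Facebook's pipeline: the event records, the `daily_actions` table and the function names are all hypothetical, and an in-memory SQLite database stands in for the relational environment.

```python
import sqlite3
from collections import defaultdict

# Hypothetical granular events, standing in for the raw data kept in Hadoop.
RAW_EVENTS = [
    {"day": "2013-10-28", "country": "US", "action": "like", "user": 1},
    {"day": "2013-10-28", "country": "US", "action": "like", "user": 2},
    {"day": "2013-10-28", "country": "US", "action": "share", "user": 1},
    {"day": "2013-10-29", "country": "DE", "action": "like", "user": 3},
    {"day": "2013-10-29", "country": "US", "action": "share", "user": 2},
]


def aggregate(events):
    """Roll raw events up to one count per (day, country, action)."""
    counts = defaultdict(int)
    for event in events:
        counts[(event["day"], event["country"], event["action"])] += 1
    return [(day, country, action, n)
            for (day, country, action), n in sorted(counts.items())]


def load(rows):
    """Load the aggregated rows into a relational table (in-memory SQLite)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE daily_actions "
        "(day TEXT, country TEXT, action TEXT, n INTEGER)"
    )
    conn.executemany("INSERT INTO daily_actions VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def actions_by_country(conn, action):
    """Slice the aggregate: total occurrences of one action per country."""
    cursor = conn.execute(
        "SELECT country, SUM(n) FROM daily_actions "
        "WHERE action = ? GROUP BY country ORDER BY country",
        (action,),
    )
    return cursor.fetchall()
```

Calling `actions_by_country(load(aggregate(RAW_EVENTS)), "like")` answers a slice-and-dice question from the small aggregate table without touching the raw events again, which is the division of labor Rudin outlines.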

There are some analyses that have to stay in the big data realm of Hadoop, including graph analyses and optimizations involving complicated computations that just don't take to SQL. Facebook has its own graph analysis technology for finding relationships among people at scale, and this is the stuff that really makes Facebook tick as a social network.

In an example of optimization, Facebook has to figure out which items, out of thousands of possibilities, to put in user news feeds. It's not unlike the advertising optimization work that Internet giants routinely handle in Hadoop.

"It's not a metrics and dimensions problem, it's a long, linear equation, and we need to process it for one million people at a time," Rudin explained.
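To make the contrast with "metrics and dimensions" concrete, here is a minimal sketch of ranking candidate items with a linear equation. It is not Facebook's model: the feature names, the weights and the `top_items` helper are hypothetical, and a real equation would have far more terms.

```python
import heapq

# Hypothetical feature weights; a real ranking equation would have many more terms.
WEIGHTS = {"affinity": 0.5, "recency": 0.3, "popularity": 0.2}


def score(features, weights=WEIGHTS):
    """Linear score: the weighted sum of an item's feature values."""
    return sum(weights[name] * features.get(name, 0.0) for name in weights)


def top_items(candidates, k):
    """Return the k highest-scoring candidates for one user's feed."""
    return heapq.nlargest(k, candidates, key=lambda item: score(item["features"]))
```

Each user gets a different set of feature values for the same items, so the work is a per-user computation over thousands of candidates rather than a grouped aggregate, which is why it fits poorly into SQL.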

So when do you choose Hadoop and when do you choose relational? That's a topic Stephen Brobst, CTO at Teradata, and Ari Zilka, CTO at Hortonworks, took up in a discussion of the best uses of relational databases and Hadoop. To my great surprise, Hortonworks exec Zilka made the case for relational databases while Brobst made the case for Hadoop.

Zilka was a cofounder of the Terracotta in-memory database, so he's no stranger to the relational world, but hearing a relational database vendor exec like Brobst make a strong case for Hadoop was refreshing.

"The kinds of problems you're trying to solve [with Hadoop] are not about generating a report," Brobst explained. "Hadoop is for much more sophisticated uses like analyzing text on Web pages or analyzing relationships, and you use techniques like machine learning, scoring and building search indexes that solve very different problems."

Relational databases have effectively joined the big data world, Zilka argued, by way of massively parallel processing. MPP is the architecture behind relational products including Teradata, Pivotal Greenplum, IBM PureData System for Analytics (formerly Netezza), Actian's ParAccel, HP Vertica, Microsoft SQL Server PDW and others.

"There's nothing about relational that is too old or too stodgy or too small to handle the data volume of even the largest transactional data sets," Zilka argued.

Still need help understanding which platform to choose? Zilka and Brobst ended with a nice list of attributes to consider:

-- Stable schema = RDBMS; evolving schema = Hadoop
-- Structured data = RDBMS; variably structured data = Hadoop
-- ANSI SQL = RDBMS; flexible programming = Hadoop
-- Cleaned data = RDBMS; raw data = Hadoop
-- Updates/deletes = RDBMS; ingest = Hadoop
-- Core data = RDBMS; all data = Hadoop
-- Complex joins = RDBMS; complex processing = Hadoop
-- Efficient use of CPU/IO = RDBMS; low-cost storage = Hadoop.
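
One way to apply the list above is to treat it as a lookup table and count how many of a workload's attributes point each way. The sketch below does only that; the attribute keys, answer labels and the `tally` function are hypothetical names introduced here, and a simple vote count is an assumption rather than anything Zilka or Brobst proposed.

```python
# The attribute list above, encoded as a lookup table.
# Keys and answer labels are hypothetical names chosen for this sketch.
CHECKLIST = {
    "schema": {"stable": "RDBMS", "evolving": "Hadoop"},
    "structure": {"structured": "RDBMS", "variable": "Hadoop"},
    "interface": {"ansi_sql": "RDBMS", "flexible_programming": "Hadoop"},
    "data_state": {"cleaned": "RDBMS", "raw": "Hadoop"},
    "write_pattern": {"updates_deletes": "RDBMS", "ingest": "Hadoop"},
    "scope": {"core": "RDBMS", "all": "Hadoop"},
    "workload": {"complex_joins": "RDBMS", "complex_processing": "Hadoop"},
    "priority": {"efficient_cpu_io": "RDBMS", "low_cost_storage": "Hadoop"},
}


def tally(workload):
    """Count how many of a workload's attributes point to each platform."""
    votes = {"RDBMS": 0, "Hadoop": 0}
    for attribute, answer in workload.items():
        votes[CHECKLIST[attribute][answer]] += 1
    return votes
```

For example, `tally({"schema": "evolving", "data_state": "raw", "workload": "complex_joins"})` gives two votes for Hadoop and one for RDBMS. A split result like that is consistent with the article's theme that many organizations end up using both.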

Rudin, Zilka and Brobst all supported the notion that Hadoop is more closely aligned with what you might call bigger, more exploratory questions. You don't even attempt to bring uniform structure to the data, as you would with a relational database, because you don't even know what questions you want to ask yet.

"In the big data world, all data has value, you just haven't found it yet," Brobst explained. "If you use a different economic model, leverage the open source characteristics of Hadoop, and leverage commodity storage and servers, you can store multiple orders of magnitude more data. That data lake allows us to explore the data and discover where that value is."

So there you have it: Hadoop and RDBMS are destined to live together. There may not be peace and harmony in the early stages, as teams compete for budgets and workloads. But if the interests of the business are to be best served, live together they will.

Comments
David F. Carr,
User Rank: Author
10/31/2013 | 4:14:45 PM
re: Facebook Exec: Databases, Hadoop Belong Together
Love the summary of the trade offs. I don't know that I've seen that expressed this clearly anywhere else.
D. Henschen,
User Rank: Author
11/1/2013 | 2:11:04 PM
re: Facebook Exec: Databases, Hadoop Belong Together
A lot of relational database fans are Tweeting with satisfaction that this article offers evidence that Hadoop will not replace RDBMS. I don't think knowledgeable Hadoop fans ever made that claim. Cloudera's "center is shifting" argument, for example, never asserted that RDBMSs would go away. That company's latest "Enterprise Data Hub" spin (which echoes the Sears/Metascale vision) sees Hadoop handling all the raw data at high scale. RDBMS becomes a "specialized" warehouse/mart environment for fast analysis of refined, structured data. In other words, think marts and focused operational data warehouses.

The one RDBMS concept that goes away if the Enterprise Data Hub vision takes hold is the all-encompassing enterprise data warehouse (EDW). EDWs mostly fell short of that "enterprisewide" vision, despite costly and time-consuming effort. Keeping up with variable, ever-changing data is something that the RDBMS just doesn't do well. And trying to do it at high scale with an RDBMS is an expensive proposition.