Facebook's Ken Rudin says relational databases and Hadoop are complementary. Hortonworks and Teradata execs add guidelines on when to choose which.
It's not either-or, it's both. This was a big theme at this week's Strata/Hadoop World conference in New York as executives from Facebook, Hortonworks and Teradata all discussed how relational database management systems (RDBMS) and Hadoop fit together.
Facebook's Ken Rudin, head of analytics, knows the needs of ordinary enterprises all too well, having worked at both Salesforce.com and Siebel before founding SaaS-based BI vendor LucidEra. (After LucidEra was sold, Rudin ventured into big data analytics, first at Zynga and now at Facebook.) So it wasn't shocking to hear Rudin extol the virtues of relational databases, even though Facebook is a big Hadoop user.
"We keep the low-level, most granular, detailed data in Hadoop at Facebook, but move the transformed and aggregated data into relational databases because slicing and dicing is faster and easier," Rudin said during a keynote presentation.
In a follow-up interview, Rudin told me that Hadoop is essentially the discovery platform where data scientist types have the flexibility to explore data without the constraints of a predefined data model. There they can make novel discoveries about trends, behavior patterns and relationships.
"We'll find out what's important through ad hoc analysis on Hadoop, and once that's known, we'll aggregate the data across the dimensions that we need and put that into a relational environment," Rudin explained. "With a relational system I can get answers in seconds instead of tens of minutes."
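The pattern Rudin describes, aggregating granular data across a few chosen dimensions and loading the result into a relational store, can be sketched in a few lines. This is only an illustration under stated assumptions: the event records, the `daily_actions` table and the function names are hypothetical, a Python list stands in for data held in Hadoop, and SQLite stands in for whatever relational system a company actually uses.

```python
import sqlite3
from collections import defaultdict

# Hypothetical raw events, standing in for granular records kept in Hadoop.
RAW_EVENTS = [
    {"day": "2013-10-28", "country": "US", "action": "like", "user_id": 1},
    {"day": "2013-10-28", "country": "US", "action": "like", "user_id": 2},
    {"day": "2013-10-28", "country": "BR", "action": "share", "user_id": 3},
    {"day": "2013-10-29", "country": "US", "action": "share", "user_id": 1},
    {"day": "2013-10-29", "country": "US", "action": "like", "user_id": 1},
]


def aggregate(events, dimensions):
    """Count events per combination of the chosen dimensions."""
    counts = defaultdict(int)
    for event in events:
        key = tuple(event[d] for d in dimensions)
        counts[key] += 1
    return dict(counts)


def load_summary(conn, counts):
    """Store the aggregated rows in a relational table for fast slicing."""
    conn.execute(
        "CREATE TABLE daily_actions "
        "(day TEXT, country TEXT, action TEXT, events INTEGER)"
    )
    conn.executemany(
        "INSERT INTO daily_actions VALUES (?, ?, ?, ?)",
        [(*key, n) for key, n in counts.items()],
    )
    conn.commit()


def events_by_country(conn):
    """Slice the summary table by one dimension with plain SQL."""
    rows = conn.execute(
        "SELECT country, SUM(events) FROM daily_actions "
        "GROUP BY country ORDER BY country"
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")  # in-memory database, assumed for the demo
load_summary(conn, aggregate(RAW_EVENTS, ("day", "country", "action")))
print(events_by_country(conn))  # [('BR', 1), ('US', 4)]
```

Once the summary table exists, each new slice is a short SQL query against a small table rather than another pass over the raw events, which is the speed difference Rudin is pointing to.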
There are some analyses that have to stay in the big data realm of Hadoop, including graph analyses and optimizations involving complicated computations that just don't take to SQL. Facebook has its own graph analysis technology for finding relationships among people at scale, and this is the stuff that really makes Facebook tick as a social network.
In an example of optimization, Facebook has to figure out which items, out of thousands of possibilities, to put in user news feeds. It's not unlike the advertising optimization work that Internet giants routinely handle in Hadoop.
"It's not a metrics and dimensions problem, it's a long, linear equation, and we need to process it for one million people at a time," Rudin explained.
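To make the contrast with "metrics and dimensions" concrete, here is a minimal sketch of what scoring feed candidates with a linear equation might look like. It is not Facebook's method: the feature names, the weights and the `top_items` helper are all hypothetical, and a real system would evaluate far more terms per item across far more users.

```python
import heapq

# Hypothetical weights; the article does not describe Facebook's actual terms.
WEIGHTS = {"affinity": 2.0, "recency": 1.0, "comments": 0.5}


def score(features, weights=WEIGHTS):
    """Evaluate one linear equation: the weighted sum of an item's features."""
    return sum(weights[name] * features.get(name, 0.0) for name in weights)


def top_items(candidates, k):
    """Return the ids of the k highest-scoring candidate items."""
    best = heapq.nlargest(k, candidates.items(), key=lambda kv: score(kv[1]))
    return [item_id for item_id, _ in best]


# Hypothetical candidates for one user's feed.
candidates = {
    "photo": {"affinity": 0.5, "recency": 1.0, "comments": 2.0},
    "link": {"affinity": 1.0, "recency": 0.5},
    "status": {"affinity": 0.25, "recency": 0.25, "comments": 1.0},
}

print(top_items(candidates, 2))  # ['photo', 'link']
```

The work here is arithmetic repeated per item and per user rather than grouping and joining rows, which is why it fits a general-purpose processing platform better than a SQL query.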
So when do you choose Hadoop and when do you choose relational? That's a topic Stephen Brobst, CTO at Teradata, and Ari Zilka, CTO at Hortonworks, took up in a discussion of the best uses of relational databases and Hadoop. To my great surprise, Hortonworks exec Zilka made the case for relational databases while Brobst made the case for Hadoop.
Zilka was a cofounder of the Terracotta in-memory database, so he's no stranger to the relational world, but hearing a relational database vendor exec like Brobst make a strong case for Hadoop was refreshing.
"The kinds of problems you're trying to solve [with Hadoop] are not about generating a report," Brobst explained. "Hadoop is for much more sophisticated uses like analyzing text on Web pages or analyzing relationships, and you use techniques like machine learning, scoring and building search indexes that solve very different problems."
Relational databases have effectively joined the big data world, Zilka argued, by way of massively parallel processing. MPP is the architecture behind relational products including Teradata, Pivotal Greenplum, IBM PureData System for Analytics (formerly Netezza), Actian's ParAccel, HP Vertica, Microsoft SQL Server PDW and others.
"There's nothing about relational that is too old or too stodgy or too small to handle the data volume of even the largest transactional data sets," Zilka argued.
Still need help understanding which platform to choose? Zilka and Brobst ended with a nice list of attributes to consider:
-- Stable schema = RDBMS; evolving schema = Hadoop
-- Structured data = RDBMS; variably structured data = Hadoop
-- ANSI SQL = RDBMS; flexible programming = Hadoop
-- Cleaned data = RDBMS; raw data = Hadoop
-- Updates/deletes = RDBMS; ingest = Hadoop
-- Core data = RDBMS; all data = Hadoop
-- Complex joins = RDBMS; complex processing = Hadoop
-- Efficient use of CPU/IO = RDBMS; low-cost storage = Hadoop
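For readers who want to apply the list to a specific workload, the attribute pairs above can be encoded as a simple tally. This is a sketch only: Zilka and Brobst did not propose a scoring scheme, so the equal weighting of attributes, the dictionary keys and the `tally` function are assumptions made for illustration.

```python
# Each attribute pair from the list above: (RDBMS trait, Hadoop trait).
# The keys are hypothetical labels; the traits come from the speakers' list.
CHECKLIST = {
    "schema": ("stable", "evolving"),
    "data_shape": ("structured", "variably structured"),
    "interface": ("ANSI SQL", "flexible programming"),
    "data_quality": ("cleaned", "raw"),
    "write_pattern": ("updates/deletes", "ingest"),
    "scope": ("core data", "all data"),
    "workload": ("complex joins", "complex processing"),
    "priority": ("efficient CPU/IO", "low-cost storage"),
}


def tally(answers):
    """Count how many described attributes point to each platform.

    Equal weighting is an assumption; the speakers gave a list, not a score.
    """
    counts = {"RDBMS": 0, "Hadoop": 0}
    for attribute, answer in answers.items():
        rdbms_trait, hadoop_trait = CHECKLIST[attribute]
        if answer == rdbms_trait:
            counts["RDBMS"] += 1
        elif answer == hadoop_trait:
            counts["Hadoop"] += 1
        else:
            raise ValueError(f"unknown answer for {attribute!r}: {answer!r}")
    return counts


# Hypothetical workload: exploring raw clickstream logs.
clickstream = {
    "schema": "evolving",
    "data_shape": "variably structured",
    "data_quality": "raw",
    "write_pattern": "ingest",
    "interface": "ANSI SQL",
}

print(tally(clickstream))  # {'RDBMS': 1, 'Hadoop': 4}
```

A lopsided tally suggests a clear home for the workload; a close one is a hint that, as the speakers argued, the answer may be both platforms.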
Rudin, Zilka and Brobst all supported the notion that Hadoop is more closely aligned with what you might call bigger, more exploratory questions. You don't even attempt to bring uniform structure to the data, as you would with a relational database, because you don't even know what questions you want to ask yet.
"In the big data world, all data has value, you just haven't found it yet," Brobst explained. "If you use a different economic model, leverage the open source characteristics of Hadoop, and leverage commodity storage and servers, you can store multiple orders of magnitude more data. That data lake allows us to explore the data and discover where that value is."
So there you have it: Hadoop and RDBMS are destined to live together. There may not be peace and harmony in the early stages, as teams compete for budgets and workloads. But if the interests of the business are to be best served, live together they will.