The enterprise data warehouse (EDW) is the backbone of analytics and business intelligence for most large organizations and many midsize firms. The tools and techniques are proven, the SQL query language is well known, and there's plenty of expertise available to keep EDWs humming.
The downside of many relational data warehousing approaches is that they're rigid and hard to change. You start by modeling the data and creating a schema, but this assumes you know all the questions you'll need to answer. When new data sources and new questions arise, the schema and related ETL and BI applications have to be updated, which usually requires an expensive, time-consuming effort.
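To make the rigidity concrete, here is a minimal sketch of schema-first loading. It uses Python's built-in sqlite3 module as a stand-in for a warehouse; the `purchases` table, the `load_events` helper and the sample records are hypothetical and chosen only for illustration. A record from a new source that carries an extra field is rejected until the schema is changed.

```python
import sqlite3

def load_events(conn, events):
    """Insert dict records into a table whose columns were fixed up front."""
    for event in events:
        columns = ", ".join(event)
        placeholders = ", ".join("?" for _ in event)
        conn.execute(
            f"INSERT INTO purchases ({columns}) VALUES ({placeholders})",
            tuple(event.values()),
        )

# The schema is decided before any data arrives (hypothetical table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user_id INTEGER, amount REAL)")
load_events(conn, [{"user_id": 1, "amount": 9.99}])

# A new data source adds a field the model did not anticipate.
new_source = [{"user_id": 2, "amount": 5.00, "location": "Berlin"}]
try:
    load_events(conn, new_source)
    rejected = False
except sqlite3.OperationalError:
    # SQLite refuses the insert because the column does not exist.
    rejected = True

# Only after the schema is altered can the new records be loaded.
conn.execute("ALTER TABLE purchases ADD COLUMN location TEXT")
load_events(conn, new_source)
rows = conn.execute(
    "SELECT user_id, amount, location FROM purchases ORDER BY user_id"
).fetchall()
```

In a real warehouse the equivalent of that one `ALTER TABLE` line would also ripple through the ETL jobs and BI reports built on the table, which is where the time and expense come from.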
Enter Hadoop, which lets you store data on a massive scale at low cost (compared with similarly scaled commercial databases). What's more, it easily handles variety, complexity and change because you don't have to conform all the data to a predefined schema.
That sounds great, but where do you find qualified people who know how to use Pig, Hive, Sqoop and other tools needed to run Hadoop? More importantly, how do you get fast answers out of a batch-oriented platform that depends on slow and iterative MapReduce data processing?
Will Hadoop supplant the enterprise data warehouse and relegate relational databases to data mart roles? Or is Hadoop far too green and too slow to change the way most people work? In our debate, Scott Gnau of Teradata and Ben Werther of Platfora square off. Share your opinion using the comment tool at the end of the article.
The proposition of the enterprise data warehouse seems tantalizing -- unifying all the data in your enterprise into one perfect database.
So you start an 18-month journey to find important data sources, agree on the important business questions, map the business processes, and architect and implement it into the one database to rule them all.
And when you are done, if you ever finish, you have a calcified relic of the world 18 months prior. If your world hasn't changed much in 18 months, then that might be OK. But that isn't the reality in any large business I've encountered.
Why is Hadoop gaining so much momentum? Clearly it's cost-effective and scalable, and it's intimately linked in people's minds to companies like Google, Yahoo and Facebook. But there's more to it. Everywhere I look, companies are generating more and more data -- interactions, logs, views, purchases, clicks, etc. These are being linked with increasing numbers of new and interesting datasets -- location data, purchased user demographics, Twitter sentiment, etc. The questions that these swirling datasets could one day support can't be known. And yet to build a data warehouse, I'm expected to perfectly predict what data will be important and how I'll want to question it, years in advance, or spend months rearchitecting every time I'm wrong. This is actually considered "best practice."
The brilliance of what Hadoop does differently is that it doesn't ask for any of these decisions up front. You can land raw data, in any format and at any size, in Hadoop with virtually no friction. You don't have to think twice about how you are going to use the data when you write it. No more throwing away data because of cost, friction or politics.
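The idea of deferring structure until the data is read can be sketched in a few lines. This is plain Python rather than any Hadoop API, and the sample lines, field names and helper functions are hypothetical. Raw records are kept exactly as they arrived, and each question applies only the structure it needs at read time.

```python
import json

# Hypothetical raw events, landed as-is; fields differ from record to record.
RAW_LINES = [
    '{"type": "click", "user": "a", "page": "/home"}',
    '{"type": "purchase", "user": "a", "amount": 20.0}',
    '{"type": "purchase", "user": "b", "amount": 5.5, "location": "Berlin"}',
    'not even json',
]

def read_records(lines):
    """Parse what can be parsed at read time; skip what cannot."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record

def total_by(records, key, value_field):
    """Sum value_field per key, using only records that carry both fields."""
    totals = {}
    for record in records:
        if key in record and value_field in record:
            group = record[key]
            totals[group] = totals.get(group, 0.0) + record[value_field]
    return totals

# Two different questions asked of the same untouched raw data.
spend_by_user = total_by(read_records(RAW_LINES), "user", "amount")
spend_by_location = total_by(read_records(RAW_LINES), "location", "amount")
```

Nothing had to be decided when the lines were written. The second question, spend by location, works even though only one record happens to carry a location. The trade-off is that every reader has to cope with missing fields and malformed lines on its own.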
And yet, in the view of the status-quo players, Hadoop is just another data source. It is a dumping ground, and from there you can pull chunks into their carefully architected data warehouses -- their "system of record." They'll even provide you a "connector" to make the medicine go down sweet. Sure, but then you are back in the land of consultants and 12-18 month IT projects.
But let's go through the looking glass. The database isn't the "system of record" -- it is just a shadow of the data in Hadoop. In fact, there is nothing more authentic than all of that raw data sitting in Hadoop. But the machinery has been missing to complete the story, namely a way to do interactive business intelligence, exploration and analysis against the data in Hadoop. Platfora is among the vendors working on this need.
Imagine what this means. Raw data of any kind or type lands in Hadoop with no friction. And without building a data warehouse, without the pain of ETL integration, and without any other IT project, everyday business users can put that data to work immediately. The machinery to support this is now appearing, and users' ability to harness data is undergoing a generational shift.
There is no longer a need for a traditional data warehouse. It is an inflexible, expensive relic of a bygone age. It is time to leave the dark ages.
Ben Werther is the Founder & CEO of Platfora, the company behind the first in-memory business intelligence platform for Hadoop. He is an industry veteran and big data thought leader and was head of products at Greenplum through the EMC acquisition.
Some people suggest that relational database management systems (RDBMS), and the data warehouses built on top of them, are no longer needed. In fact, some argue that new technologies like Hadoop can do the job of the data warehouse at a fraction of the time and cost -- and, by the way, Hadoop is "free."
We can't blame some for wanting to believe the argument.
Before hitting the arguments, let me say that Hadoop has an important part in the future analytics environment because it provides a big data refinery, which can bring in massive amounts of raw material (data) -- and more importantly the corresponding analytics. One of the great features of Hadoop is that you can pile information into it without deciding in advance what you need to save or how you intend to use it. As businesses require more precise analytics, Hadoop as a source of new fuel is critical.
The core argument really comes down to a couple of points: 1. Data Warehouses are too "rigid and inflexible," and 2. The "community" will fix all of the limitations of Hadoop.
On the surface, these points sound very compelling. But with a deeper look they are misleading and self-contradictory.
Starting with the point about inflexibility of data warehouses, it's important to distinguish the technology, RDBMS, from the practice, data warehousing. Rigid schemas attributed to EDWs -- where the users have to define what they are looking for before starting the search, and from which some of the misconceptions stem -- are often the result of rigid IT policy, and sometimes the result of dated or inadequate data warehouse architecture. Rigid structures are not an inherent problem in today's best data warehouse architectures that are designed for analytics.
Is structure bad in analytic environments? No! Imagine what would happen if you ran a public company and every quarter an analyst had to go through piles of un-modeled data, whether in Hadoop or otherwise, to come up with your financial quarterly results. The chance that something would go wrong in this process is too high to allow that uncertainty -- sometimes structure is really good to have!
So, do all these successful enterprises use structure and data models because it is the only way to go in an RDBMS or a Data Warehouse? Of course not. This is not about what a data warehouse can do; this is about what the business needs. Claiming that customers will stop requiring data quality and accurate data models across all their data infrastructure is misleading.
Let's move to the second question. Why would you need a data warehouse if Hadoop is going to support everything from SQL to BI in a year or two?
This claim ignores a simple fact: it took decades of work from some of the most brilliant computer scientists to build databases. Can Hadoop provide and implement the same functionality in a couple of years?
The answer is obviously no, and it would be a real shame to waste the community's efforts to rebuild existing functionality vs. inventing newer and more extraordinary use cases. And some of the early deliverables in the Hadoop world that purport to eliminate RDBMSs require schemas and have physical design constraints that go against the "flexibility" argument of Hadoop. What's more, these claims leave out the fact that Hadoop was originally not developed for BI or SQL execution. It's like using a hammer when you really want a screwdriver -- let's free Hadoop to be the great tool it was designed to be!
History teaches us that the impact of new technologies is overestimated in the short term and underestimated in the long run. Hadoop is not and will not become a data warehouse. RDBMSs and data warehouses will thrive, not die, because of Hadoop. We think Hadoop will be an integral part of future analytic data infrastructure solutions, but not the only part!
Scott Gnau is president of Teradata Labs, where he directs all research, development and sales support activities related to data warehousing, big data analytics, and associated solutions.