Greenplum's Announcement and the Future of Data Marts
Greenplum is announcing today a long-term vision, under the name Enterprise Data Cloud (EDC)... Basically it makes sense... but the EDC vision isn't quite as new or differentiated as Greenplum ideally would wish one to believe...
Greenplum is announcing today a long-term vision, under the name Enterprise Data Cloud (EDC). Key observations around the concept -- mixing mine and Greenplum's together -- include:
Data marts aren't just for performance (or price/performance). They also exist to give individual analysts or small teams control of their analytic destiny.
Thus, it would be really cool if business users could have their own analytic "sandboxes" -- virtual or physical analytic databases that they can manipulate without breaking anything else.
In any case, business users want to analyze data when they want to analyze it. It is often unwise to ask business users to postpone analysis until after an enterprise data model can be extended to fully incorporate the new data they want to look at.
Whether or not you agree with that, it's an empirical fact that enterprises have many legacy data marts (or even, especially due to M&A, multiple legacy data warehouses). Similarly, it's an empirical fact that many business users have the clout to order up new data marts as well.
Consolidating data marts onto one common technological platform has important benefits.
In essence, Greenplum is pitching this story:
Thesis: Enterprise Data Warehouses (EDWs)
Antithesis: Data Warehouse Appliances
Synthesis: Greenplum's Enterprise Data Cloud vision
When put that starkly, it's overstated, not least because Specialized Analytic DBMS != Data Warehouse Appliance.
But basically it makes sense, for two main reasons:
Analysis is performed on all sorts of novel data, from sources far beyond an enterprise's core transactions. This data neither has to fit nor particularly benefits from being tightly fitted into the core enterprise data model. Requiring it to do so is just an unnecessary and painful bureaucratic delay.
On the other hand, consolidation can be a good idea even when systems don't particularly interoperate. Data marts, which commonly do in part interoperate with central data stores, have all the more reason to be consolidated onto a central technology platform/stack.
Of course, the EDC vision isn't quite as new or differentiated as Greenplum ideally would wish one to believe.
To a first approximation, EDC sounds a lot like what eBay has already built on Teradata equipment.
Greenplum's EDC vision also sounds a lot like what Stuart Frost was talking about at DATAllegro, what Dell was planning to build on DATAllegro equipment, and what Stuart continues to talk about now that he's been acquired into Microsoft.
Something like EDC can also be presumed to be implicit in the strategies of the other one-size-fits-all vendors -- i.e., Oracle and IBM.
Greenplum has only implemented a little more of the EDC vision so far than have other firms, unless you give it credit for being cheap/fast/MPP/running on commodity hardware, but deny that credit to Teradata (specialized hardware, and not cheap in its most popular configurations), Oracle (ditto for Exadata), IBM (also not cheap), or Microsoft/DATAllegro (not released yet). Specifically:
In Greenplum Release 3.3, which is being announced today, Greenplum is introducing the (enhanced?) ability for data marts to be spun out as a background operation, while the database otherwise remains functional.
As of 3.3, spinning out a data mart is a command-line operation. But in Release 3.4, Greenplum plans to offer a web-based interface for same, at which point the "self-service data mart creation" discussion will become operative.
Otherwise, EDC is a roadmap/vision/statement-of-direction much more than it is a fully-baked technical project.
One particular source of potential confusion is Greenplum's emphasis on the buzzphrase self-service (data mart). This seems to be a conflation of two related concepts:
End users should be able to create new data marts themselves. Strictly speaking, I view this ability as useless at most enterprises, and important at very few, because of logistical issues. (Who gives the permissions? Who decides which hardware is used?) That said, useless "end user" tools often wind up being important productivity aids for IT professionals, and this kind of "self-service" would surely be another example. Edit: Hmm. Doug Henschen inspired me to think that over again, and I'm beginning to soften. Suppose users could order up the data mart they want, perhaps test it at a very low processing priority (if they choose), and then send the completed request to IT for approval and provisioning. That would have some value.
End users should be able to manage data marts themselves, once created. That's a great idea, full of agility and don't-make-IT-a-roadblock goodness. Data miners and similar analytic professionals commonly have the technical ability to manage a simple database, and should be allowed to do so if it's ensured that they don't break anything for anybody else.
One thing that's needed for this technology to come to full fruition is sophisticated data movement and synchronization. Ideally, some tables in a data mart could be virtual -- views against a central database. But others would be physically recopied from the center, with all the ETL / ELT / ETLT / replication issues that entails. Meanwhile, it's not obvious that the ideal architecture is a simpleminded hub-spoke -- perhaps one should be able to spin data marts out of other marts, perhaps at least somewhat reducing the proliferation of tables and the recopying of data. And it should be easy for administrators to change deployment strategies, e.g., by starting a table out as a view and changing over to making it a physical copy as usage profiles change.
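To make the view-versus-copy point concrete, here is a minimal sketch of switching a mart table's deployment strategy. It uses SQLite through Python's standard sqlite3 module purely as a stand-in; nothing in it is Greenplum-specific, and the table names (central_sales, mart_east_sales) and the deploy helper are hypothetical names invented for illustration.

```python
import sqlite3


def object_kind(conn, name):
    """Return 'table', 'view', or None for a name in the SQLite catalog."""
    row = conn.execute(
        "SELECT type FROM sqlite_master "
        "WHERE name = ? AND type IN ('table', 'view')",
        (name,),
    ).fetchone()
    return row[0] if row else None


def deploy(conn, mart_table, source_query, physical):
    """(Re)deploy a mart table as a view or as a physical copy.

    Hypothetical helper: identifiers are interpolated directly, so it is
    only suitable for trusted, administrator-supplied names.
    """
    kind = object_kind(conn, mart_table)
    if kind is not None:
        # SQLite insists on DROP VIEW for views and DROP TABLE for tables.
        conn.execute(f"DROP {kind.upper()} {mart_table}")
    create = "TABLE" if physical else "VIEW"
    conn.execute(f"CREATE {create} {mart_table} AS {source_query}")


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE central_sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO central_sales VALUES (?, ?)",
        [("east", 100.0), ("west", 250.0)],
    )
    query = "SELECT region, amount FROM central_sales WHERE region = 'east'"
    total = "SELECT COALESCE(SUM(amount), 0) FROM mart_east_sales"

    # Start virtual: the mart always reflects the central table.
    deploy(conn, "mart_east_sales", query, physical=False)
    conn.execute("INSERT INTO central_sales VALUES ('east', 50.0)")
    as_view = conn.execute(total).fetchone()[0]

    # Switch to a physical copy: later central changes are not seen.
    deploy(conn, "mart_east_sales", query, physical=True)
    conn.execute("INSERT INTO central_sales VALUES ('east', 25.0)")
    as_stale_copy = conn.execute(total).fetchone()[0]

    # Recopying from the center brings the mart up to date again.
    deploy(conn, "mart_east_sales", query, physical=True)
    as_refreshed_copy = conn.execute(total).fetchone()[0]

    kind = object_kind(conn, "mart_east_sales")
    conn.close()
    return as_view, as_stale_copy, as_refreshed_copy, kind


if __name__ == "__main__":
    print(demo())
```

The stale-copy step is where the ETL / replication issues show up: once a mart table is physical, something has to decide when and how it gets recopied from the center.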
Oliver Ratzesberger of eBay also argues that workload management -- not a current Greenplum strength -- can be crucial. For example, if the CEO wants the CFO to get her an answer TODAY, the fastest approach may be to create an entirely virtual data mart, with very favorable SLAs (Service Level Agreements). More generally, if you're setting up dozens of marts that contain views of the central database, sophisticated SLA management can be essential. There's a big virtualization opportunity here -- but virtualization requires a lot of system management infrastructure.
Related links
My recent post on reinventing business intelligence
Greenplum adviser Joe Hellerstein's pitch for agile data warehousing
Charlie Bachman's "private database" idea, which never went anywhere (pp. 138-139)
Greenplum's EDC and Release 3.3 press releases
An interview with some of Greenplum co-founder Scott Yara's own words