Greenplum's Announcement and the Future of Data Marts - InformationWeek

6/8/2009
09:20 AM
Curt Monash

Greenplum's Announcement and the Future of Data Marts

Greenplum is announcing today a long-term vision, under the name Enterprise Data Cloud (EDC). Key observations around the concept -- mixing mine and Greenplum's together -- include:

  • Data marts aren't just for performance (or price/performance). They also exist to give individual analysts or small teams control of their analytic destiny.
  • Thus, it would be really cool if business users could have their own analytic "sandboxes" -- virtual or physical analytic databases that they can manipulate without breaking anything else.
  • In any case, business users want to analyze data when they want to analyze it. It is often unwise to ask business users to postpone analysis until after an enterprise data model can be extended to fully incorporate the new data they want to look at.
  • Whether or not you agree with that, it's an empirical fact that enterprises have many legacy data marts (or even, especially due to M&A, multiple legacy data warehouses). Similarly, it's an empirical fact that many business users have the clout to order up new data marts as well.
  • Consolidating data marts onto one common technological platform has important benefits.

In essence, Greenplum is pitching this story:

  • Thesis: Enterprise Data Warehouses (EDWs)
  • Antithesis: Data Warehouse Appliances
  • Synthesis: Greenplum's Enterprise Data Cloud vision

When put that starkly, it's overstated, not least because:

Specialized Analytic DBMS != Data Warehouse Appliance

But basically it makes sense, for two main reasons:

  • Analysis is performed on all sorts of novel data, from sources far beyond an enterprise's core transactions. This data neither has to fit nor particularly benefits from being tightly fitted into the core enterprise data model. Requiring it to do so is just an unnecessary and painful bureaucratic delay.
  • On the other hand, consolidation can be a good idea even when systems don't particularly interoperate. Data marts, which commonly do interoperate at least in part with central data stores, have all the more reason to be consolidated onto a central technology platform/stack.

Of course, the EDC vision isn't quite as new or differentiated as Greenplum ideally would wish one to believe.

  • To a first approximation, EDC sounds a lot like what eBay has already built on Teradata equipment.
  • Greenplum's EDC vision also sounds a lot like what Stuart Frost was talking about at DATAllegro, what Dell was planning to build on DATAllegro equipment, and what Stuart continues to talk about now that he's been acquired into Microsoft.
  • Something like EDC can also be presumed to be implicit in the strategies of the other one-size-fits-all vendors -- i.e., Oracle and IBM.
  • So far, Greenplum has implemented only a little more of the EDC vision than other firms have, unless you give it credit for being cheap/fast/MPP/running on commodity hardware, but deny that credit to Teradata (specialized hardware, and not cheap in its most popular configurations), Oracle (ditto for Exadata), IBM (also not cheap), or Microsoft/DATAllegro (not released yet).
  • Specifically: In Greenplum Release 3.3, which is being announced today, Greenplum is introducing the (enhanced?) ability for data marts to be spun out as a background operation, while the database otherwise remains functional. As of 3.3, spinning out a data mart is a command-line operation. But in Release 3.4, Greenplum plans to offer a web-based interface for the same operation, at which point the "self-service data mart creation" discussion will become operative. Otherwise, EDC is a roadmap/vision/statement-of-direction much more than it is a fully-baked technical project.

One particular source of potential confusion is Greenplum's emphasis on the buzzphrase self-service (data mart). This seems to be a conflation of two related concepts:

  • End users should be able to create new data marts themselves. Strictly speaking, I view this ability as useless at most enterprises, and important at very few, because of logistical issues. (Who gives the permissions? Who decides which hardware is used?) That said, useless "end user" tools often wind up being important productivity aids for IT professionals, and this kind of "self-service" would surely be another example. Edit: Hmm. Doug Henschen inspired me to think that over again, and I'm beginning to soften. Suppose users could order up the data mart they want, perhaps test it at a very low processing priority (if they choose), and then send the completed request to IT for approval and provisioning. That would have some value.
  • End users should be able to manage data marts themselves, once created. That's a great idea, full of agility and don't-make-IT-a-roadblock goodness. Data miners and similar analytic professionals commonly have the technical ability to manage a simple database, and should be allowed to do so if it's ensured that they don't break anything for anybody else.

One thing that's needed for this technology to come to full fruition is sophisticated data movement and synchronization. Ideally, some tables in a data mart could be virtual -- views against a central database. But others would be physically recopied from the center, with all the ETL / ELT / ETLT / replication issues that entails. Meanwhile, it's not obvious that the ideal architecture is a simpleminded hub-spoke -- perhaps one should be able to spin data marts out of other marts, perhaps at least somewhat reducing the proliferation of tables and the recopying of data. And it should be easy for administrators to change deployment strategies, e.g., by starting a table out as a view and changing over to making it a physical copy as usage profiles change.
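To make the view-versus-copy idea concrete, here is a minimal sketch of a mart table that starts out as a view against a central table and is later switched to a physical copy under the same name. It uses Python's built-in sqlite3 module purely as a stand-in; it is not Greenplum syntax or tooling, and the table names (central_sales, mart_east) and helper functions are hypothetical.

```python
import sqlite3

def create_mart_view(conn, name, source, where):
    # Virtual mart table: just a filtered view over the central table.
    # (Hypothetical helper; identifiers are trusted, not user input.)
    conn.execute(f"CREATE VIEW {name} AS SELECT * FROM {source} WHERE {where}")

def materialize(conn, name):
    # Change deployment strategy: replace the view with a physical copy
    # that keeps the same name, so queries against the mart are unchanged.
    tmp = f"{name}__copy"
    conn.execute(f"CREATE TABLE {tmp} AS SELECT * FROM {name}")
    conn.execute(f"DROP VIEW {name}")
    conn.execute(f"ALTER TABLE {tmp} RENAME TO {name}")
    conn.commit()

def object_kind(conn, name):
    # Returns 'view' or 'table' for an existing object, else None.
    row = conn.execute(
        "SELECT type FROM sqlite_master WHERE name = ?", (name,)
    ).fetchone()
    return row[0] if row else None

def row_count(conn, name):
    return conn.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0]

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE central_sales (region TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO central_sales VALUES (?, ?)",
        [("east", 10), ("west", 20), ("east", 5)],
    )
    create_mart_view(conn, "mart_east", "central_sales", "region = 'east'")
    print(object_kind(conn, "mart_east"), row_count(conn, "mart_east"))
    materialize(conn, "mart_east")
    print(object_kind(conn, "mart_east"), row_count(conn, "mart_east"))
```

Note the trade-off the sketch exposes: while mart_east is a view, new rows in central_sales show up immediately; once it is a physical copy, it goes stale until something recopies or replicates the data, which is exactly where the ETL / ELT / replication issues come in.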

Oliver Ratzesberger of eBay also argues that workload management -- not a current Greenplum strength -- can be crucial. For example, if the CEO wants the CFO to get her an answer TODAY, the fastest approach may be to create an entirely virtual data mart, with very favorable SLAs (Service Level Agreements). More generally, if you're setting up dozens of marts that contain views of the central database, sophisticated SLA management can be essential. There's a big virtualization opportunity here -- but virtualization requires a lot of system management infrastructure.
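As a rough illustration of what SLA-driven workload management means, the sketch below orders pending queries from several marts by deadline, so the request with the tightest SLA runs first. This earliest-deadline-first policy is my own simplifying assumption, not a description of how eBay, Teradata, or Greenplum actually schedule work, and the class and field names are hypothetical.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Query:
    deadline_s: float                # absolute deadline; the sort key
    seq: int                         # tie-breaker: submission order
    mart: str = field(compare=False)
    sql: str = field(compare=False)

class SlaQueue:
    """Earliest-deadline-first queue of pending queries (illustrative only)."""

    def __init__(self):
        self._heap = []
        self._seq = 0

    def submit(self, mart, sql, sla_seconds, now=0.0):
        # A tighter SLA gives an earlier deadline and therefore higher priority.
        query = Query(now + sla_seconds, self._seq, mart, sql)
        self._seq += 1
        heapq.heappush(self._heap, query)
        return query

    def next(self):
        # Pop the query whose deadline comes soonest, or None if idle.
        return heapq.heappop(self._heap) if self._heap else None

if __name__ == "__main__":
    queue = SlaQueue()
    queue.submit("marketing_mart", "SELECT ...", sla_seconds=3600)
    queue.submit("cfo_mart", "SELECT ...", sla_seconds=60)
    queue.submit("research_mart", "SELECT ...", sla_seconds=86400)
    while (q := queue.next()) is not None:
        print(q.mart, q.deadline_s)
```

A real workload manager would also have to cap concurrency, throttle CPU and I/O per mart, and preempt long-running work, which is part of why this kind of virtualization needs so much system management infrastructure.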
