Breakthrough Analysis: A Data Space for Information Coexistence
Rather than trying to force conformity, a "data space" might allow disparate information to co-exist.
Sixty years of computing have created the tantalizing prospect of "360-degree views" and "total information awareness." These notions of knowing everything about a subject, regardless of information source or form, are compelling. They seek nothing short of complete predictability through comprehensive knowledge--a consolidated store of information available to any organization that can pay the freight. But you'll never get to this Memory Alpha (for you Trekkies) if the data you need is dispersed across operational databases and data warehouses, dozens of spreadsheets and thousands (or more) of documents on PDAs, desktop machines, servers and the Web. That's where the diversity-embracing "data space" abstraction comes into play.
Three computer scientists, writing in the December 2005 SIGMOD Record, proposed the term data space to describe the collection of disparate information that represents and is used by individuals and organizations. Although the concept isn't new--I've been tracking use of the term by another computer scientist, Professor Robert Grossman, for several years--the authors nonetheless offer a far-reaching "new agenda for data management," one that provides a framework for a view of both queryable, structured databases and searchable, "unstructured" documents that is unified by the highest possible degree of semantic and administrative integration.
But where Professor Grossman defines data space as the contents of the nodes of a "data web" assembled for distributed data mining, the SIGMOD Record authors, Michael Franklin, Alon Halevy and David Maier, go further in proposing the development of generalized Data Space Support Platforms. DSSPs insulate the user from the challenges otherwise inherent in accessing diversely formatted, described and managed data available via disparate but interrelated services. Where Grossman's work is practical, focused on crafting high-bandwidth protocols that operate over grids and conventional networks, with the specific goal of enabling distributed data mining, Franklin, Halevy and Maier aim at something more abstract and general.
Their analysis acknowledges that much of the information we use is outside our administrative control. It's in someone else's database or files. It's described by someone else's metadata schema (or none at all) and therefore possesses a low level of semantic integration (or common definitions) with other information that interests us. These are the conditions that launched Google and the other search giants. It's hard to find documents and even harder to find meaning, whether on your desktop or on the Internet. Per Franklin, Halevy and Maier, we should move toward data coexistence rather than enforced conformity.
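To make the coexistence idea concrete, here is a minimal sketch in Python. It is not taken from the SIGMOD Record paper; every class and function name (SqlSource, DocumentSource, DataSpace, build_demo) is hypothetical, and the sample data is invented for illustration. It assumes only that each source can answer a keyword search in its own way, so a database table and a folder of notes can sit side by side without sharing a schema.

```python
import sqlite3


class SqlSource:
    """A structured participant: one table in a SQLite database (hypothetical)."""

    def __init__(self, name, conn, table):
        self.name, self.conn, self.table = name, conn, table

    def search(self, keyword):
        # Scan every row and match the keyword against any column value.
        # The table name is trusted here; this is a sketch, not production code.
        cur = self.conn.execute(f"SELECT * FROM {self.table}")
        cols = [d[0] for d in cur.description]
        needle = keyword.lower()
        for row in cur:
            if any(needle in str(value).lower() for value in row):
                yield dict(zip(cols, row))


class DocumentSource:
    """An unstructured participant: plain-text documents keyed by title."""

    def __init__(self, name, docs):
        self.name, self.docs = name, docs

    def search(self, keyword):
        needle = keyword.lower()
        for title, text in self.docs.items():
            if needle in text.lower():
                yield {"title": title}


class DataSpace:
    """Holds sources as they are; no common schema is imposed on them."""

    def __init__(self):
        self.sources = []

    def add(self, source):
        self.sources.append(source)

    def search(self, keyword):
        # Each hit is tagged with the name of the source it came from.
        return [(s.name, hit) for s in self.sources for hit in s.search(keyword)]


def build_demo():
    """Assemble a toy data space from invented sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(1, "Acme Corp", "Boston"), (2, "Globex", "Chicago")],
    )
    space = DataSpace()
    space.add(SqlSource("crm", conn, "customers"))
    space.add(DocumentSource("notes", {
        "call-log.txt": "Spoke with Acme about renewal.",
        "todo.txt": "Buy coffee.",
    }))
    return space


if __name__ == "__main__":
    for source, hit in build_demo().search("acme"):
        print(source, hit)
```

A single search for "acme" returns one hit from the customer table and one from the notes, each labeled with its origin. The sketch offers only the lowest common denominator, keyword search; a fuller platform would presumably layer richer queries on top as sources become better described.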
If we can't pull all the information we need into our own, semantically uniform databases, how about pushing capabilities out to the dizzying array of devices that make up evolving data webs? Why not put everything into databases or under DBMS control? Add in support for complex processing (such as data mining) and workflow management so you can use your DBMS to orchestrate distributed computing processes. This is the current version of the dream that object-relational databases, the supposed next great wave in database technology in the mid-'90s, were designed to enable.
Writing in Queue magazine last year, database and transaction-processing pioneer Jim Gray of Microsoft Research asserted that DBMSs can do it all. They can host diverse data types as well as abstractions such as data cubes. They can optimize complex queries, make sense of workflows, process semistructured and streaming data, and embed specialized code written in portable programming languages. With further advances in handling inexact, approximate reasoning and in structuring databases to offer Web services, we'll be able to distribute robust, DBMS-centered operating platforms to servers, desktops and other devices to integrate data stores and mediate access.
My take is that a network of interconnected database environments would make an ideal data web, but one that will never be close to completely realized. The programmer in me says that practical, task-oriented approaches like Grossman's are the way to get stuff done. Regardless of how it's realized, the data-space concept provides an excellent framework for work toward robust knowledge networks.
Seth Grimes is a principal of Alta Plana Corp., a Washington, D.C.-based consultancy specializing in large-scale analytic computing systems.