From ISAM to XML, data has proven to be the power that runs businesses. Back in August 2001, Pervasive Software's Ken White gave us a roundup on the state of data storage and management, and his predictions for where we were headed. Let's see how prescient his thoughts were....
DBMS Past, Present, and Future
Where databases have been, where they are going, and what it means to you
by Ken White
In the late 1800s, Thomas Edison and George Westinghouse became embroiled in what has become known as "The War of the Currents." Edison had invested heavily in infrastructure, supporting the use of direct current for the distribution of electricity. Westinghouse, having bought patents on the inventions of Tesla, advocated alternating current. For almost a decade, early adopters of electricity were forced to endure a spectacle that included, among other things, public electrocutions of large animals (performed by Edison to show the danger of alternating current; he tried to popularize the slang "Westinghoused" as a synonym for "electrocuted"). For a short period of time, there were two sets of infrastructure that operated under different assumptions about how power should be transported and consumed. Fortunately, the technology was young and the infrastructure was immature, so the cost of competing standards was relatively low. Even so, consumers of electricity were forced to make an expensive choice, balancing the investment in existing infrastructure against superior technology.
Fast forward about 100 years to Christmas Day, 1981. Under the tree waiting for me was a new TI 99/4A home computer. With 16 glorious kilobytes of memory and a blazing 3-MHz processor (not to mention a built-in Basic interpreter), the 99/4A was a future techie's dream come true. Like most computing machinery, the components inside the 99/4A required direct current. Since my house was wired for alternating current at the time (at 60 Hz, no less), this resulted in a mismatch of assumptions. The good news was that through amazing foresight, the engineers at Texas Instruments had included with the computer a transformer, which was capable of adapting the power from the outlet for use with the 99/4A. A year or so later, TI discovered the possibility that if the transformer malfunctioned, a user touching the metal case of a 99/4A could be Westinghoused. You can only imagine Edison screaming, "I told you so!" as TI spent the cost of distributing replacement transformers.
Data is the power that runs a modern business. When consumers of that power have different assumptions, a transformation is required. Very bad things (including the loss of infrastructure investment) can happen if that transformation is not carefully planned. Currently, there are at least three major data management paradigms - ISAM, SQL/Relational, and XML - in use, with XML poised to explode. Each has different assumptions regarding how data models the organization's view of the real world. When different models try to operate on the same source of power, they must reconcile these differences. The process of reconciling these assumptions can result in data loss, performance degradation, system fragility, or feature unavailability. The ideal solution is for every consumer to enjoy native and natural access to the power source, without one model compromising another.
Database Management Systems
Data management can be thought of as having two principal ingredients - a data model and a system for implementing that model. The data model is responsible for describing the principles by which the data will represent the outside world. This model is implemented through a database management system (DBMS). Over time, DBMS has come to mean a software package that handles database management requirements.
Because data is the power for business, DBMSs (and their associated processes) tend to evolve more slowly than other, less mission-critical segments of computing infrastructure. In fact, database management models have remained relatively unchanged through the emergence and explosion of the Internet. It usually requires a significant shift in business practice to engender any change in database management. Distributed computing (such as web services) appears to be one such shift; it is growing rapidly and many major vendors expect distributed computing to be the next major paradigm in application development. XML is establishing the system for distributed data management, and this system does not neatly fit with the assumptions made by existing infrastructure.
Much as there are some fundamental electromagnetic properties that are the same across different types of current, there are some fundamental data management issues that are addressed by all data models. Abstracting these common issues is the first step to understanding the costs of data model differences, as well as possible remedies. The next step is to determine the assumptions and rules each model places around the basic abstractions; those assumptions and rules are the things that don't always match. The points at which the assumptions differ not only help to describe the problem, but also serve as the first step for understanding and evaluating available solutions. Finally, the mismatched assumptions and the shortcomings of the current solutions serve to help define what a future solution might look like.
Database Concepts of Concern
The first assumptions involve three relatively basic (and familiar) database concepts: entities, attributes, and relationships. (Although some consider the Entity Relationship model to be a data model, I don't treat it as such here. For one thing, there is no significant implementation of E-R as a model separate from the other models I discuss. The value of E-R for purposes here is that it provides useful terminology for abstracting the common elements of the models that will be discussed.) In addition, each model addresses collections, operational efficiency, and the relationship of these concepts to the rest of the computing environment. There are other issues involved with data management (concurrency, operation atomicity, relational integrity, and so on), but these issues are not necessarily differentiating factors between models.
The first basic concept is that of entity - the thing that is being stored and is representative of something in the external world, such as a customer, invoice, or inventory item. It may be thought of as the most granular representation of data that retains context. The question of what exactly comprises an entity is more frequently resolved through data/business analysis than through application of the normalization formulas found in database textbooks.
The second concept is that of attribute - a descriptor of an entity. If you have a customer entity, you can use attributes to keep track of important descriptors, such as a person's favorite games or whether he collects butterflies. Depending on your particular prejudices, you may think of attributes as fields or columns. Attributes rely on the entity for context. Is the attribute "Zip Code = 11111" meaningful by itself? Does it refer to a vendor or a customer? Which vendor or customer?
The third concept is that of relationships. A customer entity and three order entities are useless in a business process unless you have some way of making their relationship persistent; that is, some way of denoting that the order entities were made by the person represented by the customer entity. In relational theory, this can be represented by foreign key relationships.
Obviously, a database would not be of much use if it allowed you to have one customer entity, one order entity, and one inventory entity. It would be almost as useless if it did not let you access the collection of customer entities independently from the collection of order entities. Inherent in a database paradigm is the grouping of each of these three concepts: Who are all of my customers? What are the descriptors that I've stored for a particular customer? What other entities are important in relation to this customer and how do I access them?
Although a flat text file could act as the basis of these concepts, the practical requirement is that specific information can be quickly and efficiently retrieved. It is critical that the performance does not degrade as more entities are added. This performance requirement is most often met through the use of indexes or keys.
Finally, there is the relationship between the database system and the rest of the computing environment - in particular, the operating system and application. Early on, the database paradigm was represented by a set of procedures and coding standards that dictated how a particular shop's application code interacted with operating-system code. The evolution of database systems has seen at least one consistent trend: the abstraction of the database paradigm from application and operating-system constraints and the encapsulation of that abstraction within a database infrastructure. In other words, relational database application developers typically no longer worry about the offset and length of a particular attribute or the particular OS file that contains the attribute data. Those issues are abstracted within a database infrastructure that generally is viewed as, if not a black box, then a really, really dirty one with very tiny windows. While this trend is common throughout computing and may seem self evident, it has implications for the future state of data management.
Stages of Database Evolution
Database technology has evolved through several stages, including ISAM, SQL/Relational, and XML:
ISAM. Although ISAM has not been formally standardized as a data model, thanks to the dominance of Cobol and the effect of that dominance on database management, there is a common set of well-understood expectations for an ISAM DBMS. In the ISAM paradigm, entities are records. Attributes are understood to be data stored starting at a specific offset for a specific length. The application is responsible for maintaining relationships, usually performed in much the same way as the relational model, where entities are collected in OS files, and the application (and thus the developer) is responsible for knowing which set of records is in which file. The application can include multiple types of records in one file, but any differences in entity type within a file must be implemented, understood, and maintained by the application. The DBMS does not understand any distinction between different entity types within the same entity collection.
Efficiency is achieved through the use of indexes. Since the DBMS is responsible for maintaining index information and the DBMS does not make any distinction between entity types within an entity collection, an ISAM file indexes the same attributes for an entire collection. This can result in added responsibility for the application if multiple entity types are in the same collection. Furthermore, since the DBMS is unaware of any nonindexed attributes of the entity, the same entity can be viewed as having several different compositions, and there is no guarantee that the attribute indexed by the DBMS is an attribute that is meaningful to the application. (For instance, an index can contain pieces of several attributes, which then will act as a primary key, albeit one that has no true meaning in the real world.)
An ISAM application acts as if it is operating on the physical representation of the record (which it is, in most implementations). As a result, much of the database management of the ISAM paradigm is closely tied to both the operating system and the application.
SQL/Relational. From a theoretical standpoint, the SQL paradigm and relational model are not synonymous. In fact, SQL can be used to build result sets that do not meet relational requirements. However, the average computing professional is not interested in purely theoretical DBMSs, and when most people use a relational database, they are almost invariably using SQL to manipulate the data (whether directly or under the covers, as is often the case with ADO).
In the practical relational view, entities are represented by rows. Collections of entities are represented as tables (or R-tables, or relations, if you want to be exact). Unlike the ISAM practice, the application has no control, knowledge, or concern with the interaction between entity collections and the operating system. Attributes, represented by columns, are understood by the system, and all entities (rows) within the same collection (table) must share the same attribute composition (columns). The application has no knowledge about the physical structure of attributes within the storage of the entity. All of the application's knowledge of entity composition must conform to the DBMS's view of the data. Collections of entities and attributes may be arbitrarily defined at run time through SQL predicates (as well as through views). Relationships are persisted in much the same way as the ISAM model (the constraints, such as primary key uniqueness, are formalized in the relational model, but the ISAM model is similar in practice).
Indexes are also present in the relational paradigm, and apply to the entirety of the system-defined entity collection (not necessarily to derived relations). The system's understanding of indexes is coupled with the system's understanding of attributes, so unlike ISAM, an index is always composed of attributes that are meaningful to the application.
To summarize, the relational model abstracts the database from the operating system and to some extent from the application. There is no longer an exploitable interaction between the operating system and the DBMS. Furthermore, while the application may have foreknowledge of the database composition, it is incapable of using that knowledge in a manner that is not understood beforehand by the DBMS. An application can also be written that derives all of its information about the database at run time, which is certainly not the case with the ISAM paradigm. This abstraction frees the application and the database administrator from a number of concerns regarding the internals of data management, but it also demands that the application conform to the expectations of the model.
XML. XML is not a database paradigm. It is a portable persistence mechanism. However, in a distributed computing environment, XML fragments may be viewed as small, disconnected databases because the integration task required at the receiving end is equivalent to integrating data from two different database paradigms. Each of these fragments (or minidatabases) is self-containing and conforms to a set of rules. Additional rules have been added by the introduction of XML schema, but these are not necessary to the existence of the XML fragment - it can be understood by an XML parser independently of the additional rules. The additional rules are used primarily to ease the impedance mismatch that occurs when a system of loosely structured data is incorporated into a system of more highly structured data.
From a data perspective, entities are represented as elements within XML, and attributes (surprisingly enough) as element attributes. Relationships are perhaps even more important in XML than in the other models. While there are additional methods to describe relationships, any given entity represented by XML is enlisted in one primary relationship - its place in the hierarchy. An entity cannot be represented in a document without this relationship. Collections of entities are generally worked on as if they were files. Collections of collections of entities (that is, a database) are not defined by XML proper, as standard XML makes the assumption that at any given time there is only one collection of immediate concern. In a striking departure from both ISAM and relational paradigms, entity composition may differ across instances of identically typed entities; that is, two customer descriptions within the same document may look completely different and still be equally meaningful. The compositions and types of the entities that make up a particular collection are understood by the system.
This model is tied less tightly to the application than other models in that a consumer of the data does not need to rely on either application-specific knowledge or a DBMS to understand the entities, attributes, and primary relationships that are represented by XML data.
As Table 1 shows, ISAM, SQL/Relational, and XML each make different assumptions. Thus, if an application works on a collection of entities, it must apply a model to that collection and operate in conformance with that model. Incorporating a different model into an existing infrastructure requires that the differences in these assumptions be addressed.
Working with Data Across Paradigms
It is not uncommon for businesses to have both ISAM and SQL applications running side by side; the expected adoption of distributed computing requires that data can be used across all three paradigms as seamlessly as possible. Much like the AC-DC conversion that takes place inside a transformer, the differences in assumptions often result in a series of transformations that adapt data from one paradigm to another. There are already several available mechanisms for accomplishing this task, although no single method is without drawbacks. Much of the data architecture in Microsoft's .NET initiative is based on the expectation that distributed computing must be integrated with existing data centers. It is essential to understand the trade-offs that are inherent in each approach.
- The pipeline approach (aka mapping). Currently, the most widespread solution is a pipeline approach of mapping and transformation. This works well when the destination has fewer constraints than the source (as is the case with most ISAMSQL pipeline approaches), as it is relatively easy to add constraints and assumptions to a system that is less constrained. The process becomes more difficult when building a less constrained system on top of a more constrained system, as is the case with the ISAMSQLXML pipeline.
There are various implementations of this, but the basic idea behind this approach is as follows: You and I decide that our businesses need to share data and that XML is the medium of choice. I expose operations that you can invoke over the Web to retrieve an XML document, and before you retrieve it, I have to be able to describe to you what I'm going to return and what it means. Before you invoke my operation, you have to determine how you are going to integrate my data into your database in a meaningful way. (The odds against us having a similarly structured database are roughly the same as Gates and Ellison joining forces to support the open-source movement.) You then have to be able to map my XML output to conform to your expectations. If I put customers' names together and you store the first and last name separately (for instance), then you have to be able to break a name apart before committing it to your primary data store.
- Once our initial contract is made and kept, you have additional problems to consider. How do you maintain the integrity of the XML transaction for audit-trail purposes? This means not only understanding hierarchical relationships in your central data store, but also maintaining the exact ordering within the hierarchy. The biggest problem, though, is what happens when my lead developer leaves and I have to hire another one, and she prefers a different schema? What do you do with data you don't understand? What do you do with data that doesn't gracefully fit into your database? As the system scales outward, the likelihood of problems increases.
- Although each of these problems can be ameliorated somewhat by individual fixes, there are several problems that are inherent with this solution. First, since the data is massaged into the more constraining system before it is stored, the transformation is either lossy or generates a great deal of secondary data that must be understood and maintained by the system (and anticipated by the system administrator). Furthermore, if the data does not conform to the expectations of the relational model, it cannot be stored at all. This is the basis of the emphasis placed on XML validation. (Validation doesn't really solve the problem, but allows the application to spot problems earlier than it would otherwise.) Finally, there are potentially numerous transformations that must occur, and in a true pipeline approach, each one of these must be treated independently. This requires computing resources and results in accumulative latency times.
- The .NET framework is an example of the pipeline approach. The architecture proposed by Microsoft has a SQL database (which generally performs pipelined ISAM atomics) returning query results as a disconnected, in-memory database. This database can then be transformed as needed or viewed as XML or as a recordset. The power in this is the flexibility for developers; the weaknesses involve the performance and concurrency issues of the multilayered disconnected approach, the mapping that must be performed beforehand, and the inability to adapt quickly to changes without losing data.
- Document storage. Another way to try to solve the impedance matching problem is the document-centric approach of storing an XML document as a BLOB within a database. This means that as a document comes in, it gets stored as a document within an identifying record. One advantage to this is that it maintains the integrity of the document. This system can also be enhanced so that indexing (or even full-text search capabilities) can be applied to the document, so that it can be rapidly retrieved in part or in whole. One problem with this approach is that because the storage mechanism stays true to the document model of XML access, it incurs all the nonDBMS-friendly problems that exist with current XML practices. For example, concurrency issues get much more complex because as far as the DBMS is concerned, the XML work is being done on the most granular piece of data it is capable of storing (an attribute of an entity). How does a system effectively and efficiently protect part of an attribute? It cannot be done without changing the way the system views an attribute. In other words, this would require a redefinition of a fundamental system implementation to overcome pragmatic design shortcomings (a practice known as "kludging"). Furthermore, since the document is persisted as a document, making changes to one section of the document may force a rewrite of the entire document, which will invoke some of the same performance penalties as writing the document out to the file system through OS calls.
- Native storage. One of the methods of solving these problems that has received a great deal of attention in the industry is that of true XML storage. Typically, an XML document is added to the database and the DBMS is responsible for maintaining document integrity, relationships, and document granularity. Of the solutions, this is the most natural way to deal with XML because the data storage mechanism is written to address the assumptions and constraints of the XML paradigm directly. In a sense, this may be considered the opposite of the pipeline approach. Whereas the pipeline approach attempts to force XML data to conform to the relational paradigm, the native storage approach uses the XML paradigm as the reference database storage paradigm.
- Usually, native storage systems will directly address all of the problems mentioned in the other mechanisms. The problems with this method lie in the other direction: How do you still use the business-critical data you have that is currently stored by the ISAM or relational paradigm? Can you afford to convert your infrastructure? This approach assumes that the XML paradigm is your reference model. If you decide to convert all of your data, you will likely have problems with the expectations of your legacy application. Remember, the three paradigms have several areas in which they have significantly different expectations of the application.
- Most XML databases have connectors between the XML data store and a relational data store, whereby data may be accessed from the relational store. However, this solution means that there are two databases - one for XML data and another for relational data. Like the pipeline approach, this requirement results in mapping strategies (which by their very nature are resistant to change). It may be difficult for tools to query across the two databases and at the least will invoke performance penalties in addition to map maintenance. Additionally, since incoming business transactions (such as invoices) will presumably be coming in through XML, there are integration issues that cannot be avoided; at some point, some of your existing applications will need to view the XML data relationally, and that data must act as part of the same database as your legacy relational data.
- Architectural issues. With all the solutions currently being offered, how do you know which is right for you? Much of the decision involves your need to communicate with legacy databases. If you know that your need to interact with XML data is isolated from your need to work with your legacy relational data, then an XML database solution most directly addresses the paradigm in which you plan to work. The most common cases, though, involve integration of XML data with legacy databases. Inertia would suggest that the majority of adopted solutions are going to be based on the current solution offered by the vendor of the existing legacy store.
- The problem with this is that some of the existing solutions, such as the pipeline approach, will experience scalability problems that may not be apparent upon initial deployment. While it may be easy to handle the relatively few schema changes I face now (working with only my largest partners), what is the likelihood that I will need to scale out and down? In other words, is it a competitive advantage/necessity for my business to be able to interact with more partners and smaller partners - partners who may not be able to afford the same level of schema consistency that I have? If you are in a situation where you plan to work extensively with a relatively few number of partners, you will likely be able to rely on the contract and the pipeline approach may work smoothly. If not, the pipeline approach may introduce a quantity of potential failure points that is unacceptable.
- You should also take into account the number of transformations that are going to be required. Although most operating-system vendors are getting better at scaling through clustering (Microsoft in particular has made claims to the ability of .NET to scale out), reliance on clustering for performance improvement is often misguided. In addition to administration issues, clustering often introduces performance penalties in gaining additional capacity, a tendency that makes clustering a questionable solution for solving performance problems. An application that will operate exclusively on distributed data would be better served by being architected around its transformation needs than by the traditional approach of being architected around the data-storage mechanism.
- Other considerations include the need for an audit trail (rendering a solution with lossy transformations unacceptable), concurrency needs, transactional needs, and whether the primary operations on the data will be reads or writes. If data is going to be referred to often but modified rarely, the document-centric BLOB approach may be optimal.
What Will Postrelational Databases Look Like?
There are any number of visions about the future of data storage. However, based on the history of database technology and practice, it would seem that a vision should assume at least the following trends:
- Databases will continue to abstract their operations from application and OS dependencies.
- Existing infrastructure investments will demand that data stored in different paradigms can be accessed equally by all consumers.
- Database models will be introduced with little frequency, but new ones will be introduced based on business needs.
- Because they are based on business pressures rather than purely theoretical technical decisions, new database models will not necessarily conform to the assumptions and constraints of previous models.
- There are a number of database concepts that stay relatively untouched across models, even as the assumptions and constraints around those concepts change.
The desired evolution of DBMSs, therefore, could take the form of a Postrelational DBMS that decouples the storage mechanism from the access mechanism. One way to think of this is that there is an underlying DBMS that operates only in terms of very generic database concepts, analogous to an abstract class in object-oriented programming. A DBMS implementation could be derived from this base DBMS that applied the constraints and assumptions of a particular model to the data. Additional DBMS implementations could also be placed on top of the base DBMS, all having native and natural access to the data. Another way to look at this is to think of a data model as a type of view - it is a user's perception of the data. The underlying storage mechanism does not need to necessarily conform to that view internally, but only to ensure that the view remains consistent for the consumers of that view.
While it is true that each implementation would have the additional overhead of the opaque nature of the underlying DBMS interface (and would thus not be able to take advantage of potential performance efficiencies gained by tight coupling), each derived DBMS (or view) would be equally native and act as a peer rather than a predefined stop in a pipeline. Thus, a business could run applications that used different data models concurrently (even within the same application), all of which run with similar and deterministic performance, scalability, and reliability characteristics.
Will this happen? It's difficult to imagine it happening rapidly. The relational model has been dominant for several Internet generations, and many database implementations are inextricable from the SQL/relational model. Likewise, many of the XML solutions are suffering from the growing pains that the relational database solutions went through a long time ago. However, as the need for efficient distributed data management grows both in numbers and in importance, it is likely that there will be a competitive advantage in having a solution such as this, regardless of how sophisticated the pipeline approach becomes. If that is the case, then we can rely on the axiom that computing nature abhors a competitive vacuum.
Organizations have a great deal of investment in their infrastructure incorporating ISAM and relational models. There are a number of competitive advantages that can be gained by distributed computing (such as web services), and the common language of distributed computing is XML. The problem is that the XML model makes different assumptions about data than the ISAM and relational models. The result is that businesses are now tasked with adapting existing infrastructure to a new, incompatible data model more quickly than ever before.
There are several ways to accomplish this, but each has drawbacks. Some of these drawbacks are more likely than others to only show up as the system scales outward; other drawbacks are more obvious. Therefore, it is essential that the integration of distributed data not merely coast along the path of least resistance, but that it proceed in the manner best suited to the needs of the business.
There are several possible futures. As unlikely as it seems, distributed computing may turn out to be a fad. One particular mechanism of model adaptation may improve to the point where it satisfactorily addresses the needs of most businesses. One of the models may evolve to accept the assumptions of the other models, making it the reference model. On the other hand, the importance of distributed computing may simply force businesses to accept the cost of inefficient data transformation. An ideal solution is a DBMS that can apply the constraints of any particular model to the underlying data, allowing existing infrastructure to perform at current levels while providing native and natural support for new models as needed.