Chris Date has long been a champion of relational theory, and his latest book, Database in Depth: Relational Theory for Practitioners is aimed at database professionals who "are honest enough to admit they don't understand the theory underlying their own field as well as they might, or should." In this excerpt, Date defines the two meanings of the term "data model" and explains why the differentiation should matter to you. Read the complete chapter here.
A review of the original model
You're a database professional, so you already have some familiarity with the relational model. The purpose of this section is to serve as a kickoff point for our subsequent discussions; it reviews some of the most basic aspects of that model as originally defined. Note the qualifier "as originally defined"! One widespread misconception about the relational model is that it's a totally static thing. It's not. It's like mathematics in that respect: mathematics too is not a static thing but changes over time. In fact, the relational model can itself be seen as a small branch of mathematics; as such, it evolves over time as new theorems are proved and new results discovered. What's more, those new contributions can be made by anyone who's competent to do so. Like mathematics again, the relational model, though originally invented by one man,has become a community effort and now belongs to the world.
By the way, in case you don't know, that one man was E.F. Codd, at the time a researcher at IBM. It was late in 1968 that Codd, a mathematician by training, first realized that the discipline of mathematics could be used to inject some solid principles and rigor into the field of database management, which was all too deficient in such qualities prior to that time. His original definition of the relational model appeared in an IBM Research Report in 1969, and I'll have a little more to say about that paper in Appendix B.
Model versus implementation
Before going any further, there's one very important point I need to explain, because it underpins everything else in this book. The relational model is, of course, a data model. Unfortunately, however, this latter term has two quite distinct meanings in the database world. The first and more fundamental meaning is this:
Definition: A data model (first sense) is an abstract, self-contained, logical definition of the data structures, data operators, and so forth, that together make up the abstract machine with which users interact.This is the meaning we have in mind when we talk about the relational model in particular. And, armed with this definition, we can usefully (and importantly) go on to distinguish a data model in this first sense from its implementation, which can be defined as follows:
Definition: An implementation of a given data model is a physical realization on a real machine of the components of the abstract machine that together constitute that model.I'll illustrate these definitions in terms of the relational model specifically. First, and obviously enough, the concept of relation is itself part of the model: users have to know what relations are, they have to know they're made up of tuples and attributes, they have to know how to interpret them, and so on. All that is part of the model. But they don't have to know how relations are physically stored on the disk, or how individual data values are physically encoded, or what indexes or other access paths exist; all that is part of the implementation, not part of the model.
Or consider the concept join: users have to know what a join is, they have to know how to invoke a join, they have to know what the result of a join looks like, and so on. Again, all that is part of the model. But they don't have to know how joins are physically implemented, or what expression transformations take place under the covers, or what indexes or other access paths are used, or what physical I/O's occur; all that is part of the implementation, not the model.
In a nutshell, then:
- The model (first meaning) is what the user has to know .
- The implementation is what the user doesn' t have to know.
Here are some important consequences of the foregoing definitions. First, note that performance is fundamentally an implementation issue, not a model issue -- despite extremely common misconceptions to the contrary. We're often told, for example, that "joins are slow." But such remarks make no sense!! Join is part of the model, and the model as such can't be said to be either fast or slow; only implementations can be said to possess any such quality. Thus, we might reasonably say that some specific product X has a faster or slower implementation of some specific join than some other specific product Y -- but that's all.
I don 't want to give the wrong impression here. It's true that performance is basically an implementation issue; but that doesn't mean a good implementation will perform well if you use the model badly! Indeed, this is precisely one of the reasons why you need to know the model (I mean, so that you don' t use it badly). If you write an expression such as S JOIN SP, you're within your rights to expect the implementation to do a good job on it; but if you insist on (in effect) hand-coding the join yourself, perhaps like this:
do for all tuples in S ; fetch S tuple into TNO,TN,TS,TC ; do for all tuples in SP with SNO =TNO ; fetch SP tuple into TNO,TP,TQ ; emit tuple TNO,TN,TS,TC,TP,TQ ; end ; end ;then there's no way you're going to get good performance. Relational systems should not be used like simple access methods.
Second, as you probably realize, it's precisely the fact that model and implementation are logically distinct that enables us to achieve data independence. Data independence (not a great term, by the way, but we're probably stuck with it) means we have the freedom to change the way the data is physically stored and accessed without having to make corresponding changes in the way the data is perceived by the user. The reason we might want to change those storage and access details is, of course, performance; and the fact that we can make such changes without having to change the way the data looks to the user means that existing application programs, queries, and so on can still work. Very importantly, therefore, data independence means protecting your investment in user training and applications.
Now I want to turn to the second meaning of the term data model, which I dare say you're very familiar with. It can be defined thus:
Definition: A data model (second sense) is a model of the persistent data of some particular enterprise.In other words, a data model in the second sense is just a (possibly somewhat abstract) database design. For example, we might speak of the data model for some bank, or some hospital, or some government department.
Having now explained these two different meanings, I'd like to draw your attention to an analogy that I think nicely illuminates the relationship between them:
- A data model in the first sense is like a programming language, whose constructs can be used to solve many specific problems, but in and of themselves have no direct connection with any such specific problem.
- A data model in the second sense is like a specific program written in that language -- it uses the facilities provided by the model, in the first sense of that term, to solve some specific problem.