Chris Date has long been a champion of relational theory, and his latest book, Database in Depth: Relational Theory for Practitioners is aimed at database professionals who "are honest enough to admit they don't
understand the theory underlying their own field as well as they might, or should." In this excerpt, Date defines the two meanings of the term "data model" and explains why the differentiation should matter to you. Read the complete chapter here.
A review of the original model
You're a database professional, so you already have some familiarity with the relational
model. The purpose of this section is to serve as a kickoff point for our subsequent
discussions; it reviews some of the most basic aspects of that model as originally defined.
Note the qualifier "as originally defined"! One widespread misconception about the
relational model is that it's a totally static thing. It's not. It's like mathematics in that
respect: mathematics too is not a static thing but changes over time. In fact, the relational
model can itself be seen as a small branch of mathematics; as such, it evolves over time as
new theorems are proved and new results discovered. What's more, those new
contributions can be made by anyone who's competent to do so. Like mathematics again,
the relational model, though originally invented by one man,has become a community
effort and now belongs to the world.
By the way, in case you don't know, that one man was E.F. Codd, at the time a
researcher at IBM. It was late in 1968 that Codd, a mathematician by training, first
realized that the discipline of mathematics could be used to inject some solid principles
and rigor into the field of database management, which was all too deficient in such
qualities prior to that time. His original definition of the relational model appeared in an
IBM Research Report in 1969, and I'll have a little more to say about that paper in
Appendix B.
Model versus implementation
Before going any further, there's one very important point I need to explain, because it
underpins everything else in this book. The relational model is, of course, a data model.
Unfortunately, however, this latter term has two quite distinct meanings in the database
world. The first and more fundamental meaning is this:
Definition: A data model (first sense) is an abstract, self-contained, logical definition of
the data structures, data operators, and so forth, that together make up the abstract
machine with which users interact.
This is the meaning we have in mind when we talk about the relational model in
particular. And, armed with this definition, we can usefully (and importantly) go on to
distinguish a data model in this first sense from its implementation, which can be defined as
follows:
Definition: An implementation of a given data model is a physical realization on a
real machine of the components of the abstract machine that together constitute
that model.
I'll illustrate these definitions in terms of the relational model specifically. First, and
obviously enough, the concept of relation is itself part of the model: users have to know
what relations are, they have to know they're made up of tuples and attributes, they
have to know how to interpret them, and so on. All that is part of the model. But they
don't have to know how relations are physically stored on the disk, or how individual
data values are physically encoded, or what indexes or other access paths exist; all that is
part of the implementation, not part of the model.
Or consider the concept join: users have to know what a join is, they have to know how
to invoke a join, they have to know what the result of a join looks like, and so on. Again,
all that is part of the model. But they don't have to know how joins are physically
implemented, or what expression transformations take place under the covers, or what
indexes or other access paths are used, or what physical I/O's occur; all that is part of the
implementation, not the model.
In a nutshell, then:
The model (first meaning) is what the user has to know .
The implementation is what the user doesn' t have to know.
(Of course, I'm not saying users aren't allowed to know about the implementation; I'm
just saying they don't have to. In other words, everything to do with the implementation
should be, at least potentially, hidden from the user.)
Here are some important consequences of the foregoing definitions. First, note that
performance is fundamentally an implementation issue, not a model issue -- despite
extremely common misconceptions to the contrary. We're often told, for example, that
"joins are slow." But such remarks make no sense!! Join is part of the model, and the
model as such can't be said to be either fast or slow; only implementations can be said to
possess any such quality. Thus, we might reasonably say that some specific product X has
a faster or slower implementation of some specific join than some other specific product
Y -- but that's all.
I don 't want to give the wrong impression here. It's true that performance is basically an implementation issue; but that doesn't mean a
good implementation will perform well if you use the model badly!
Indeed, this is precisely one of the reasons why you need to know the
model (I mean, so that you don' t use it badly). If you write an expression such as S JOIN SP, you're within your rights to expect the implementation to do a good job on it; but if you insist on (in effect) hand-coding the join yourself, perhaps like this:
do for all tuples in S ;
fetch S tuple into TNO,TN,TS,TC ;
do for all tuples in SP with SNO =TNO ;
fetch SP tuple into TNO,TP,TQ ;
emit tuple TNO,TN,TS,TC,TP,TQ ;
end ;
end ;
then there's no way you're going to get good performance. Relational
systems should not be used like simple access methods.
Second, as you probably realize, it's precisely the fact that model and implementation are
logically distinct that enables us to achieve data independence. Data independence (not a
great term, by the way, but we're probably stuck with it) means we have the freedom to
change the way the data is physically stored and accessed without having to make
corresponding changes in the way the data is perceived by the user. The reason we might
want to change those storage and access details is, of course, performance; and the fact
that we can make such changes without having to change the way the data looks to the
user means that existing application programs, queries, and so on can still work. Very
importantly, therefore, data independence means protecting your investment in user training
and applications.
As you can see from the foregoing definitions, the distinction between model and
implementation is really just a special case (a very important special case) of the familiar
distinction between logical and physical. Sadly, however, most of today's database systems,
even those that claim to be relational, don't make those distinctions as clearly as they
should. As a direct consequence, they deliver far less data independence than they
should, and far less than relational systems are theoretically capable of. I'll come back to
this issue in the next section, as well as in Chapter 7.
Now I want to turn to the second meaning of the term data model, which I dare say you're
very familiar with. It can be defined thus:
Definition: A data model (second sense) is a model of the persistent data of some particular enterprise.
In other words, a data model in the second sense is just a (possibly somewhat abstract)
database design. For example, we might speak of the data model for some bank, or some
hospital, or some government department.
Having now explained these two different meanings, I'd like to draw your attention to an
analogy that I think nicely illuminates the relationship between them:
A data model in the first sense is like a programming language, whose constructs can
be used to solve many specific problems, but in and of themselves have no direct connection with any such specific problem.
A data model in the second sense is like a specific program written in that language -- it
uses the facilities provided by the model, in the first sense of that term, to solve some
specific problem.
By the way, it follows from all of the above that if we're talking about data models in the
second sense, we might reasonably speak of "relational models" in the plural or "a"
relational model (with an indefinite article). But if we're talking about data models in the
first sense, then there's only one relational model, and it's the model (with the definite
article). I'll have more to say on this issue in Chapter 8.
TechTarget provides enterprise IT professionals with the information they need to perform their jobs - from developing strategy, to making cost-effective IT purchase decisions and managing their organizations' IT projects - with its network of technology-specific Web sites, events and magazines.