In this article I am answering for public benefit some of the questions asked during the chat session on December 11th on SearchDatabase.
1A. Can you name the 'truly relational DBMS' that you have mentioned? 1B. Can you tell us anything about this new technology that makes use of true relational technology?
My reference was not to a relational DBMS per se, but rather to a recently developed technology which can be used to build truly relational DBMSs, as well as other software tools. Unfortunately, I am not at liberty to say much about it at this time. What I can say is that the tables in RDBMSs implemented using this technology would truly resemble mathematical sets and would not have the inherent physical ordering of rows and columns that current SQL products impose on tables. It is, of course, up to the industry to use this technology, but given the way in which it operates, I would not hold my breath (see question 4 below).
2. There are a set of business analysis questions as to whether to establish a separate table for a set of values -- are they limited and known? Do they need validation? Do they have separate descriptive attributes, etc. These should be questions someone should go through with a nontechnical user.
This question is not very clear and indicates fuzzy thinking of the kind which prevails in the industry due to lack of proper education in fundamentals. I can only provide the general principle involved in table design. A more in-depth discussion of these issues is in my book PRACTICAL ISSUES IN DATABASE MANAGEMENT. The design of a logical model -- which consists of tables -- must be preceded by a requirements analysis whose purpose is to define a conceptual or business model. Business modeling is the identification of the pertinent business rules; it is based on subjective perceptions of reality and, therefore, is informal. Business rules at the conceptual level are logically represented in the database by integrity constraints. A database schema is essentially nothing but the sum total of those integrity constraints, so one cannot speak of correct table design except by reference to the set of business rules specified by the business model. Otherwise put, the business rules determine proper table design and, therefore "what values will go into separate tables," to use the asker's expression. The reason so many logical models are poorly designed is precisely because practitioners fail to distinguish between the business and logical levels and do not have appropriate understanding of the business rules and how to map them to the database (see Something to Call One's Own).
3B. In the current state of the art, one might have to "denormalize" in order to wring out performance of these "flawed" implementations. True? 3A. Is there a tool to assist in de-normalizing a schema, or is it done only through experience?
Denormalization cannot be justified in circumstances where updates and, thus, integrity are important, because even in cases where SQL products perform poorly with fully normalized databases -- which is their fault, not that of normalization -- performance gains can only be achieved by ignoring the prohibitive additional integrity burden imposed by denormalized databases (see also question 6). Moreover, while normalization guidelines have a theoretical foundation which makes clear when optimal design -- full normalization -- is achieved, there is no theory to support the reverse process: when do you stop denormalizing, at 4NF, 3NF, 2NF, or 1NF, and why? (see Denormalization for Performance: Et Tu Academia?).
4A. Are you saying then that the major database vendors do not provide the technology within their database products to use full normalization accurately? And, is it the database community that needs to pressure these vendors for such a change? 4B. Should the audience here be RDBMS developers at Microsoft and Oracle? It sounds like the practicing DBAs hands are tied. 4C. Then what you are saying is that this is truly just theory as you have not provided any practical solutions.
That is, indeed, what I am arguing. Fully normalized databases are the "most accurate" logical representations of business models, in the sense that they are the simplest logical representation of a certain type of business rules, of which business model consist. As a purely logical construct, normalization (and denormalization) cannot possibly affect performance, which is determined entirely at the physical level. Any deviation from full normalization imposes a much more complex integrity burden which, if ignored, raises the risk of corruption. But if this additional integrity burden is undertaken, it may well cancel out (and even overtake) performance gains, if any, from denormalization (see What You Don't Know About Normalization Will Hurt You). It follows that if performance with fully normalized databases is poor, this must be due to the physical implementation properties of the DBMS and the database, and the hardware configuration, not to logical design. Vendors are, therefore, responsible for making sure that DBMSs perform best with fully normalized databases; there is no way around that. But this won't happen as long as users -- including DBAs -- believe that the solution is denormalization, ignore the integrity implications, and do not demand the proper solutions from vendors. Their hands will continue to be tied if they don't educate themselves on fundamentals and dismiss them as "just theory" (e.g. comment 4C). Note also that even if vendors provided efficient implementations of truly relational DBMSs, performance could still suffer if DBAs/users failed to design their physical databases correctly.
5. Do the performance issues still exist with a SQL DBMS because the integrity constraints must be implemented in application code?
To the extent that performance problems exist in SQL products, they will be experienced even for those integrity constraints that are enforced by the DBMS in the database. However, when a SQL product does not support certain integrity constraints -- which is common to all SQL implementations -- and they must be enforced by application code, performance problems will be even worse, because the DBMS won't be aware of those constraints and won't integrate them into its optimization process. Relational technology was invented in part to avoid leaving performance optimization to each and every individual application developer, which is a lost cause.
6. Since a data warehouse is not an OLTP scenario (no table updates outside of the ETL process), could you address normalization in the data warehouse?
The notion that data warehouses are "read-only" and their integrity will not be affected by denormalization is somewhat of an illusion. They must be populated, usually more than once, and if no integrity constraints are enforced by the DBMS managing the warehouse at those times, integrity is not guaranteed. As Chris Date points out, users must then be kept informed about the meaning of warehouse data (because the DBMS is unaware of the integrity constraints and does not enforce them) and this process is equivalent to (and much less reliable than) storing constraints. This is particularly true when transaction databases, which serve as sources for the population, are not guaranteed to be correct, in part due to the very fact that they are denormalized themselves. That aside, the logical design of warehouses simply ignores sound design principles altogether, which means they are entirely ad-hoc and arbitrary and do not benefit from any practical advantage of the relational model. There is no sound justification for such designs, their proponents simply do not understand the important distinction between database design and application views and the concept of data independence (see Date's On Business Rules and Integrity Constraints: A Reply to Ralph Kimball).
7. Suppose we have got a truly relational DBMS. Is normalization the complete answer to data integrity? In other words is 5th normal form a guarantee for data integrity?
No! Full normalization means, very loosely, that every entity type in the business model is represented by one logical table in the database. Fully normalized databases, therefore, adhere to the principle 'every non-key column in every table is directly dependent on the key, the whole key and nothing but the key'. Such tables do not exhibit certain kinds of redundancy. The practical implications are that key constraints -- uniqueness -- are sufficient to protect the database from potential corruption due (only) to those specific kinds of redundancy (see Chapter 5 in PRACTICAL ISSUES IN DATABASE MANAGEMENT). Undernormalized databases, on the other hand, "bundle" multiple entity types into single tables, violating the principle and causing those types of redundancies. Note very carefully, though, that, as I explain in chapter 8 in my book, there can be other kinds of redundancy in databases that that are not due to such bundling; e.g., derived columns, or duplicate rows. Such redundancies cause integrity problems of their own that cannot be removed by further normalization -- the logical design will have to be corrected in other ways (see also Date's "Normalization Is Not a Panacea," Database Programming and Design, Vol. 11, No. 4, April 1998). Note also that a true RDBMS will enforce domain, column and database constraints and it's that complete support which guarantees data integrity, not just normalization.
8. Which is your favorite RDBMS?
Since there is no truly and fully relational DBMS available, I can't say I have a favorite DBMS. Most are SQL-based and, as such, are equally poor implementations of the relational model. Those few that are not SQL-based are worse, because they ignore/violate the model altogether.
About the author
Fabian Pascal has an international reputation as an independent technology analyst, consultant, author and lecturer specializing in data management. He was affiliated with Codd & Date and for more than 15 years held various analytical and management positions in the private and public sectors, has taught and lectured at the business and academic levels, and advised vendor and user organizations on database technology, strategy and implementation. Clients include IBM, Census Bureau, CIA, Apple, Borland, Cognos, UCSF, IRS. He is founder and editor of Database Debunkings, a web site dedicated to dispelling prevailing fallacies and misconceptions in the database industry, where C.J. Date is a senior contributor. He has contributed extensively to most trade publications, including Database Programming and Design, DBMS, DataBased Advisor, Byte, Infoworld and Computerworld. His third book, "Practical issues in database management" (Addison Wesley, June 2000), serves as text for a seminar bearing the same name. He can be contacted at firstname.lastname@example.org.
For More Information
- What do you think about this column? E-mail the Editor at email@example.com with your feedback.
- The Best Web Links on the relational model
- Post your technical database questions--or help out your peers by answering them--in our live discussion forums.
- Ask the Experts! Our database design, SQL, Oracle, DB2, SQL Server, metadata, and data warehousing gurus will answer your toughest questions.