This is the second part of a two-part series. Part One can be found here.
"Second, there are many people who have spent quite a number of years attempting to work with semi-structured SGML and XML data in relational databases. They had very significant difficulties, which motivated the RDBMS vendors to add "post-relational" features to support XML. If the relational model is so simple and superior, why is it so hard to actually use in concrete practice for the kinds of things that people are now doing with XML? One person I corresponded with on this subject compared using an RDBMS to store XML with using a Turing Machine to write real-world software: OK, we *know* mathematically that it will work in principle, we suspect that there are situations in which being able to prove that it works would be of benefit, but for most real purposes the amount of effort to use an RDBMS for XML or a Turing Machine for programming is enormously greater than the more mundane alternatives. Also, Mr. Pascal himself points out (in Chapter 7 of his most recent book) that while the abstract relational model is well-suited to handling and querying hierarchical data, this is difficult in the SQL implementations currently available to end users (even with proprietary extensions and "post-relational" additions to the SQL standard)."
Frankly, I don't know whether to laugh or cry. I am fundamentally questioning the very need for, and usefulness of, XML, and my critic replies "but we are having a problem handling XML data in relational databases" -- the very point of my argument! As I have explained many times, there are no relational databases, only SQL databases, which are very far from being relational. It is their lack of relational fidelity that causes most of the problems: in the case at hand, SQL's lack of proper support for correct relational manipulation of hierarchic data. And the solution is not to regress to already failed hierarchic databases -- and then handle them in SQL to boot -- but to implement true RDBMSs and have them handle truly relational databases. When this is done and "many people" still cannot do what they supposedly can with XML, let's talk; but until then, please give me a break.
Object and XML thinking are chock-full of fuzzy thinking and of terms such as "semi-structured" and "post-relational". One example I particularly like is:
"Object-oriented systems can be classified into two main categories -- systems supporting the notion of class and those supporting the notion of type ... Although there are no clear lines of demarcation between them, the two concepts are fundamentally different." E. Bertino & L. Martino, "Object Oriented Database Systems: Concepts and Architectures"
I can only suggest that my critic take a break from his mailing lists and read Chris Date's "What Do You Mean Post-Relational?" and perhaps also his earlier multi-part article "Why Is It Important to Think PRECISELY" in his Relational Database Writings 1994-97.
The comment that "we know mathematically that it would work" and all that nonsense does not merit a response. It reflects utter ignorance and lack of appreciation of what formal theory (and science) is. It implies -- whether my critic realizes it or not -- that the correctness of both databases and answers to queries -- which only theory (logic and math) can guarantee -- is unimportant (see my first article in this series).
Regarding my criticism of SQL and its commercial implementations for failure to support hierarchies (or support them correctly), my critic draws the wrong conclusion. Again, the solution is not XML or any other nonrelational databases; rather, it is the true, correct and complete implementation of the relational model which, as the chapter clearly shows, is quite capable of supporting hierarchies and with much less complexity and many more advantages than XML. If he wants to be taken seriously, my critic should demonstrate exactly how and why XML does a better job than that -- not SQL!! Unless he does that, he's treading water.
"Finally, whatever the formal limitations of the hierarchical model of data, it seems clear to me that most humans can think in terms of hierarchies much more easily than they can think relationally. (Pascal seems to disagree, implying that often hierarchical models are imposed for non-obvious reasons). I'd point to Herbert Simon's classic essay on "The Architecture of Complexity" (in SCIENCES OF THE ARTIFICIAL, 1969, revised and republished many times) to support my position: people organize concepts in levels because nature is organized in levels, and these hierarchies of nearly-decomposable sub-systems are much more evolutionarily stable than the alternatives. But whatever the reason, it is just easier to think hierarchically than relationally. As a person named Jeff Lowery put it on an XML mailing list, "human beings understand hierarchies better than they understand relational models. There's a lot of applications developers who refuse to touch SQL ... but a complex OO hierarchy they're okay with. To understand the success of XML is to understand that point, IMHO. Hierarchies organize things better for human perception, even if they're more limited (or at least awkward) in the conceptual models they can represent." Mr. Pascal, and C.J. Date, and others of their persuasion have argued repeatedly that this kind of thinking shows software developers' ignorance of the scientific principles underlying databases. It could be argued that relational purists are at least as ignorant of the psychological and economic reality of systems design and programming as the typical software developers are of the relational algebra!"
I'm afraid that there is no way to characterize the thinking behind this paragraph -- thinking may be too kind a term -- without being accused of being insulting, so I will refrain. Anything could be argued, but not everything could be argued rationally. The hierarchic model of data happens to have a formal basis. It is very telling, however, that the designers of both IMS and (explicitly) XML had to renounce adherence to it, precisely because, in complete contradiction to what my critic claims, that theory, unlike relational theory, is too complex to be practical. Here's an extract from the IMS documentation (with acknowledgement to Chris Date):
"Logically deleting a logical child prevents further access to the logical child using its logical parent. Unidirectional logical child segments are assumed to be logically deleted. A logical parent is considered logically deleted when all its logical children are physically deleted. For physically paired logical relationships, the physical child paired to the logical child must also be physically deleted before the logical parent is considered logically deleted."
But without the theory, correctness can no longer be guaranteed. The advantage of relational theory is that it can guarantee correctness with maximum simplicity.
Those with a limited education tend to try to obscure it, and to impress, by throwing references or quotes around. Because a formal education and general knowledge are usually dismissed in this culture, the chances of getting caught with irrelevant, inaccurate or even nonexistent references are slim. The Simon reference is a good example of that. Let me just say that I have a BA in economics, an MA in political science, an ABD in quantitative social science and an MBA, and for more than 10 years I studied and taught social systems and behavior, with particular emphasis on their economic underpinnings; I have been in the IT industry for over 13 years. So I am only too painfully cognizant of the "psychological and economic reality of system design and programming". If my critic and "typical software developers" educated themselves on the fundamentals, reality would be different, and one does not need to invoke Simon for that.
I will conclude by briefly stating as follows:
- There are situations where the real world is inherently hierarchic (e.g. organizations, bills of materials); in those circumstances -- and only in those circumstances! -- the hierarchy should be represented in the database and, for many reasons, regardless of what Jeff Lowery says, this can be done better relationally than with XML (see Chapter 7, "Climbing Trees in SQL", in Practical Issues in Database Management).
- If the world is not inherently hierarchic, thinking about it hierarchically forces hierarchy on the data, which provides no advantage and causes many unnecessary complications, not the least of which is reduced flexibility.
- Many practitioners "think hierarchically" because products and practices induce such thinking and because they lack an education that exposes them to fundamentals such as logic and relational concepts; in fact, it can be argued that a long tradition of poor products and practices in the absence of education -- as distinct from training -- has been producing practitioners who -- and Lowery is correct here -- are much more comfortable with the unnecessarily complex and difficult, but at a loss with the simple and easy (it helps that it is the former, not the latter, that is profitable in the industry).
- XML proponents are unaware of, underestimate, or ignore the complications and problems inherent in the (nonrelational) integrity and manipulation of hierarchic structures -- they base arguments on very simplistic specific instances and only structural considerations, and do not think generally (ask yourself what is more general, tables or trees?); this is reinforced by the "cookbook approach" prevalent in the industry (see my first article in this series).
- At issue is not whether something "will work" -- fuzziness again -- but rather whether it guarantees correctness, and only a sound theoretical foundation can do that. If an ad-hoc alternative seems simpler and easier, it is only because it ignores the complications that would be required to guarantee correctness.
- With regard to the specific issue at hand, experience has already proved which alternative -- hierarchic or relational -- is both sounder and easier, and it is hardly surprising to see strong believers in the efficiency of the market ignore it when it is inconvenient.
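To make the first two points concrete, here is a minimal sketch, in Python rather than in any DBMS, of a bill of materials held as a single relation (a set of tuples). It is an illustration under stated assumptions, not the treatment given in Chapter 7: the part names, the quantities and the function names `explode` and `where_used` are all hypothetical. The point it tries to show is that the same relation answers both "what does this assembly contain?" and "where is this part used?", with neither direction privileged by the structure of the data.

```python
# Hypothetical bill-of-materials relation: each tuple is
# (assembly, component, quantity). Names and quantities are made up.
BOM = {
    ("bicycle", "wheel", 2),
    ("bicycle", "frame", 1),
    ("wheel", "spoke", 32),
    ("wheel", "rim", 1),
    ("frame", "tube", 3),
}


def explode(bom, root):
    """Return every part contained, directly or indirectly, in `root`."""
    found = set()
    frontier = {root}
    while frontier:
        # One join step: the components of the parts reached in the last step.
        step = {comp for (asm, comp, _qty) in bom if asm in frontier}
        # Only parts not seen before are expanded further, so the loop
        # terminates even if the data accidentally contains a cycle.
        frontier = step - found
        found |= step
    return found


def where_used(bom, part):
    """Return every assembly that contains `part`, directly or indirectly."""
    # Swapping the first two attributes lets the same traversal run "upward".
    inverted = {(comp, asm, qty) for (asm, comp, qty) in bom}
    return explode(inverted, part)


print(sorted(explode(BOM, "bicycle")))
print(sorted(where_used(BOM, "spoke")))
```

In a tree-shaped document the first question follows the nesting and the second has to work against it; with the relation, the two differ only in which attribute is matched.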
About the author
Fabian Pascal has a national and international reputation as an independent technology analyst, consultant, author and lecturer specializing in data management. He was affiliated with Codd & Date and for more than 15 years held various analytical and management positions in the private and public sectors, has taught and lectured at the business and academic levels, and advised vendor and user organizations on database technology, strategy and implementation. Clients include IBM, Census Bureau, CIA, Apple, Borland, Cognos, UCSF, IRS. He is founder and editor of Database Debunkings, a web site dedicated to dispelling prevailing fallacies and misconceptions in the database industry, where C.J. Date is a senior contributor. He has contributed extensively to most trade publications, including Database Programming and Design, DBMS, DataBased Advisor, Byte, Infoworld and Computerworld and is author of the contrarian column Against the Grain. His third book, "Practical Issues in Database Management" (Addison Wesley, June 2000), serves as text for a seminar bearing the same name. He can be contacted at firstname.lastname@example.org.