Shortly after the posting of the last installment of my three-part article on XML, the editor forwarded what he referred to as a "thoughtful response" to it, indicating that he was considering publishing it and inviting me to respond. Now, the three-parter was itself a reply to a reaction from an XML contributor to an earlier article on XML in this series, and it took three times as much text to respond to that reaction as to write the original article; similarly, this reply is almost four times longer than the reaction it responds to. This must be so: poorly structured declarations without supporting evidence are easier and shorter to produce, particularly if one is unaware of it; informed, coherent, well-structured arguments backed by evidence take more time and space. This does not make responding to arm-waving a cost-effective endeavor, and it certainly is frustrating.
I made this problem explicitly clear in the third part of my three-parter:
"Unfortunately, the Internet generation of practitioners, many (if not most) of whom are self-taught, have primarily, if not solely, HTML and Java programming skills, if any. They usually have not had formal exposure to data fundamentals in general and the relational model in particular. That is why, as already mentioned, they fail (among other things), to make a proper distinction between a DBMS and an application. It is, therefore, not surprising that they perform data management in application programs. This requires some data model, XML provides one, and, having not been around to experience the complications of hierarchic and application-based data management, they adopt it (without realizing, of course, that this is what they do)."
Consider, in this context, how the reaction forwarded by the editor starts:
"After kicking around the ideas on an XML developer's mailing list, I think that XML advocates could make three points in response." [emphasis mine]
Before I wrote anything about XML, I did some intensive research into the various arguments advanced for it. Ultimately, my references were mainly to a seminal article by Bosak and Bray, considered inventors of XML, in part because most of what is being said about XML in the industry is fuzzy, confused, inconsistent, irrelevant, or incorrect. So there is hardly a need for my critic to tell me what the proponents' "points" would be, because my criticism was targeted directly at those very "points". The reader should note that XML proponents keep trying to "educate" us relational proponents on XML, when it is we who bother to learn their subject before we assess it, and it is they who criticize our subject without bothering to learn it.
More distressing, though, is the notion that, aside from vendors, "mailing lists of practitioners" are the almost exclusive source of knowledge in the industry. If, as I argue, the main problem is general lack of foundational knowledge by practitioners, how can consulting lists of them provide intelligent responses to fundamental criticism? Lack of any attempt at education (as distinct from just experience) is what makes responding to most reactions futile. As I wrote in last month's column:
"What Marge finds fascinating about the [student's] posting [is that] "within a few hours, many hundreds of people had read the student's thesis and had responded with statements that were truly impressive in their number and quality" (as of this writing, over 100 pages). That, in my experience, is entirely predictable when topics are "hot," which usually means way more heat than light and hardly impressive. The clear, precise and relatively simple relational thinking leads to succinctness, and genuine controversies are quite rare. The opposite is true of OO thinking, which generates volumes of drivel and fake controversies like "impedance mismatch" due to its fuzzy and confused nature."
XML thinking is not much different (in part precisely because it incorporates object thinking).
My critic states:
"Part 3 ... was quite thought provoking ... The most interesting point for me is that the hierarchical model upon which XML is based is essentially the same as the CODASYL model which got trounced in the "great debate" of the 1970's and was replaced by the relational model in virtually all discussions of database theory and practice."
There are several inaccuracies here.
- CODASYL relied on the network data model, a generalization of the hierarchic data model; it is the hierarchic model that underlay early DBMSs such as IBM's IMS and now underlies XML.
- The hierarchic model was not just "trounced in the great debate." Its actual implementations were discarded in practice. It is odd that critics who deem relational technology "just a theory," conveniently refer to the market failure of actual hierarchic products as "loss in a debate" (that is, if they are at all aware that such products had even existed -- many "new" technologies are, like XML, relabeled old concepts, because there is usually no awareness of the past).
- The relational model may have "replaced the hierarchic model in discussions and theory" -- and even that only for a while -- but certainly not in practice. There has not been a true, correct and complete implementation of the model, precisely because of the lack of foundational knowledge in the industry and the inability of users and the trade media to distinguish between genuine relational features and their bastardization by SQL. The industry overwhelmingly and erroneously considers SQL and its commercial implementations relational, even though they provide very little of the practical benefits of the theory.
In fact, my critic provides an excellent example of this last problem, revealing the common lack of understanding as to what theory is, when he says:
"XML (unlike the relational model) has no pretensions at being a general theory of data. It reflects best practices that have evolved for dealing with documents and other "semi-structured" types of data."
Exactly! This is precisely the problem! Here's a similar reaction from a DBMS designer when I pointed out that, having not implemented data types, his product did not support a full data model and, therefore, could not be deemed a true DBMS:
"... I think I have a crude understanding of ... "what [data] types are and what their function in a data model is." Suneido DBMS does not implement these concepts, nor does it make any claims to ... like its language, is dynamically typed i.e. database columns (fields) do not have fixed types -- they can hold any type of value."
Seems like if you don't claim to have a sound foundation for what you do, there will be no consequences. This is the level of knowledge prevalent in the industry and I can only follow Chris Date in offering a quote by a substantial mind on this subject:
"Those who are enamored of practice without theory are like a pilot who goes into a ship without rudder or compass and never has any certainty where he is going. Practice should always be based upon a sound knowledge of theory." -- Leonardo da Vinci
What is even more fascinating -- and beautifully validates both the fallacy and consequences of operating without a theoretical compass -- is that the most infamous hierarchic DBMS, IBM's IMS, was developed exactly in the same way as XML: it reflected existing practices that had evolved for dealing with data, rather than being built on a sound theoretical foundation (more on why below). How exactly are "best" practices selected in the absence of theory? And if the best were selected, why was IMS discarded? It was, of course, specifically the lack of a theoretical foundation that made IMS, among other things, very complex and inflexible, and discarding it was an enormously costly proposition (so much so, that many organizations are still trapped into using it today). There cannot be any better evidence that, as I titled one of my previous articles in this series, those who forget or ignore the past are condemned to repeat it. The IT industry just does not learn (for reasons why, see my article The Myth of Market-based Education).
My critic continues:
"Medical records are probably good examples -- they contain some structured data (personal identification information, summaries of lab tests at different points in time) and some free text (case histories, physician's notes, orders, etc.) Many such "documents" have a different notion of "integrity" than typical databases have. In a document (especially one with some legal status, such as a contract, order, prescription, etc.) the important thing is for the retrieved version to precisely match the version that was originally stored. Thus, if a patient's address changes between the time a prescription is ordered and the time some audit occurs, it would be a "bad thing" if the new address was retrieved as part of the prescription, because any digital signature applied to the original would be broken. In a "database," integrity generally refers to the consistency between the various views of the underlying data, so if we change the patient's address in the "medical record" view, we want to see that properly reflected in the "prescription" view. The hierarchical model of data is indeed inferior for the latter "database" case, but XML is well-suited for digitally representing "documents" in a searchable, transactionable database or for transporting them between disparate hardware/software systems."
This is a classic example of what I mean when I say that XML proponents try to do database management without much knowledge of basic data fundamentals. I urge my critic to educate himself on integrity before he comments on it publicly -- there is a chapter on it in Practical Issues in Database Management. The response to his comments can be found in Chapter 1 of the book, which covers data types. My critic seems to imply that "free text" is not "structured" data. (What exactly is "semi-structured" data?) Indeed, the industry generally refers to text, images, audio and so on as "unstructured data." But truly unstructured data would be random data and, therefore, would not have any informational value. It is more accurate to say that data of value is always structured -- the value lies precisely in the structure, which infuses it with meaning -- but in different ways. And indeed, so-called "complex" data involve different integrity constraints and manipulation than "simple" data -- not "different notions thereof"! -- which is precisely the crux of the matter. But, as I explain in the book, this is an issue orthogonal to the data model employed. Any data model must contend with this issue and neither XML, nor any other quasi-model advanced in the industry, has any superiority in this regard. Claims to the contrary are simply due to poor understanding of what a data model is.
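To make the prescription case concrete, here is a minimal sketch in Python; it is an illustration only, not something taken from the critic's reaction or from the book. It assumes that the address as of the order date is recorded as an attribute of the prescription itself, separate from the patient's current address. The names Prescription, address_at_issue, patients and audit are all hypothetical. Under those assumptions, "the retrieved version must match the stored version" is one more integrity constraint on the data, and stating it does not require a hierarchy.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Prescription:
    """A prescription as issued (hypothetical type); frozen, so it cannot be altered."""
    rx_id: int
    patient_id: int
    address_at_issue: str  # assumption: the address on the order date is a fact of its own
    drug: str

    def digest(self) -> str:
        # Stand-in for a digital signature: a SHA-256 hash over the attribute values.
        text = f"{self.rx_id}|{self.patient_id}|{self.address_at_issue}|{self.drug}"
        return hashlib.sha256(text.encode("utf-8")).hexdigest()


def audit(prescription: Prescription, stored_digest: str) -> bool:
    """Constraint: what is retrieved must match what was originally stored."""
    return prescription.digest() == stored_digest


# Current addresses, keyed by patient id (hypothetical data).
patients = {1: "12 Elm St"}

rx = Prescription(101, 1, patients[1], "amoxicillin 500 mg")
signed = rx.digest()

# The patient moves: the current address changes, the issued prescription does not.
patients[1] = "99 Oak Ave"

print(audit(rx, signed))    # the audit still passes
print(rx.address_at_issue)  # the address as of the order date
print(patients[1])          # the patient's current address
```

The design choice here is simply to treat "address when ordered" and "current address" as two distinct facts; whether that is the right design for a real medical records system is beyond what this sketch can show.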
About the Author
Fabian Pascal has an international reputation as an independent technology analyst, consultant, author and lecturer specializing in data management. His third book is called Practical Issues in Database Management: A Guide for the Thinking Practitioner.
For More Information