
XML data management: Setting some matters straight, Part III

If XML is just a physical format for data exchange why does it incorporate logical constructs such as data model components?

This is the third part of a series. The first and second parts are also available.

The two properties of XML being advanced as uniquely beneficial are its extensible, specializable markup tags and its document tree-structure. To quote Bosak and Bray again, the tags "say what the information is, not what it looks like. For example, label the parts of an order for a shirt not as boldface, paragraph, row and column -- what HTML offers -- but as price, size, quantity and color ..." Note very carefully that what Bosak and Bray refer to here is the interpretation of data -- what the data means -- which is logical, not physical. (And so are "integration with URLs" and "readability".) Leaving aside the highly debatable notion that XML documents are "readable", to the extent that XML is a physical format for data exchange between applications, why is readability relevant? If it is not, then it can actually be argued that text-based data exchange compromises performance for no advantage.

Furthermore, whether XML proponents realize it or not, "what things are, how they are related and how to deal with them" is almost the definition of a data model (more specifically, of the three main components of a data model: data types, data structure/integrity, and data manipulation). A data model is a general theory of data that infuses the data with meaning, which, when imparted to the DBMS, lets it "take orders, transmit medical records, even run factories and scientific instruments" via applications. A data model is the logical foundation of database management: there cannot be data management without some underlying data model, but physical formatting of data for exchange purposes does not require a data model. And the fact is that part of a data model does underlie XML: the above-mentioned tree structure is the structural component of the hierarchical data model.

Note: "The W3C XML Query Data Model ... is the foundation of the [not yet specified] W3C XML Query Algebra. Together, these two documents [will] provide a precise semantics of the XML Query Language." Hugh Darwen says: "Now, my eyes light up at the word 'algebra' ... Originally, I understood it to mean a set of operations that are closed over some type. That is, every operation in X Algebra operates on zero or more values of type X and returns a value of type X. Hence, set algebra, Boolean algebra, relational algebra and the algebra of numbers that gives us arithmetic. Over what is the XML Query Algebra closed? Nobody has ever given me an answer that makes sense (apart from the occasional, honest 'I don't know')."
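The closure property Darwen describes can be illustrated with a minimal sketch. It assumes a toy representation of a relation as a set of tuples, each tuple being a set of (attribute, value) pairs; the function names `relation`, `restrict`, `project` and `join`, and the shirt data, are hypothetical and chosen only for illustration. Every operation takes relations and returns a relation, which is why the operations can be nested freely.

```python
def relation(*rows):
    """Build a relation: a set of tuples, each a set of (attribute, value) pairs."""
    return frozenset(frozenset(row.items()) for row in rows)


def restrict(rel, predicate):
    """Keep only the tuples that satisfy the predicate; the result is a relation."""
    return frozenset(t for t in rel if predicate(dict(t)))


def project(rel, attrs):
    """Keep only the named attributes; duplicate tuples collapse automatically."""
    return frozenset(
        frozenset((a, v) for a, v in t if a in attrs) for t in rel
    )


def join(left, right):
    """Natural join: combine tuples that agree on all common attributes."""
    out = set()
    for lt in left:
        for rt in right:
            ld, rd = dict(lt), dict(rt)
            if all(ld[a] == rd[a] for a in ld.keys() & rd.keys()):
                out.add(frozenset({**ld, **rd}.items()))
    return frozenset(out)


# Illustrative data only.
shirts = relation({"sku": 1, "color": "blue"}, {"sku": 2, "color": "red"})
prices = relation({"sku": 1, "price": 20}, {"sku": 2, "price": 25})

# Because each operation returns a relation, the output of one can be
# fed directly to the next: this nesting is what closure buys.
expensive_colors = project(
    restrict(join(shirts, prices), lambda t: t["price"] > 20),
    {"color"},
)
```

Here `expensive_colors` is itself a relation with the single attribute `color`, so it could be joined, restricted or projected again without any conversion step.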

If XML is just a physical format for data exchange, why does it incorporate logical constructs such as data model components? What is more, due to their horrendous complexity and inflexibility, databases and DBMSs relying on the hierarchical model became obsolete in the 1980s, at least technologically. SQL DBMSs, based -- albeit insufficiently and in many ways incorrectly -- on the simpler relational data model, which is itself grounded in predicate logic and set mathematics, proved superior. In fact, as Chris Date points out, one of Codd's original objectives for the relational model was to simplify data interchange! What is the justification, then, for choosing a more complex, discredited data model for data exchange, when a majority of commonly used DBMSs employ a simpler, sounder and, thus, superior data model? As I demonstrate in Chapter 7 of Practical issues in database management, any true hierarchy (tree) can be represented and better accessed relationally. In fact, because the relational tabular structure is more general than the tree structure, the latter must frequently be forced on data to make it amenable to XML (Bosak and Bray's example of a bank statement makes this quite clear).
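As a rough illustration of the claim that a tree can be represented and accessed in tabular form, the sketch below stores the shirt-order hierarchy as rows in a single table and retrieves a subtree with a recursive query. It is not the treatment given in Chapter 7 of the book: the table name `node`, its columns, the sample values and the helper functions are all assumptions made for this example, and SQL (through Python's standard `sqlite3` module) stands in for a relational language.

```python
import sqlite3


def build_db():
    """Create an in-memory database holding one small hierarchy as rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE node (
            id     INTEGER PRIMARY KEY,
            parent INTEGER REFERENCES node(id),  -- NULL for the root
            name   TEXT NOT NULL,
            value  TEXT
        )
        """
    )
    rows = [
        (1, None, "order", None),
        (2, 1, "shirt", None),
        (3, 2, "price", "19.99"),
        (4, 2, "size", "M"),
        (5, 2, "quantity", "2"),
        (6, 2, "color", "blue"),
    ]
    conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?)", rows)
    return conn


def descendants(conn, root_id):
    """Return (name, value, depth) for the node root_id and everything below it."""
    sql = """
        WITH RECURSIVE sub(id, name, value, depth) AS (
            SELECT id, name, value, 0 FROM node WHERE id = ?
            UNION ALL
            SELECT n.id, n.name, n.value, sub.depth + 1
            FROM node AS n JOIN sub ON n.parent = sub.id
        )
        SELECT name, value, depth FROM sub ORDER BY id
    """
    return conn.execute(sql, (root_id,)).fetchall()


def find_by_name(conn, name):
    """Look up values by element name without walking the tree at all."""
    return conn.execute(
        "SELECT value FROM node WHERE name = ?", (name,)
    ).fetchall()


conn = build_db()
shirt_subtree = descendants(conn, 2)
sizes = find_by_name(conn, "size")
```

The second helper hints at the access argument: once the hierarchy is held as rows, a lookup by name needs no knowledge of where the item sits in the tree, whereas a purely hierarchical access path must navigate from the root.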

Note: Some of the complications of hierarchical database management were due to hierarchical DBMSs exposing the physical level to users and applications, violating data independence (in this sense the characterization of XML -- based as it is on the hierarchical model -- as "physical" is, perhaps, revealing). Also instructive is that XML proponents consider DBMSs to be one kind of application. Failure to distinguish between a DBMS and an application program results in either database functions being implemented in application programs -- another violation of data independence -- or, as alluded to earlier, in DBMSs becoming "application-specific", e.g. "object DBMSs". It is not coincidental that some object terminology and concepts have crept into XML: both technologies originate in programming, not database management.

Unfortunately, the Internet generation of practitioners, many (if not most) of whom are self-taught, have primarily, if not solely, HTML and Java programming skills. They usually have not had formal exposure to data fundamentals in general and the relational model in particular. That is why, as already mentioned, they fail to make a proper distinction between a DBMS and an application. It is, therefore, not surprising that they perform data management in application programs. This requires some data model, and XML provides one. Not having been around to experience the complications of hierarchical and application-based data management, they adopt it (without realizing, of course, that this is what they're doing). As Jelliffe admits, "having XML data would encourage programmers to directly use the element/attribute structure available" and they "may indeed decide to use it for coarse-grained queries which are culled [sic] on systems closer to the user", but he claims that "whether the peripheral systems are built using relational concepts seems little to do with XML". Well, not exactly.

Note: Most XML inventors come from text processing and publishing. Jelliffe himself comes "from the background that a markup language is a software engineering technique and a kind of user interface. These are physical implementation issues concerned with distributed systems which can be largely independent of the underlying data management and the logical data model."

Given that XML is a physical format for data exchange, the effort and resources dedicated to XML data management by the industry, and the directions in which it is being taken, are simply hard to justify. It would be much more productive to dedicate those resources to a true implementation of the relational model. (For a new, radical implementation technology that allows this for the first time since the invention of the model, stay tuned to my Database Debunkings site.) Even if one agrees that XML exchanges are more efficient than exchanges in other formats -- which is arguable at best -- performance is determined at the physical level by implementation factors (e.g. hardware, internal storage and access structures, network loads, etc.), not by the data model employed for logical representation and access (see Normalization and performance: Never the twain shall meet in this series and Denormalization for Performance: Et Tu Academia?).

A major problem in the information technology industry is that its practices make learning, using and integrating data and technologies extremely complex, inefficient and costly. This is due in large part to the flouting of sound scientific foundations, which tends to proliferate ad-hoc, narrowly targeted technologies that require complex, never-ending "mappings," "translations," "bridging," and "integration efforts." Ironically, many of these technologies are created as "standards," purportedly to simplify and to maximize communication and integration, but the plethora of different, ad-hoc, often multiple standards achieves just the opposite. It was not necessary to invent a new technology just for the purpose of "transmitting" structured data. But having invented one, the industry should certainly not have based it on an inferior, discredited data model.

To get a sense of how XML unnecessarily reinvents the database wheel, and of some of the underestimated practical complications thereof, see Overcome the Web Transaction Barrier, Modeling Relational Data in XML, and Untapped Promise of XML.

About the Author

Fabian Pascal has an international reputation as an independent technology analyst, consultant, author and lecturer specializing in data management. His third book is called Practical issues in database management: A guide for the thinking practitioner.
