One of thorniest aspects of database management, and one that is (for good reasons) least understood, is missing data. This article is a brief response to some online exchanges on SQL's specific implementation of missing data: NULLs. For a more in depth -- and what I believe to be (at least for now!) definitive -- treatment and the reasoning behind the arguments in this article see Chapter 10, "What you don't know can hurt you: Missing information" in
In my opinion, there is no reason to avoid NULL columns [sic] as long as the database designer (modeler) and the people writing the code understand what a NULL means. In formal relational theory, there are really two generic kinds of NULLs, with some variations that can occur in certain data types. One type of NULL means that the value isn't known or knowable. It implies uncertainty or inability to measure a value. The other type of NULL means the data isn't available. It could be measured, but at the point the data was recorded or entered, the value wasn't known. For the purpose of simplicity, both of these NULL values were combined into a single value in SQL-92, which is the dominant form of SQL today. In the sense that both of these types of NULL values boil down to "I don't know," this is a reasonable compromise to KISS. There are other forms of NULL values which are much more specific. They involve things like the NAN values of real numbers, and the (multiple) Boolean forms of uncertainty. These are limited in scope, and rarely if ever used even if they are available in an SQL implementation. -- Online exchange
Nulls are not bad unless you try to use them to mean something in a table (jab, jab). A null by definition is the absence of data. In a relational database, a null value in a table constitutes an entity conglomerate and should therefore be normalized to a child table which would only hold the "missing" values. In a "fully" normalized database, nulls would only appear in views and selects. For more information on this and other often misunderstood database design issues, read the fabulous book Practical issues in database management by Fabian Pascal. --Online advice
But does it make sense to speak of "understanding what nulls mean"? The problem with SQL's NULLs is that they do not represent anything in the real world -- so what, then, do they mean, if anything? This problem is reflected in the term "null value," which is a contradiction in terms: a NULL marks the absence of a value, so it is not a value! It is for this reason that the NULL approach is inherently complex, counterintuitive, prone to errors and, thus, badly implemented to boot. What is more, it yields results that are incorrect in the real world. Take, for example, vendors:
The SQL Language Reference manual contradicts itself right away with the definition of NULL. Null is not equivalent to zero or to blank... Empty strings have a null value...In subsequent section "Nulls and search conditions," the manual doesn't distinguish NULL and zero-length string. It goes on to explain that SELECT ... WHERE x IS NULL returns null rows as well as empty strings. The problem is, the converse is not true. Such a statement is guaranteed to return nothing: SELECT ... WHERE x = "". This is a very nasty exception! Some further confusion in the manual can be found in the following: If any item in an expression contains a null value, then the result of evaluating the expression is null. However, this doesn't seem to apply to NULL string. These expressions simply treat NULL as empty strings, @length(x)'['||x||']'. Furthermore, x IS NULL never results in null! It seems the vendor has major confusion on NULL. --Online message
I'm having difficulty creating a table with one of the columns in a composite primary key being NULL[able]. Example: table ABC with columns in the PK ' A' NOT NULL, ' B' NOT NULL and ' C' NULL. Logically, I have a valid business reason for wanting to implement a table with at least one column of the compound PK being nullable. --Online message
Note very carefully that any "valid business reason" for "one column of a compound PK being nullable" is also a contradiction in terms: it conflicts with the key constraint -- uniqueness: it is not possible to ensure PK value uniqueness if values are, either wholly, or partially unknown. The choice of columns with unknown values as PKs indicates serious conceptual and logical confusion and, therefore, bad database design.
Even though E.F. Codd, the inventor of the relational model, added two kinds of mark -- A-MARK for "value applicable but missing" and I-MARK for "value inapplicable" -- to the original model, NULL is IBM's specific implementation of his concepts in SQL. NULLs can represent both missing and inapplicable values (an example of making things simpler than the simplest possible). The problem is that relational theory actually breaks down in the presence of NULLs, because it requires that all information in a relational database should be represented explicitly and in only one way: as values in tables. And because NULLs are not values, tables with NULLs are not R-tables, relational operators fail with NULLs (should a join include or exclude rows with NULLs in join columns?), as do integrity enforcement (does a row violate constraints that apply to columns which contain NULLs?), normalization does not work, and so on. In other words, NULLs are outside theory, where correctness cannot be guaranteed. To say, therefore, that "there is no reason to avoid NULL[s]" is very misleading and quite irresponsible. The extract from the vendor's manual and and the user comments above, which are quite representative, make this abundantly clear why. The relational model is just an application of predicate logic to database management, which is predicated on two-valued logic (2VL): in the real world propositions are either true or false, whether we know which it is or not. What we don't know we simply cannot assert for a fact, no matter how badly we need or want to -- the only way to make something assertable is to know it. No workaround -- be it NULLs or anything else -- can work around this and will produce complications and incorrect or hard to interpret query results (see above mentioned Chapter 10).
Suppose that the salary amount for some employees is unknown. In Figure 1A, rows representing those employees have missing SALARY values. Note that SALARY values do exist in the real world for all employees, it's just that some of the amounts are unknown and therefore missing. Consider now Figure 1B: a table with all nonkey values unknown. Few would dispute that it is informationally useless to record such "full nothings" in databases, yet "partial nothings" such as those in Figure 1A are deemed useful in some way. The argument seems to be that even though some salary amounts are unknown, it is nevertheless known that all employees do earn some salary. Note very carefully, though, that what is assertable, then, is not the salary amount, but the existence of a salary. Thus, only four values can be asserted for all employees: employee number, name, department, hire date and existence of a salary; the salary amount can be asserted only for employees with known salaries. In other words, there are two kinds of assertions here: propositions about all employees and propositions about employees with known salaries.
The correct database representation is in Figure 2. Note that it is devoid of missing values. Chapter 10 in the book explains the implications for database integrity and manipulation and for DBMS implementation.
About the author
Fabian Pascal has a national and international reputation as an independent technology analyst, consultant, author and lecturer specializing in data management. He was affiliated with Codd & Date and for more than 15 years held various analytical and management positions in the private and public sectors, has taught and lectured at the business and academic levels, and advised vendor and user organizations on database technology, strategy and implementation. Clients include IBM, Census Bureau, CIA, Apple, Borland, Cognos, UCSF, IRS. He is founder and editor of Database Debunkings, a web site dedicated to dispelling prevailing fallacies and misconceptions in the database industry, where C.J. Date is a senior contributor. He has contributed extensively to most trade publications, including Database Programming and Design, DBMS, DataBased Advisor, Byte, Infoworld and Computerworld. His third book, "Practical issues in database management" (Addison Wesley, June 2000), serves as text for a seminar bearing the same name. He can be contacted at firstname.lastname@example.org.
For More Information
- What do you think about this column? E-mail the Editor at email@example.com with your feedback.
- The Best Web Links on XML and databases
- Post your technical questions--or help out your peers by answering them--in our live discussion forums.
- Ask the Experts! Our database design, SQL, Oracle, DB2, and SQL Server gurus will answer your toughest questions.
This was first published in November 2001