Thursday, April 29, 2010

NULL Confusion

The SQL database offers a domain-independent NULL value. This is a terrible thing, and should be treated with a depth of respect and fear.

Before using NULL values in a database, read things like "Null Values in Fuzzy Datasets" and "Null Values Revisited in Prospect of Data Integration".

See this question, and the answers -- people are very, very confused by NULL. The issue is that the SQL NULL conflates several separate and unrelated concepts.
  • Not Available Now. We expect that the data will be discovered later. That is, this NULL is a work-around for a process issue.
  • Not Applicable (or Optional). This means that disjoint subtypes have been unified into a single table with optional or not-applicable attributes. This is an optimization choice. This also be due to state changes: the attribite is not used in one state, but will be used in another state.
With two meanings for a single value, hilarity always ensues.

Further Confusion

The NULL value doesn't participate in comparisons or indexes. This is -- apparently -- shocking to some people. Here's a nice summary in "Why NULL never compares false to anything in SQL". Also, some notes for Oracle users in "Understand Oracle null values" and "Oracle conditions and how they handle NULL"

Because of this "NULL doesn't compare" problem, people get baffled by the use of NVL (or IFNULL) functions.

The Rules of NULL

The first rule of NULL is "Don't." Don't define a data model that depends on nullability. Define a model where each class is distinct and all the attributes are required. This will lead to a number of focused, distinct class definitions. A large number. Get over it. Don't pre-optimize a design to reduce your number of classes or tables.

Once you have a model that makes sense -- one that you can prove will work -- one that precisely matches the semantics of your problem -- you can optimize. But don't start out by pre-optimizing or taking "obvious" short-cuts.

The second rule of NULL is "Don't conflate Availability with Applicability." If you have data that is not available, you may have serious issues in the process you're trying to automate. Often, this is because you have multiple views of a single entity. If you have data that's not applicable, you've done your design wrong -- you put disjoint subtypes into a single class definition. Factor them (for now) into class definitions where all attributes are required.

If you have inapplicable or unavailable data, you must factor things into pieces where all attributes are required. You will then find that your "thing with optional attributes" is either "thing that changes state" or "thing with multiple subsets of attributes that must be joined together." Later, you can think about optimizing.

"But," you object, "it's a single entity, I can't meaningfully decompose it."

Consider the root causes for missing data before you take this position too seriously.

Let's take a concrete example. We're doing econometric analysis of businesses. We have a "Business" entity that has various kinds of measurements from various sources. Not all sources of information have data on all businesses. There's a lot of "not available" data. Also, depending on the type of business, there may be a certain amount of "not applicable" data. (For example, not-for-profit corporations, privately held companies and public companies all have different kinds of reporting requirements.)

What you have is a core "Business Name" entity with the minimal information. Name; maybe other identifying information. But often, the name is all you know.

You have a "Business Address" entity with mailing address. Small businesses may have only one of these. Large businesses will have more than one.

You have "Econometric Scoring From Source X" which may have a normalized name, a different address and other scores.

Conceptually, this is a single entity, a business. But from three different points of view. Initially, this is not a single class definition. It may be optimized into a single table with NULLs. That's a lousy design, but very popular. There are multiple addresses; scores change over time. Implementing this as a "single-table-with-nulls" seems to be a bad policy.

The third rule of NULL is "Only as an optimization." If you can prove that a particular join is nearly one-to-one, and you can prove that the cost of the join is too high, then you can consider pre-joining and using NULLs.

Common Mistakes

There are two common NULL-join mistakes: optional joins and date ranges.

One common mistake is attempting to join on an "optional" attribute. You wind up with NVL functions in the WHERE clause. RED ALERT.

If you have NVL functions in a WHERE clause then (1) you've defeated all indexing and (2) you've reinvented the wheel.

An NVL to a where clause is stand-in for a UNION. When you think you need an NVL, you've got two subsets that you're tying to put together: the subset with a non-NULL value and the subset with a NULL value. This is a UNION, and the UNION will probably be faster than your NVL construct. (If it isn't, good on you for benchmarking.)

Another common mistake is date ranges. Folks insist on having "open-ended" date ranges where the "end-of-time" is represented as a NULL. RED ALERT. NULL already means not applicable or not available. The end of time is both applicable and available. Don't add a new meaning to NULL.

Coding the end of time as NULL is simply wrong. The end of time (for now) is 12-21-2112. It's an easy date to remember. It's a cooler date in Europe than the US.

"But," the deeply argumentative claim, "I can't have my application dependent on a mystery constant like that". Lighten up. First, your app won't be running in 2112. Second, your app is full of mystery constants like that. You've got codes, your users have codes they think are important. Your localtime offset from GMT is a mystery constant. The start and end dates for daylight savings time are mystery constants. Please. You have lots of mystery constants. If you want to be "transparent", make it a configuration parameter.

Just code the end-of-time as a real date not a NULL and use ordinary date-range logic with no silly NVL business.