|
Posted by Hugo Kornelis on 09/26/06 19:48
On 26 Sep 2006 08:18:09 -0700, tlyczko wrote:
>
>Hugo Kornelis wrote:
>
>> So short answer - don't use a default to signify missing data, use NULL.
>> And make sure that you understand three-valued logic and other quirks of
>> the behaviour of NULL.
>
>Can you suggest any good references on this topic??
>The ones I know of are based on Access, and I'm sure SS2005 is quite
>different!!!
Hi Tom,
Any good book on relational databases should do. I would be quite
surprised if there's any book on the market that fails to cover the
behaviour of NULL. (Except for books targetted at an experienced
audience, of course).
I'll give a very short explanation below.
>> really the most appropriate for a "missing value" marker. But I don't
>> think that you'll find anyone who would argue that '123-45-6789' will
>> ever behave more appropriate for "missing value" than NULL does.
>
>Wouldn't using a default value at least avoid the issues of using NULL
>per se??
No, you'd be doubly hit by the issues of getting correct results from a
database with missing values. All issues surrounding NULL are related to
defining how missing values should influence the outcome of expressions
and comparisons - you'd have the same issues with a default value that
represents "missing", plus you no longer have the database engine
handling them correctly by default. Oh, and you run the risk that one
day, you need to use the default value as a real value - then what?
Basic issues with NULL:
======================
1. NULL in expressions.
Since NULL represents a missing value, the DB engine doesn't know what
value should be there, or even IF there should be a value there. So if
asked to calculate 3 + NULL, the DB concludes that it can't computer the
result of adding three to a value that has not been given. The answer to
that question can't be given, so the result of 3 + NULL equates to NULL.
More in general - any expression with NULL as one of its operands will
always result in NULL. That holds true even for expressions that a human
could simplify - e.q. calculate 3 + A - A for A equal to NULL: you see
that the same value is added and subtracted again, so you'd know that
the answer is always three, regardless of the value of A. The DB will
first substitute symbols, ending up with 3 + NULL - NULL, i.e. start
with three, add an undisclosed amount, then subtract an undisclosed
amount. Now, the question is unanswerable - the result is NULL.
Note that it's not relevant if the SQL Server development team is or
isn't able to remove the redundant "+ A - A" before substitution - the
behaviour of NULL is defined in the ANSI and ISO standards for SQL, so
such an optimization would result in results that are incorrect as per
the definitions in the standards for SQL.
2. NULL in comparisons.
Please answer this: "Is Hugo Kornelis older than George W. Bush?" No, I
won't tell you my age or date of birth. <g>
You can't answer that question - and yet, that's the kind of questions
that relational databases have to be able to handle. This is handled by
introducing three-valued logic. Comparisons can not only result in True
or False, but also in Unknown. Any comparison that involves NULL on
either side will always result in Unknown (except the special comparison
predicates IS NULL and IS NOT NULL). This also resulted in expanded
truth tables for AND, OR and NOT: (used a fixed font - T, U, and F stand
for True, Unknowd and False respectively).
AND | T | U | F OR | T | U | F NOT
----+---+---+--- ---+---+---+--- ---+---
T | T | U | F T | T | T | T T | F
U | U | U | F U | T | U | U U | U
F | F | F | F F | T | U | F F | T
Many of the things I wrote for NULL in expressions hold true here as
well. Eg, a human would instantly know that "IF A = B OR A <> B" is
always true, regardless of the values of A and B, but if eiither (or
both) is NULL, the database will evaluate this condition to unknown. And
even "IF A <> A" will evaluate to Unknown if A is NULL.
In a WHERE or HAVING expression, the row or group is only included in
the result set if the expression evaluates to True - both False and
Unknown get rejected. On the other hand, a CHECK constraint will only
raise an exception if the condition evaluates to False; if it's either
True or Unknown, the data is allowed.
3. NOT IN with NULL
This point is actually a logical result of point 2 above, but it's such
a common error that I'll mention it anyway:
WHERE Table1.Column1 NOT IN (SELECT foo FROM bar)
This will never return any rows if at least one of the values for foo in
table bar is NULL. The reason is simple, if you think about it - let's
say that bar has four rows and the collection of foo values is {3, 7,
NULL, 2}. Now, what could you possibly answer to the question: "Is a
given value unequal to any of these four values: 3, 7, 2, and a fourth
value that I won't disclose?". You would know foor sure that it's NOT
unequal to any of the four if the given value is 3, 7, or 2. And for any
othher given value, the best answer you can give is "I don't know" -
since it MIGHT be equal to the fourth, undisclosed value.
So SQL Server will evaluate this to either False (for 3, 7, or 2) or to
Unknown - and in both cases, the row gets rejected.
The solution is to rewrite it with NOT EXISTS:
WHERE NOT EXISTS (SELECT * FROM bar WHERE bar.foo = Table1.Column1)
(Or, if you really want to use NOT IN, exclude NULL values:
WHERE Table1.Column1 NOT IN (SELECT foo FROM bar WHERE foo IS NOT NULL)
--
Hugo Kornelis, SQL Server MVP
Navigation:
[Reply to this message]
|