Why is x > 1000 true when x contains a missing value?
Title: Logical expressions and missing values
Author: William Gould, StataCorp
Date: March 1997; updated February 2003
Stata codes missing values (., .a, .b, .c, ..., .z) as larger than any
nonmissing value, so when x is missing, x > 1000 is literally true.
This behavior can lead to problems. Consider either of the
following:
. keep if x > 1000
. gen xbig = (x > 1000)
The first statement keeps all the observations for which x > 1000
or x is missing. The second statement creates xbig equal to 1
or 0, the value being 1 when x > 1000 or x is missing.
This result is probably not what the user intends. The statements would be
better written as
. keep if x > 1000 & x<.
. gen xbig = (x > 1000) if x<.
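The difference between the naive and the guarded statements can be checked on a small dataset. The following is a minimal sketch to run as a do-file; the five values of x and the variable name xbig_naive are made up for illustration.

```stata
* Toy data: five observations, one of them missing (values are illustrative)
clear
input x
500
1500
.
2000
800
end

* Naive indicator: the observation with missing x is flagged as "big"
generate xbig_naive = (x > 1000)
count if xbig_naive == 1
assert r(N) == 3

* Guarded indicator: missing x yields missing xbig
generate xbig = (x > 1000) if x < .
count if xbig == 1
assert r(N) == 2
assert missing(xbig) if missing(x)

* missing(x) is another way to write the same guard
count if x > 1000 & !missing(x)
assert r(N) == 2
```

Because every missing code (., .a, ..., .z) is larger than any nonmissing value, the guard x < . excludes all of them at once.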
Why does Stata treat missing values in this way?
It is not possible with two-valued logic (True–False) to have missing
values propagate through logical statements. A Boolean expression must
ultimately evaluate to either true or false. Consider the statement
. keep if z
What happens to the observations where z is missing? Do we keep them or
not? To see why this matters, suppose we redefined the way Stata handles
missing values so that x > 1000 evaluated to missing when x
was missing. Then
. gen xbig = (x > 1000)
would do what we want because x > 1000 would evaluate to
missing when x is missing. Now consider
. keep if x > 1000
What is keep to do when x is missing? x > 1000
evaluates to missing, which really means that we do not know whether x
> 1000. Yet keep must either keep or drop each observation.
To resolve this issue, we are forced into a three-valued logic: true, false,
or missing. We must now generalize all logical operators to the three
values. We might do this as follows:
and | T F .      or | T F .     not | T F .
----+------     ----+------     ----+------
  T | T F .       T | T T T         | F T .
  F | F F F       F | T F .
  . | . F .       . | T . .
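Stata has no three-valued operators, but the tables can be emulated with cond() to see what such a system would return. This is only a sketch: the variables a and b (coded 1 for T, 0 for F, and . for missing) and the result names and3, or3, and not3 are hypothetical.

```stata
* All nine combinations of T (1), F (0), and missing (.)
clear
input a b
1 1
1 0
1 .
0 1
0 0
0 .
. 1
. 0
. .
end

* and: F if either operand is F; else missing if either is missing; else T
generate and3 = cond(a == 0 | b == 0, 0, cond(missing(a) | missing(b), ., 1))

* or: T if either operand is T; else missing if either is missing; else F
generate or3 = cond(a == 1 | b == 1, 1, cond(missing(a) | missing(b), ., 0))

* not: missing stays missing; otherwise T and F are swapped
generate not3 = cond(missing(a), ., 1 - a)

* Spot checks against the tables
assert and3 == 0 if a == 0 & missing(b)
assert missing(and3) if a == 1 & missing(b)
assert or3 == 1 if a == 1 & missing(b)
assert missing(or3) if a == 0 & missing(b)

list, separator(3)
```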
Further, suppose we decided that a command acts on an observation only if
the expression finally resolves to T (and not F or missing). We have made
generate behave as you would expect, but we have introduced a notable
inconsistency:
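. drop if var1
is no longer the same as
. keep if !var1
When var1 is missing, both var1 and !var1 evaluate to missing, so
drop if var1 leaves the observation in the data, whereas keep if !var1
removes it.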
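For contrast, under Stata's actual two-valued rule the two commands do agree, because a missing value is nonzero and therefore true, and !var1 is 0 when var1 is missing. A minimal sketch with three made-up observations:

```stata
* One false, one true, and one missing value of var1
clear
input var1
0
1
.
end

* drop if var1 removes the true and the missing observations
preserve
drop if var1
assert _N == 1
assert var1 == 0
restore

* keep if !var1 retains only the observation where var1 is 0
preserve
keep if !var1
assert _N == 1
assert var1 == 0
restore
```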
What we might ultimately need is a second set of comparison operators that
are three-valued, offered alongside the current two-valued set, so that
people can use the system that best suits a given statement.
All of this can be made to work, but it is complicated and yields
some surprising results (such as the drop/keep inconsistency shown above).
We feel that most users, including ourselves, would find this more
confusing than the system currently in place.
In the current system, you must be aware that missing values are coded and
treated as positive infinity. Once this fact is absorbed, everything is
consistent, drop and keep statements work as one would expect, and the
logical comparisons make sense.
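The single rule can be verified directly with assert. The following sketch uses only literal values; 1e300 is just an arbitrary very large number.

```stata
* Every missing code compares larger than any nonmissing number
assert 1e300 < .

* The missing codes are themselves ordered: . < .a < ... < .z
assert . < .a
assert .a < .z

* So x < . is true only for nonmissing x, and x >= . is true for every missing code
assert !(.a < .)
assert .z >= .
```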
Changing to a three-valued logic might make some comparisons more predictable
but will introduce inconsistencies elsewhere; that is, you
would have to remember several rules for how missing values were handled in
different situations instead of just one rule.