Why is x > 1000 true when x contains missing values?

Title   Logical expressions and missing values
Author William Gould, StataCorp

Stata codes missing values (., .a, .b, .c, ..., .z) as larger than any nonmissing value, so, literally, x > 1000 is true when x is missing. This behavior can lead to problems. Consider either of the following:

        . keep if x > 1000 

        . gen xbig = (x > 1000)

The first statement keeps all the observations for which x > 1000 or x is missing. The second statement creates xbig equal to 1 or 0, the value being 1 when x > 1000 or x is missing. This result is probably not what the user intends. The statements would be better written as

        . keep if x > 1000 & x<.

        . gen xbig = (x > 1000) if x<.
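
As a minimal sketch of the difference, assuming a made-up four-observation dataset (the variable names xbig_naive and xbig_safe are illustrative only), the following creates both versions of the indicator side by side:

        . clear
        . set obs 4
        . * x = 500, 1500, ., .a  (observation 3 is left as system missing)
        . generate x = 500 in 1
        . replace x = 1500 in 2
        . replace x = .a in 4
        . * naive version: a missing x compares as larger than 1000
        . generate byte xbig_naive = (x > 1000)
        . * guarded version: evaluated only where x is nonmissing
        . generate byte xbig_safe = (x > 1000) if x < .
        . list

Here xbig_naive is 1 for both missing observations, whereas xbig_safe is left missing for them, because x < . is false for every missing value, including the extended codes .a through .z. Writing !missing(x) in place of x < . should be equivalent.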

Why does Stata treat missing values in this way?

It is not possible with two-valued logic (True–False) to have missing values propagate through logical statements. A Boolean expression must ultimately evaluate to either true or false. Consider the statement

        . keep if z

What happens to the observations where z is missing? Do we keep them or not? To see why this matters, pretend we redefined the way Stata handles missing values so that x > 1000 evaluated to missing when x was missing. Thus

        . gen xbig = (x > 1000)

would do what we want because x > 1000 could evaluate to missing. So, now consider

        . keep if x > 1000

What is keep to do when x is missing? x > 1000 evaluates to missing, which really means that we do not know whether x > 1000. Yet keep must either keep or drop each observation.

To resolve this issue, we are forced into a three-valued logic: true, false, or missing. We must now generalize all logical operators to the three values. We might do this as follows:

        and |  T  F  .        or |  T  F  .        not |  T  F  .
        ----+---------       ----+---------        ----+---------
         T  |  T  F  .        T  |  T  T  T            |  F  T  .
         F  |  F  F  F        F  |  T  F  .
         .  |  .  F  .        .  |  T  .  .
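
For illustration only, the hypothetical tables above can be emulated under Stata's existing rules with the cond() function. This sketch assumes two made-up variables a and b restricted to the values 1 (T), 0 (F), and . (missing); the names and3, or3, and not3 are likewise invented here and are not Stata features.

        . clear
        . set obs 9
        . * build all nine combinations of a and b over {1, 0, .}
        . generate a = cond(_n <= 3, 1, cond(_n <= 6, 0, .))
        . generate b = cond(mod(_n, 3) == 1, 1, cond(mod(_n, 3) == 2, 0, .))
        . * and: F if either side is F; otherwise missing if either side is missing
        . generate and3 = cond(a == 0 | b == 0, 0, cond(missing(a) | missing(b), ., 1))
        . * or: T if either side is T; otherwise missing if either side is missing
        . generate or3 = cond(a == 1 | b == 1, 1, cond(missing(a) | missing(b), ., 0))
        . * not: missing stays missing
        . generate not3 = cond(missing(a), ., !a)
        . list a b and3 or3 not3

Each three-valued operator needs its own nested cond(), which hints at how much extra machinery a built-in three-valued logic would carry.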

Further, suppose we decided that a command acts on an observation only when its expression finally resolves to T (and not to F or missing). We would have made generate behave as you would expect, but we would also have introduced a notable inconsistency:

        drop if var1      IS NO LONGER THE SAME AS        keep if !var1
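
To see why the two commands would diverge, consider a made-up variable var1 holding 1, 0, and missing. Under the hypothetical rule, a command acts on an observation only when its expression resolves to T, and !var1 is missing wherever var1 is missing. The sketch below imitates that rule with explicit comparisons in current Stata; it is an emulation, not a real option.

        . clear
        . set obs 3
        . generate var1 = 1 in 1
        . replace var1 = 0 in 2
        . * observation 3 stays missing
        . * hypothetical "drop if var1": drops only var1 == 1, so these remain
        . count if !(var1 == 1)
        . * hypothetical "keep if !var1": keeps only var1 == 0
        . count if var1 == 0

The first count is 2 (the 0 and the missing observation survive the drop), and the second is 1 (only the 0 survives the keep), so the two commands would no longer leave the same data.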

What we might ultimately need is a second set of comparison operators that are three-valued (but keep the current set of comparison operators) so that people can use the system that best suits a given statement.

All these statements can be made to work, but they are complicated and yield some surprising results (such as the drop/keep inconsistency shown above). We feel that most users—including ourselves—would find this more confusing than the system currently in place.

In the current system, you must be aware that missing values are coded and treated as positive infinity. Once this fact is absorbed, everything is consistent, drop and keep statements work as one would expect, and the logical comparisons make sense.
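
A quick check of that consistency under the current rules, again with a made-up variable var1 holding 1, 0, and missing: any nonzero value, missing included, counts as true, so var1 and !var1 split the observations cleanly.

        . clear
        . set obs 3
        . generate var1 = 1 in 1
        . replace var1 = 0 in 2
        . * observation 3 stays missing, which is nonzero and therefore true
        . count if var1
        . count if !var1
        . * sorting also places missing values after every nonmissing value
        . sort var1
        . list

The counts are 2 and 1, which together cover all three observations, so drop if var1 and keep if !var1 leave the same single observation.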

Changing to a three-valued logic might make some comparisons more predictable but will introduce inconsistencies elsewhere; that is, you would have to remember several rules for how missing values were handled in different situations instead of just one rule.