Title | Logical expressions and missing values

Author | William Gould, StataCorp

Date | March 1997; updated February 2003

Stata codes missing values (**.**, **.a**, **.b**, **.c**, ...,
**.z**) as larger than any nonmissing value, so **x > 1000** is
literally true when **x** is missing. This behavior can lead to problems.
Consider the following commands:

. keep if x > 1000
. gen xbig = (x > 1000)

The first statement keeps all the observations for which **x > 1000**
or **x** is missing. The second statement creates **xbig** equal to 1
or 0, the value being 1 when **x > 1000** or **x** is missing.
This result is probably not what the user intends. The statements would be
better written as

. keep if x > 1000 & x < .
. gen xbig = (x > 1000) if x < .
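The difference between the unguarded and guarded forms can be sketched outside Stata. The following is a minimal Python model, not Stata code: it assumes a single missing value represented as positive infinity (the name `MISSING` is hypothetical, and Stata's extended missing values **.a** through **.z** are not modeled), and it uses `None` to stand in for a result that the **if** qualifier leaves missing.

```python
import math

# Hypothetical stand-in for Stata's system missing value, modeled as +infinity.
MISSING = math.inf

x = [500, 1500, MISSING, 2000, MISSING]

# Analogue of: gen xbig = (x > 1000)
# Missing compares as larger than 1000, so it is flagged as "big".
xbig_naive = [int(v > 1000) for v in x]

# Analogue of: gen xbig = (x > 1000) if x < .
# Observations excluded by the qualifier are left as None (a missing result).
xbig_guarded = [int(v > 1000) if v < MISSING else None for v in x]

# Analogue of: keep if x > 1000 & x < .
kept = [v for v in x if v > 1000 and v < MISSING]

print(xbig_naive)    # [0, 1, 1, 1, 1]
print(xbig_guarded)  # [0, 1, None, 1, None]
print(kept)          # [1500, 2000]
```

In this model the guard `v < MISSING` plays the role of **x < .**: it is false exactly when the value is missing.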

It is not possible with two-valued logic (True–False) to have missing values propagate through logical statements. A Boolean expression must ultimately evaluate to either true or false. Consider the statement

. keep if z

What happens to the observations where **z** is missing? Do we keep them or
not? To see why this matters, suppose we redefined the way Stata handles
missing values so that **x > 1000** evaluated to missing when **x**
was missing. Then

. gen xbig = (x > 1000)

would do what we want because **x > 1000** could evaluate to
missing. Now consider

. keep if x > 1000

What is **keep** to do when **x** is missing? **x > 1000**
evaluates to missing, which really means that we do not know whether
**x > 1000**. Yet **keep** must either keep or drop each observation.

To resolve this issue, we are forced into a three-valued logic: true, false, or missing. We must now generalize all logical operators to the three values. We might do this as follows:

| and | T | F | . |
|-----|---|---|---|
| T | T | F | . |
| F | F | F | F |
| . | . | F | . |

| or | T | F | . |
|----|---|---|---|
| T | T | T | T |
| F | T | F | . |
| . | T | . | . |

| not | T | F | . |
|-----|---|---|---|
|  | F | T | . |
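These three-valued operators can be written out explicitly. The sketch below is Python rather than Stata, and it describes the hypothetical scheme discussed here, not anything Stata implements. It models T, F, and missing as `True`, `False`, and `None`; the function names `and3`, `or3`, and `not3` are made up for illustration.

```python
# T, F and missing are modeled as True, False and None (an assumption of this sketch).
VALUES = [True, False, None]

def and3(a, b):
    # A false operand settles the result; otherwise missing propagates.
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def or3(a, b):
    # A true operand settles the result; otherwise missing propagates.
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def not3(a):
    # Negating an unknown value is still unknown.
    return None if a is None else (not a)

def show(v):
    return {True: "T", False: "F", None: "."}[v]

# Print the and/or tables row by row, then the single-row not table.
for name, op in (("and", and3), ("or", or3)):
    print(f"{name:>3} | T F .")
    for a in VALUES:
        print(f"{show(a):>3} | " + " ".join(show(op(a, b)) for b in VALUES))
print("not | " + " ".join(show(not3(a)) for a in VALUES))
```

Note that missing does not always propagate: `and3(None, False)` is false and `or3(None, True)` is true, because the other operand already determines the answer.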

Further, suppose we decided that a command acts on an observation only if its expression finally resolves to T (and not to F or missing). We would have made **generate** behave as you would expect, but we would also have introduced a notable inconsistency:

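**drop if var1** IS NO LONGER THE SAME AS **keep if !var1**

When **var1** is missing, the first command leaves the observation in the data, whereas the second removes it.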
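The inconsistency can be reproduced with a small Python model of the hypothetical three-valued rule. It assumes, as in the discussion above, that a command acts on an observation only when its expression resolves to T; `None` models a missing value, and `not3` is an illustrative helper, not a Stata function.

```python
def not3(a):
    # Three-valued negation: missing stays missing.
    return None if a is None else (not a)

var1 = [True, False, None]     # None models a missing value
obs = list(range(len(var1)))   # observation indices 0, 1, 2

# "drop if var1": remove observations where var1 resolves to T;
# observations where var1 is F or missing survive.
after_drop = [i for i in obs if var1[i] is not True]

# "keep if !var1": retain only observations where !var1 resolves to T.
after_keep = [i for i in obs if not3(var1[i]) is True]

print(after_drop)  # [1, 2]
print(after_keep)  # [1]
```

Observation 2, where `var1` is missing, survives the drop but not the keep, so the two commands no longer select the same data.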

What we might ultimately need is a second set of three-valued comparison operators, offered alongside the current two-valued set, so that people could use whichever system best suits a given statement.

All these statements can be made to work, but they are complicated and yield some surprising results (such as the drop/keep inconsistency shown above). We feel that most users—including ourselves—would find this more confusing than the system currently in place.

In the current system, you must be aware that missing values are coded and treated as positive infinity. Once this fact is absorbed, everything is consistent, drop and keep statements work as one would expect, and the logical comparisons make sense.
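For contrast, here is the same kind of check under the current rule, again as a Python model rather than Stata code. It assumes a single missing value represented as positive infinity (the name `MISSING` is hypothetical), so every comparison is plainly true or false.

```python
import math

# Hypothetical stand-in for Stata's missing value, modeled as +infinity.
MISSING = math.inf

x = [500, 1500, MISSING]

# "drop if x > 1000": remove observations where the comparison is true.
after_drop = [v for v in x if (v > 1000) is not True]

# "keep if !(x > 1000)": retain observations where the negation is true.
after_keep = [v for v in x if (not (v > 1000)) is True]

print(after_drop)  # [500]
print(after_keep)  # [500]
```

Because the comparison is never unknown, the drop and its negated keep select the same observations. The missing value is treated as large in both, which is the one rule to remember.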

Changing to a three-valued logic might make some comparisons more predictable, but it would introduce inconsistencies elsewhere: you would have to remember several rules for how missing values are handled in different situations instead of just one.