Stata | FAQ: Logical expressions and missing values


Why is x > 1000 true when x contains missing values?

Title		Logical expressions and missing values
Author		William Gould, StataCorp

Stata codes missing values (., .a, .b, .c, ..., .z) as larger than any nonmissing value, so x > 1000 is literally true when x is missing. This can lead to problems. Consider the following two commands:

        . keep if x > 1000 

        . gen xbig = (x > 1000)

The first statement keeps all the observations for which x > 1000 or x is missing. The second statement creates xbig equal to 1 or 0, the value being 1 when x > 1000 or x is missing. This result is probably not what the user intends. The statements would be better written as

        . keep if x > 1000 & x < .

        . gen xbig = (x > 1000) if x < .
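A small dataset makes the difference visible. The do-file fragment below is only a sketch: the five values of x are made up, and xbig_naive and xbig_safe are hypothetical variable names chosen for illustration.

```stata
* Illustrative data (made-up values): three numbers and two missing values
clear
input x
500
1500
2000
.
.a
end

* Naive version: a missing x counts as "greater than 1000"
generate xbig_naive = (x > 1000)

* Guarded version: xbig_safe is left missing wherever x is missing
generate xbig_safe = (x > 1000) if x < .

* Cross-tabulate the two; the missing option shows missing values as a category
tabulate xbig_naive xbig_safe, missing

* The unguarded condition also matches the two missing observations
count if x > 1000
assert r(N) == 4
count if x > 1000 & x < .
assert r(N) == 2
```

Because . is the smallest of the missing-value codes, the single condition x < . excludes the extended missing values (.a through .z) as well.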

Why does Stata treat missing values in this way?

It is not possible with two-valued logic (True–False) to have missing values propagate through logical statements. A Boolean expression must ultimately evaluate to either true or false. Consider the statement

        . keep if z

What happens to the observations where z is missing? Do we keep them or not? To see why this matters, suppose we redefined the way Stata handles missing values so that x > 1000 evaluated to missing when x was missing. Then

        . gen xbig = (x > 1000)

would do what we want: xbig would be missing wherever x is missing, because x > 1000 would evaluate to missing there. But now consider

        . keep if x > 1000

What is keep to do when x is missing? x > 1000 evaluates to missing, which really means that we do not know whether x > 1000. Yet keep must either keep or drop each observation.

To resolve this issue, we are forced into a three-valued logic: true, false, or missing. We must now generalize all logical operators to the three values. We might do this as follows:

        and |  T  F  .        or |  T  F  .        not |  T  F  .
        ----+---------       ----+---------        ----+---------
         T  |  T  F  .        T  |  T  T  T            |  F  T  .
         F  |  F  F  F        F  |  T  F  .
         .  |  .  F  .        .  |  T  .  .
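The tables above can be reproduced in current Stata by emulation. The following is a sketch under stated assumptions, not a built-in feature: truth values are coded 1 (T), 0 (F), and . (missing), and and3, or3, and not3 are hypothetical variable names. Only the existing functions cond() and missing() are used.

```stata
* All nine combinations of two three-valued operands (1 = T, 0 = F, . = missing)
clear
input a b
1 1
1 0
1 .
0 1
0 0
0 .
. 1
. 0
. .
end

* AND: F if either operand is F; otherwise missing if either is missing; otherwise T
generate and3 = cond(a == 0 | b == 0, 0, cond(missing(a, b), ., 1))

* OR: T if either operand is T; otherwise missing if either is missing; otherwise F
generate or3 = cond(a == 1 | b == 1, 1, cond(missing(a, b), ., 0))

* NOT: missing stays missing; otherwise the usual negation
generate not3 = cond(missing(a), ., !a)

list, noobs

* Spot checks against the tables
assert and3 == 0 if a == 0 & missing(b)
assert or3 == 1 if a == 1 & missing(b)
assert missing(not3) if missing(a)
```

Each first argument to cond() is an ordinary two-valued Stata expression, so it evaluates to 0 or 1 and never to missing; the three-valued behavior comes entirely from the explicit missing() checks.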

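Further, suppose we decided that a command acts on an observation only if its expression resolves to T (and not to F or missing). generate would then behave as you would expect, but we would have introduced a notable inconsistency. When var1 is missing, drop if var1 does not drop the observation, because the condition is not T, whereas keep if !var1 does not keep it, because !var1 is missing as well. So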

        drop if var1      IS NO LONGER THE SAME AS        keep if !var1
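The contrast can be checked on a three-observation example. This is a sketch with made-up data. The first half runs the real commands under current Stata rules. The second half only simulates the hypothetical "act only when T" rule by counting which observations would survive. not3 is a hypothetical three-valued negation, not a Stata feature.

```stata
* One true, one false, and one missing value of var1 (illustrative data)
clear
input var1
1
0
.
end

* Current Stata: missing is nonzero and therefore true, and !. evaluates to 0
preserve
drop if var1
count
local n_drop = r(N)
restore

preserve
keep if !var1
count
local n_keep = r(N)
restore

* Both commands leave only the var1 == 0 observation
assert `n_drop' == 1
assert `n_keep' == 1

* Hypothetical three-valued rule: a command acts only where its condition is T
generate not3 = cond(missing(var1), ., !var1)

* Survivors of "drop if var1": everything not definitely T, so missing survives
count if !(var1 == 1)
assert r(N) == 2

* Survivors of "keep if !var1": only where the negation is definitely T
count if not3 == 1
assert r(N) == 1
```

Under the simulated rule the observation with missing var1 survives the drop but not the keep, which is the asymmetry described above.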

What we might ultimately need is a second set of comparison operators that are three-valued (but keep the current set of comparison operators) so that people can use the system that best suits a given statement.

All of this can be made to work, but it is complicated and yields some surprising results (such as the drop/keep inconsistency shown above). We feel that most users, ourselves included, would find this more confusing than the system currently in place.

In the current system, you must be aware that missing values are coded and treated as positive infinity. Once this fact is absorbed, everything is consistent, drop and keep statements work as one would expect, and the logical comparisons make sense.
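The single rule can be verified directly. The sketch below uses made-up values. It shows that every missing code compares larger than any nonmissing number, that the codes are ordered among themselves, and that x < . and !missing(x) select the same observations.

```stata
* Any nonmissing number is smaller than ., and . < .a < ... < .z
display 1e300 < .
display . < .a
display .a < .z

* Illustrative data: two numbers, the system missing value, and an extended one
clear
input x
500
1500
.
.z
end

* Two equivalent tests for "x is not missing"
count if x < .
assert r(N) == 2
count if !missing(x)
assert r(N) == 2

* x >= . catches every missing code, not only the system missing value
count if x >= .
assert r(N) == 2

* Sorting places the missing values last
sort x
list, noobs
```

Each of the three display commands prints 1, since each comparison is true.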

Changing to a three-valued logic might make some comparisons more predictable, but it would introduce inconsistencies elsewhere: you would have to remember several rules for how missing values are handled in different situations instead of just one.
