Three Valued Logic in Stata
3-8-2001, Rev 3-11-2001, 3-21-2001
David Kantor, Institute for Policy Studies
Johns Hopkins University, Baltimore MD
Presented at the North American Stata User Group (NASUG) meeting,
Boston, MA, March 12 & 13, 2001
The following contains some edits made after the presentation at NASUG.
----
This discussion is about logical, or "Boolean" values in expressions.
As you know, Stata does not have a Boolean type; it uses numbers to
do the job, as do some general-purpose programming languages such as C.
In that general-purpose programming context, the use of numbers as
boolean quantities is usually satisfactory, but in Stata (or any data
analysis context) you have a complication that doesn't usually show up
in other programming contexts: missing values.
In using Stata over the past few years I often found myself constructing
logical expressions that build up from basic elements, with varying degrees of
complexity, for example...
(1) a | b
(2) a | b | c | d | e
(3) a & (~b)
Sometimes the variables involved would include missing values, and it seemed
that something was wrong in the way Stata handled these. In this context,
Stata considers missing values as equivalent to true. Thus, in examples 1
and 2, if any of the operands is missing, the result is 1 (true). (In
example 3, a missing a is likewise taken as true, but a missing b makes ~b
evaluate to 0, so the result is 0.) This didn't seem right; nor would it be
right if the results were 0
(false). It seemed that these situations yielded erroneous results and ought
to be excluded. In other words, the resulting value ought to be missing.
Thus, one might be tempted to write, to evaluate and assign the expression
in (2)...
(4) gen byte r3 = a | b | c | d | e if (a~=. & b~=. & c~=. & d~=. & e~=.)
(Note that the expression after the "if" does not include missing values in
its logical operands; it does not present a problem of the type
we are concerned with.)
But actually, this is too severe; I call it the Draconian solution.
It is not necessary to yield missing every time that any operand is
missing. Because of a certain property of the binary logical operators
(AND and OR), it is, in some situations, possible to get a definite result
even if some operands are missing. Thus we want to extend the traditional
operators to take advantage of this. The resulting system of values and
operators is known as Three-Valued Logic.
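The behavior described above can be seen on a handful of observations. The following is a minimal sketch with made-up data (the values are illustrative assumptions, not from any real dataset); it compares Stata's own OR with the Draconian expression (4).

```stata
* Five illustrative observations (0 = false, 1 = true, . = missing).
clear
input byte(a b c d e)
0 0 0 0 0
0 0 1 0 0
0 0 . 0 0
1 0 . 0 0
. . . . .
end

* Stata's own operator: any missing operand counts as true.
gen byte r_stata = a | b | c | d | e

* Expression (4): missing whenever any operand is missing.
gen byte r3 = a | b | c | d | e if (a~=. & b~=. & c~=. & d~=. & e~=.)

list
```

In the fourth observation a is known to be true, so the OR is certainly true, yet (4) sets the result to missing. That is the case the Draconian solution throws away.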
Three-Valued Logic is a small piece of a larger subject known as deviant
logics -- systems of logic other than the classical two-valued system.
The name Two-Valued logic refers to the two possible logic values: True and
False. Now we are introducing a third value: Missing (or Unknown or Maybe).
The people who are involved in the world of deviant logic are usually
concerned with more philosophical issues, such as the possibilities of
statements that are neither true nor false, or of different degrees of truth.
Here, we have a more practical approach. For now, we take the position that
the statements or the variables in question are either true or false, but
sometimes we have the misfortune of not knowing which it is.
This is the most conventional form of deviant logic; it almost doesn't
qualify as deviant.
As I mentioned above, the binary logical operators have a special property
that enables us to get nonmissing results from operands that are missing
(in some situations). This property is that a particular operand value
"flattens" the resulting value.
OR: one true operand yields true regardless of the value of the other.
AND: one false operand yields false regardless of the value of the other.
Thus, when evaluating an OR operation, if one operand is true, you don't need
to care what the other one is. The resulting value is TRUE regardless of
whether that other operand is TRUE or FALSE. Under our assumption of what a
missing Boolean value is, you would also take it that the result is TRUE,
even if that other operand is missing.
A similar situation shows up for AND with a FALSE operand.
(Note that this is the basis of short-circuit boolean evaluation -- a
feature often used by compilers and interpreters. The resulting value of a
Boolean expression can possibly be determined before the whole expression is
evaluated. Then the remainder of the evaluation is skipped.)
With this in mind, you can extend the operators to include missing as
an operand value, yielding nonmissing results whenever this flattening
property makes it possible. Another way to look at it is to ask,
"what if that missing value were true? -- what if it were false? -- would the
result be unambiguous?"
When you do this, you get the following operator tables. (0 and 1 are used
for representing false and true, and the binary operators are written
as two-way tables. We also extend the unary NOT operator.)
(5)
 OR | 0  1  .       AND | 0  1  .        x | NOT x
 ---+---------      ----+---------      ---+------
  0 | 0  1  .         0 | 0  0  0        0 |   1
  1 | 1  1  1         1 | 0  1  .        1 |   0
  . | .  1  .         . | 0  .  .        . |   .
Some basic facts about these operators:
1: AND and OR are both commutative.
2: AND and OR are both associative.
3: NOT is self-inverse.
4: 1 (TRUE) is the identity element for AND.
5: 0 (FALSE) is the identity element for OR.
6: DeMorgan's laws still hold.
7: AND distributes over OR.
8: OR distributes over AND.
I will not prove these presently. You can do them as exercises.
(I actually demonstrated some of them for my own benefit using the Stata egen
programs that I will mention shortly.)
All of these are properties of two-valued logic that are preserved when you
move to three-valued logic. (Under Stata's existing logical operators
(&, |, ~), Item 3 fails when missing values are present (or actually, when
anything besides 0 and 1 is used). The others still hold, but I disagree
with the resulting values in some cases.)
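Some of the basic facts listed above can be checked by brute force over all 27 combinations of three operands. The sketch below is one way to do that; it writes the operators of table (5) directly with cond(), which is an assumption of this sketch (it is valid only for inputs restricted to 0, 1, and missing) and is not the egen implementation mentioned above. The variable names are hypothetical.

```stata
* All 27 combinations of three operands, each 0, 1, or missing.
clear
set obs 27
gen byte a = mod(_n - 1, 3)
gen byte b = mod(int((_n - 1) / 3), 3)
gen byte c = int((_n - 1) / 9)
foreach v of varlist a b c {
    replace `v' = . if `v' == 2
}

* Three-valued OR from table (5), for 0/1/missing inputs only.
gen byte or_ab   = cond(a==1 | b==1, 1, cond(a==0 & b==0, 0, .))
gen byte or_bc   = cond(b==1 | c==1, 1, cond(b==0 & c==0, 0, .))
gen byte or_ab_c = cond(or_ab==1 | c==1, 1, cond(or_ab==0 & c==0, 0, .))
gen byte or_a_bc = cond(a==1 | or_bc==1, 1, cond(a==0 & or_bc==0, 0, .))

* Fact 2: OR is associative.
assert or_ab_c == or_a_bc

* Three-valued NOT: 1 - x leaves missing as missing.
gen byte na = 1 - a
gen byte nb = 1 - b
gen byte and_na_nb = cond(na==0 | nb==0, 0, cond(na==1 & nb==1, 1, .))

* Fact 6 (DeMorgan): NOT(a OR b) equals (NOT a) AND (NOT b).
assert 1 - or_ab == and_na_nb

* Fact 3: NOT is self-inverse.
assert 1 - na == a

* Stata's own ~ maps missing to 0, so ~(~a) is 1 where a is missing.
count if ~(~a) ~= a
```

The final count reports the 9 observations in which a is missing; those are exactly the cases where Item 3 fails under Stata's existing operators.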
In particular, the fact that they are associative means that you can speak,
unambiguously, of the OR (or the AND) of several operands, just as you would
in two-valued logic, and just as you would speak of the sum of several
numbers. (You can also speak of the OR of a single operand, which is the
operand itself, or of no operands: you get the identity element for that
operation.)
Since you can consider a multitude of operands, there is a simple way to
view the behavior of the binary operations:
The OR result is TRUE if any operand is TRUE. It is FALSE if all operands
are FALSE. It is missing if there is at least one missing operand and
all other operands are either FALSE or missing. More succinctly, it asks,
are any operands TRUE?
The AND result is FALSE if any operand is FALSE. It is TRUE if all operands
are TRUE. It is missing if there is at least one missing operand and
all other operands are either TRUE or missing. More succinctly, it asks,
are all operands TRUE?
So how do we implement these? It is actually very cumbersome to write
a Stata expression that embodies these. For those of you who are curious,
the general form of a three-valued OR operation is...
(6)
gen byte r1 = a | b if /*
*/ (a~=. & b ~=.) | (a==. & b ~=. & b~=0) | (b==. & a ~=. & a~=0)
(AND is similar, though a bit less verbose.) That's just for 2 operands;
there's no sense in trying to go to more operands.
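For comparison, here is a hedged sketch that evaluates expression (6) on all nine combinations of two operands, alongside a cond() form of table (5). The cond() form is an assumption of this sketch: it handles only inputs coded 0, 1, or missing, whereas (6) also accepts other nonzero values as true.

```stata
* All nine combinations of two operands.
clear
input byte(a b)
0 0
0 1
0 .
1 0
1 1
1 .
. 0
. 1
. .
end

* Expression (6), unchanged.
gen byte r1 = a | b if /*
  */ (a~=. & b ~=.) | (a==. & b ~=. & b~=0) | (b==. & a ~=. & a~=0)

* Table (5) written with cond(), for 0/1/missing inputs only.
gen byte r_or  = cond(a==1 | b==1, 1, cond(a==0 & b==0, 0, .))
gen byte r_and = cond(a==0 | b==0, 0, cond(a==1 & b==1, 1, .))
gen byte r_not = 1 - a

* The two forms of OR agree on every combination.
assert r_or == r1
list
```

Either form has to be repeated and nested for each additional operand, so it does not scale well beyond two.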
After struggling with this a few times, it occurred to me that, rather than
trying to write expressions or sequences of gen and replace, I needed
a program. So I produced some programs, which subsequently evolved into
the following egen functions.
rtvor -- row-wise three-valued OR
rtvand -- row-wise three-valued AND
tvor -- column-wise three-valued OR
tvand -- column-wise three-valued AND
tvnot -- three-valued NOT
These are implemented in a set of five ado files (_grtvor, _grtvand, _gtvor,
_gtvand, _gtvnot) available on the IDEAS web page under the name TRINARY.
Note that for OR and AND, there are row- and column-wise versions. The
row-wise versions take zero or more variables. The column-wise versions gather
values of an expression, from one or more rows (taking a -by(varlist)- option).
There is only one type of NOT operation; that's all you can do with a unary
operation. They all adhere to the convention that, for input, 0 represents
false, any other nonmissing number represents true, and missing is its own
distinct category. For output, they yield 0, 1, and missing, and the default
type is byte.
Where would you use these? I use them to form indicator variables.
These can then be used in various kinds of analysis: summaries and tabulations
(including two-way), or more complex procedures: regressions, probits.
I would note that some people might ordinarily do a construct like...
(7) gen byte q = var1 == 3 & var2 == 5
where var1 and var2 are categorical variables that might be missing
or have values that signify missing. The above expression essentially
converts missing to false. The mean of q is the proportion of cases that
have the desired condition -- among ALL cases. This is erroneous; it is
really a lower bound on the desired proportion.
You could attempt to refine it with...
(8)
gen byte v1_3 = var1==3 if var1 ~=.
gen byte v2_5 = var2==5 if var2 ~=.
gen byte q = v1_3 & v2_5
(If var1 or var2 have actual values that signify missing, substitute them
for . in the -if- clauses.)
But this converts missing to true, which is also erroneous. Now the mean of
q would be an upper bound on the desired proportion. (Had we not included
the "if var1 ~=." and "if var2 ~=." clauses, it would have been equivalent to
(7).) As I mentioned earlier, you can drop all cases where any missing
operands appear:
(9) gen byte q = v1_3 & v2_5 if v1_3 ~=. & v2_5 ~=.
or, equivalently,
gen byte q = var1==3 & var2 == 5 if var2 ~=. & var1 ~=.
But, that kills many usable cases. It is the Draconian solution that I
described earlier. The appropriate treatment is to use three-valued logic:
(10) egen q = rtvand(v1_3 v2_5)
(Note that to do this we needed indicator variables with appropriate cases set
to missing, as spelled out in the first two lines of (8).)
Now, the mean of q has no erroneous contributions, and uses the maximal
set of contributing cases.
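The four constructs (7) through (10) can be compared side by side on a few observations. This is a sketch with made-up data; because the rtvand function requires the TRINARY files, the three-valued AND is written here with cond() as a stand-in (an assumption of this sketch, valid for 0/1/missing indicators). The names q7 through q10 are hypothetical.

```stata
clear
input byte(var1 var2)
3 5
3 .
4 .
. 5
. .
4 5
3 6
end

* Indicators with missing preserved, as in the first two lines of (8).
gen byte v1_3 = var1==3 if var1 ~=.
gen byte v2_5 = var2==5 if var2 ~=.

* (7): missing is converted to false.
gen byte q7 = var1 == 3 & var2 == 5

* (8): missing is converted to true.
gen byte q8 = v1_3 & v2_5

* (9): the Draconian solution.
gen byte q9 = v1_3 & v2_5 if v1_3 ~=. & v2_5 ~=.

* (10): three-valued AND, written with cond() in place of rtvand.
gen byte q10 = cond(v1_3==0 | v2_5==0, 0, cond(v1_3==1 & v2_5==1, 1, .))

list
summarize q7 q8 q9 q10
```

In the third observation var1 is known not to be 3, so the AND is false whatever var2 is. The three-valued result keeps that case (4 usable observations), while the Draconian version drops it (3 usable observations).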
Note however, that this can yield surprising results. Suppose you have
100 observations; a is missing in 10 cases and b is missing in 10 cases,
with 4 of those cases in common. Then, depending on how the values
align, the three-valued result of an OR or an AND can have from 4 to 16
missing values. So your result might have more missing values than either of
the operands. The good news is that this range has a MAXIMUM of 16. Had you
used the draconian method as in (9), then you would certainly get 16 missing
values.
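The two extremes of that range can be reproduced with constructed data. In this sketch (the layout of the missing values is an assumption chosen to match the counts above), a is missing in observations 1-10 and b in observations 7-16, so 4 observations are missing in both. The OR is again written with cond() for 0/1/missing inputs.

```stata
clear
set obs 100

* Case 1: every known value is TRUE, so OR is determined
* wherever at least one operand is known.
gen byte a = 1
gen byte b = 1
replace a = . in 1/10
replace b = . in 7/16
gen byte or_best = cond(a==1 | b==1, 1, cond(a==0 & b==0, 0, .))
count if or_best >= .

* Case 2: every known value is FALSE, so any missing operand
* leaves OR undetermined.
replace a = 0 if a < .
replace b = 0 if b < .
gen byte or_worst = cond(a==1 | b==1, 1, cond(a==0 & b==0, 0, .))
count if or_worst >= .
```

The first count is 4 (only the observations missing in both operands) and the second is 16 (every observation missing in either operand).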
You may want to know what the use of the column-wise versions is. Suppose
you had data on people in families. You want to know in each family, whether
there is a child (someone with age < 18).
(11)
gen byte child = age<18 if age~=.
egen childfam = tvor(child), by(familyid)
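If the TRINARY files are not installed, the column-wise result in (11) can be approximated with standard egen functions. The following is a sketch under stated assumptions: the data are made up, the helper variables anychild and nmiss are hypothetical names, and it assumes a release of Stata in which the egen function total() is available.

```stata
clear
input familyid age
1 40
1 12
2 35
2 .
3 50
3 45
4 .
4 9
end

gen byte child = age<18 if age~=.

* 1 if any member is known to be a child (egen max ignores missing).
egen byte anychild = max(child), by(familyid)

* Number of members whose child status is unknown.
egen byte nmiss = total(child >= .), by(familyid)

* TRUE if any known child; otherwise missing if anyone is unknown; else FALSE.
gen byte childfam = cond(anychild==1, 1, cond(nmiss > 0, ., 0))

list
```

Family 2 has one adult and one member of unknown age, so the result is missing; family 4 has a known child, so the unknown age does not matter.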
----
Having established this, I want to mention that these operations are sometimes
too restrictive. In fact I call them the conservative AND and OR operations.
There are other possibly useful operations: liberal AND and OR. This scheme
eliminates the annoying property that I just mentioned -- that, even though
you are potentially doing better than the Draconian solution, you still can
get more missings than either of your operands. Sometimes you want a more
liberal protocol. You want to take the result from all the available
data, ignoring the missings. This notion makes more sense for a multitude
of operands, but is formally defined for two, and extended to many. To
me it makes sense for OR; I have not yet found a need for a liberal AND,
but it can be defined, and therefore I have programmed it. You are
essentially using the same principles found in egen-rsum or egen-rmean: take
a result from all available data; yield missing only if all inputs are
missing.
(Actually, if you restrict inputs to 0, 1, and missing (.),
then liberal AND is equivalent to the Stata min function, and liberal OR is
equivalent to the Stata max function. There are column-wise versions, too,
which, under this restriction, are equivalent to egen-min and egen-max.)
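A short sketch of that equivalence, for inputs restricted to 0, 1, and missing: Stata's max() and min() ignore missing arguments unless all arguments are missing, which is exactly the liberal protocol. The conservative OR is included, written with cond(), for comparison; the variable names are hypothetical.

```stata
clear
input byte(a b)
0 0
0 1
0 .
1 0
1 1
1 .
. 0
. 1
. .
end

* Liberal operations: take the result from whatever data are available.
gen byte orlib  = max(a, b)
gen byte andlib = min(a, b)

* Conservative OR for comparison (0/1/missing inputs only).
gen byte or_cons = cond(a==1 | b==1, 1, cond(a==0 & b==0, 0, .))

list
```

The liberal and conservative OR differ only where one operand is 0 and the other is missing: the liberal result is 0, the conservative result is missing.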
Another way to look at it is, you are summarizing all available data; AND
lets FALSE take priority; OR lets TRUE take priority.
But this is a different protocol that has a different application. The
conservative operations are for where the operands have specific roles
in the construct. The liberal operations are good for when you have multiple
instances of similar (and presumably correlated) data and you want to
condense them into one.
An example of a use for conservative operations:
(12)
egen byte noramp = tvnot(hramp)
egen byte uramp = rtvand(nramp noramp)
This is an example from my work with the American Housing Survey. Here,
hramp indicates that the housing unit has a ramp. nramp indicates that the
occupant needs a ramp. The result uramp indicates an unmet need for a ramp.
An example of a use for liberal operations:
(13) egen disab = rtvor(disab82 disab83 disab84), liberal
Here, disab82 disab83 disab84 indicate disability in years 82, 83, & 84.
You want to boil this down to whether there is any indication of a disability
in any of those years. You use the liberal operation because you view the
multitude of inputs as just more data. You also expect high correlation;
if an operand is missing, its actual value is likely to be the same as that of
the other operands. Note that in a case where one operand is TRUE, you would
get the same result using conservative OR. The difference is in cases where
some operands are FALSE and some are missing. The conservative protocol says
that you can't make a determination. The liberal protocol says, all the
known values indicate false, and in this situation, that's good enough to
make a determination.
The tables for the liberal operations are...
(14)
 ORlib | 0  1  .      ANDlib | 0  1  .
 ------+---------     -------+---------
     0 | 0  1  0           0 | 0  0  0
     1 | 1  1  1           1 | 0  1  1
     . | 0  1  .           . | 0  1  .
Some basic facts about these operators:
1: They are both commutative and associative.
2: Missing is the identity element for both operations.
3: DeMorgan's laws still hold.
4: OR distributes over ORlib
5: ORlib distributes over OR
6: AND distributes over ANDlib
7: ANDlib distributes over AND
These operations are implemented in my egen functions, using a -liberal- option
as demonstrated in (13). (The versions of the programs that include this
option were installed in IDEAS as of 3-16-2001.)
----
When I began this discussion, I stated that in a logical context, Stata treats
missing as true, and then I developed this in the context of operations --
mostly the binary operations. I would like to take another look at this
feature. Consider this:
(15) summ var3 if a
and suppose that a has missing values. Those cases will be included in the
summary. Perhaps you meant...
(16) summ var3 if a & a~=.
or, if all your "true" values are coded 1,
summ var3 if a ==1
This should alert you to the notion that a logical entity may divide the
population into two classes or into three classes, depending on how it is
used -- or it may divide a population into three classes and you ignore the
third class. So you need to be careful about what you are doing. Suppose a
is a logical (0/1) variable.
(17) tab var3 a
(18) tab var3 if a
tab var3 if ~a
Ostensibly, you might expect that the two tables in (18) are just like
the two columns of the table in (17). But if a contains missings, then
there is a discrepancy; the first table of (18) includes those missings.
Of course, you can rewrite (18), but I find the discrepancy troublesome.
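The discrepancy is easy to see with counts. In this sketch (five made-up observations), a takes the values 1, 1, 0, and two missings.

```stata
clear
input byte a
1
1
0
.
.
end

* Controlling use: missing is taken as true.
count if a
count if ~a

* Explicit comparisons: missing falls into neither class.
count if a == 1
count if a == 0
```

The first count is 4 (two trues plus two missings) and the second is 1, so the -if- forms split the data into two classes. The comparisons give 2 and 1, leaving the two missings as a third class.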
This leads into a notion that you may or may not have thought about,
but you probably have understood. There are two distinct situations in
which a logical entity can appear. (And this is about programming in general,
not just about Stata.) I call these assigned and controlling.
An assigned logical entity is the kind we spoke about earlier, such as q in
(10). You compute it and assign it to a variable for later reference or
analysis. A controlling logical entity is one that is the controlling
expression in a programming construct such as -if- or -while- or the Stata
-if- clause.
Of course, they are in a sense, interchangeable; you can compute a logical
value, store it and later use it in a controlling context, or you can use
an -if- statement to compute a value that you then assign. But the foregoing
discussion should alert you to how different they are at "usage" time.
You can store any value. But what does it mean to have a missing
value in a controlling context? At run time, something must happen. Stata's
approach was to take it as true. They could have also made it equivalent
to false, but that's no more correct than the other. My own inclination
is that it should be an error condition. I think that if you have a missing
value in a controlling context, then either you made a mistake (and Stata
just goes on without any mention of it), or you are playing tricks, taking
advantage of a peculiar feature. (I am guilty of doing that on one occasion.)
In the latter case, you could have done a better job of coding the variable.
(At the NASUG presentation, Bill Gould objected to this suggestion. He
agreed that there is a problem, but preferred a different approach.
I believe that his idea would be that controlling expressions should be taken
as "known to be true" -- true and not missing. Accordingly,
tab var3 if a
would catch only the true cases. If you had a three-valued NOT operation, then
tab var3 if NOT a
would catch only the false cases. The missings would be excluded from both.)
Other items to consider include what to do with the cond function.
You could...
(a) have it raise an error condition if the controlling expression is
missing, or
(b) have it yield missing if the controlling expression is missing, or
(c) give it an optional fourth section to indicate what to yield if
the controlling expression is missing. If the fourth section is not
present, then you would need to implement (a) or (b) in those cases.
Another issue to consider is how to treat the assert command, though I
have not prepared any suggestions for this.
Here are some other items for my wish list.
The three-valued operations should be built in to Stata. You would want,
at this point, separate operator symbols (such as "or", "and" and "not"),
rather than changing the behavior of existing symbols. (It was mentioned
at the meeting that "or" has a problem in that it is already in use in
several commands, signifying Odds Ratio. One solution would be to use
capital letters: OR, AND, NOT.)
You might or might not want to implement the liberal versions ("orlib",
"andlib"). For column-wise operations, you would still need the egen
functions.
As an alternative, you could introduce a Boolean type, as in Pascal.
Then you could overload the & | and ~ symbols. If their operands are
all booleans, then they would perform three-valued operations.
In either case, this would enable three-valued operations in complex
expressions. Note that in the present state of Stata, for an
expression such as
(19) a & (c | d)
the three-valued version could only be built up in stages, using my
egen functions. The above suggestion would enable this expression (or some
equivalent, such as "a and (c or d)") to be evaluated directly -- either to
be assigned or to be used as a controlling expression. You would also be
able to use them in scalar assignments (in case you should have such a need).
The egen functions only apply to variables.
Finally, I would prefer that the relational operators respected missings.
I suspect that it is too late to change this, as some of us may make use of
the fact that missing is the maximal value in relational tests (I have done
this with an "ending date" of a time period, where missing signified
"no end yet") but it may be instructive to think about it. I think it would
have been more sensible to have expressions such as
(20) 3 < .
(21) 3 > .
to yield missing. Recall (11)
gen byte child = age<18 if age~=.
The -if- clause here would then be unnecessary.
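For reference, this sketch shows how the relational operators currently behave with missing, which is what makes the -if- clause in (11) necessary. The three ages are made-up values and child_raw is a hypothetical name.

```stata
* Missing is treated as larger than any number.
display 3 < .
display 3 > .

clear
input age
10
30
.
end

* Without the -if- clause, a missing age is classified as "not a child".
gen byte child_raw = age<18

* With the -if- clause, as in (11), the result stays missing.
gen byte child = age<18 if age~=.

list
```

The two display commands show 1 and 0. In the third observation child_raw is 0 while child is missing.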
If this were to be done, then you need to address the issue about
testing for equality with missing. That's one you might want to allow.
(But what about . <= . ?) Or to be consistent, you might want all relational
operations to yield missing if any operand is missing. Then you would need
a separate way to test whether something is missing. For example, a missing
function:
(22) if missing(b) ...
(23) gen byte q = missing(b)
or maybe a different operator such as a triple equal sign:
(24) if b===. ...
----
I wanted to include a discussion of DeMorgan's laws, but I realize that is
off the topic and there was no time for it at the presentation.
The essential idea is that AND and OR are automorphisms of each other,
with NOT being the mapping between them. Furthermore, all the binary logical
operations discussed here (including the Liberal and Draconian versions) are
really automorphisms of the same one operation.
----
Conclusion
I imagine that the issues discussed here are simple compared to most of
your concerns regarding statistical analysis and the development of Stata.
But I find them to be fundamental and they were bypassed when Stata was
initially developed. I feel they demand serious attention.