Three Valued Logic in Stata
3-8-2001, Rev 3-11-2001, 3-21-2001

David Kantor, Institute for Policy Studies
Johns Hopkins University, Baltimore MD

Presented at the North American Stata User Group (NASUG) meeting, Boston, MA, March 12 & 13, 2001

The following contains some edits made after the presentation at NASUG.

----

This discussion is about logical, or "Boolean", values in expressions. As you know, Stata does not have a Boolean type; it uses numbers to do the job, as do some general-purpose programming languages such as C. In a general-purpose programming context, the use of numbers as Boolean quantities is usually satisfactory, but in Stata (or any data analysis context) you have a complication that doesn't usually show up in other programming contexts: missing values.

In using Stata over the past few years I often found myself constructing logical expressions that build up from basic elements, with varying degrees of complexity, for example...

    (1)  a | b
    (2)  a | b | c | d | e
    (3)  a & (~b)

Sometimes the variables involved would include missing values, and it seemed that something was wrong in the way Stata handled these. In this context, Stata considers missing values as equivalent to true. Thus, in examples 1 and 2, if any operand is missing, the result is 1 (true). In example 3, a missing a is taken as true, and a missing b is taken as true before it is negated, so the result is decided as though the missing value were known. This didn't seem right; nor would it be right if missing were taken as 0 (false) instead. It seemed that these situations yielded erroneous results and ought to be excluded. In other words, the resulting value ought to be missing. Thus, one might be tempted to write, to evaluate and assign the expression in (2)...

    (4)  gen byte r3 = a | b | c | d | e if (a~=. & b~=. & c~=. & d~=. & e~=.)

(Note that the expression after the "if" does not include missing values in its logical operands; it does not present a problem of the type we are concerned with.)

But actually, this is too severe; I call it the Draconian solution. It is not necessary to yield missing every time that any operand is missing.
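
The treatment of missing operands described above can be checked on a few hand-entered observations. The following is a minimal sketch; the toy values and the result names r1 and r4 are made up for illustration, and (4) is cut down to two operands.

```stata
clear
input a b
0 0
0 1
0 .
. 1
. .
end

* Stata's built-in OR counts a missing operand as true
gen byte r1 = a | b

* the Draconian version, as in (4) but with two operands:
* the result is missing whenever any operand is missing
gen byte r4 = a | b if (a~=. & b~=.)

list a b r1 r4

* r1 is 1 in every row that has a missing operand, even where the other is 0
assert r1 == 1 if a == . | b == .
* r4 is missing in exactly those rows
assert r4 == . if a == . | b == .
```

In the third observation a is known to be false and b is unknown, so the 1 in r1 is not justified. In the fourth observation b is known to be true, so the missing value in r4 discards a case whose answer could have been determined.
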
Because of a certain property of the binary logical operators (AND and OR), it is, in some situations, possible to get a definite result even if some operands are missing. Thus we want to extend the traditional operators to take advantage of this. The resulting system of values and operators is known as Three-Valued Logic.

Three-Valued Logic is a small piece of a larger subject known as deviant logics -- systems of logic other than the classical two-valued system. The name Two-Valued Logic refers to the two possible logic values: True and False. Now we are introducing a third value: Missing (or Unknown or Maybe).

The people who are involved in the world of deviant logic are usually concerned with more philosophical issues, such as the possibilities of statements that are neither true nor false, or of different degrees of truth. Here, we have a more practical approach. For now, we take the position that the statements or the variables in question are either true or false, but sometimes we have the misfortune of not knowing which it is. This is the most conventional form of deviant logic; it almost doesn't qualify as deviant.

As I mentioned above, the binary logical operators have a special property that enables us to get nonmissing results from operands that are missing (in some situations). This property is that a particular operand value "flattens" the resulting value.

    OR:  one true operand yields true regardless of the value of the other.
    AND: one false operand yields false regardless of the value of the other.

Thus, when evaluating an OR operation, if one operand is true, you don't need to care what the other one is. The resulting value is TRUE regardless of whether that other operand is TRUE or FALSE. Under our assumption of what a missing Boolean value is, you would also take it that the result is TRUE, even if that other operand is missing. A similar situation shows up for AND with a FALSE operand.
(Note that this is the basis of short-circuit Boolean evaluation -- a feature often used by compilers and interpreters. The resulting value of a Boolean expression can possibly be determined before the whole expression is evaluated. Then the remainder of the evaluation is skipped.)

With this in mind, you can extend the operators to include missing as an operand value, yielding nonmissing results whenever this flattening property makes it possible. Another way to look at it is to ask, "what if that missing value were true? -- what if it were false? -- would the result be unambiguous?" When you do this, you get the following operator tables. (0 and 1 are used for representing false and true, and the binary operators are written as two-way tables. We also extend the unary NOT operator.)

    (5)   OR | 0 1 .      AND | 0 1 .       x | NOT x
          ---+------      ----+------      ---+------
           0 | 0 1 .        0 | 0 0 0       0 |   1
           1 | 1 1 1        1 | 0 1 .       1 |   0
           . | . 1 .        . | 0 . .       . |   .

Some basic facts about these operators:

    1: AND and OR are both commutative.
    2: AND and OR are both associative.
    3: NOT is a self-inverse.
    4: 1 (TRUE) is the identity element for AND.
    5: 0 (FALSE) is the identity element for OR.
    6: DeMorgan's laws still hold.
    7: AND distributes over OR.
    8: OR distributes over AND.

I will not prove these presently. You can do them as exercises. (I actually demonstrated some of them for my own benefit using the Stata egen programs that I will mention shortly.) All of these are properties of two-valued logic that are preserved when you move to three-valued logic.

(Under Stata's existing logical operators (&, |, ~), Item 3 fails when missing values are present (or actually, when anything besides 0 and 1 is used). The others still hold, but I disagree with the resulting values in some cases.)
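
Several of the facts listed above can be checked by brute force, since three operands have only 27 combinations of 0, 1, and missing. The following is one possible sketch in plain Stata. The helper programs tvor2 and tvand2 and all variable names are hypothetical, written only for this illustration, and operands are assumed to be coded strictly 0, 1, or missing.

```stata
clear
set obs 27

* enumerate every (a, b, c) over {0, 1, missing}; 2 is a placeholder for missing
gen byte a = mod(_n - 1, 3)
gen byte b = mod(floor((_n - 1) / 3), 3)
gen byte c = floor((_n - 1) / 9)
foreach v of varlist a b c {
    replace `v' = . if `v' == 2
}

* hypothetical two-operand helpers that follow the OR and AND tables in (5)
capture program drop tvor2
program define tvor2
    args new x y
    gen byte `new' = cond(`x' == 1 | `y' == 1, 1, cond(`x' == 0 & `y' == 0, 0, .))
end

capture program drop tvand2
program define tvand2
    args new x y
    gen byte `new' = cond(`x' == 0 | `y' == 0, 0, cond(`x' == 1 & `y' == 1, 1, .))
end

* fact 2, associativity of OR: (a OR b) OR c equals a OR (b OR c)
tvor2 ab a b
tvor2 ab_c ab c
tvor2 bc b c
tvor2 a_bc a bc
assert ab_c == a_bc

* fact 6, DeMorgan: NOT (a OR b) equals (NOT a) AND (NOT b); NOT x is 1 - x
gen byte na = 1 - a
gen byte nb = 1 - b
gen byte not_ab = 1 - ab
tvand2 na_and_nb na nb
assert not_ab == na_and_nb

* fact 7, AND distributes over OR: a AND (b OR c) equals (a AND b) OR (a AND c)
tvand2 lhs a bc
tvand2 a_and_b a b
tvand2 a_and_c a c
tvor2 rhs a_and_b a_and_c
assert lhs == rhs
```

Each assert compares all 27 observations at once, and Stata's == treats two missing values as equal, so a silent run means the identity holds for every combination, including those where both sides are missing. The remaining facts can be checked in the same way.
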
In particular, the fact that AND and OR are associative means that you can speak, unambiguously, of the OR (or the AND) of several operands, just as you would in two-valued logic, and just as you would speak of the sum of several numbers. (You can also speak of the OR of a single operand, which is the operand itself, or of no operands: you get the identity element for that operation.)

Since you can consider a multitude of operands, there is a simple way to view the behavior of the binary operations:

The OR result is TRUE if any operand is TRUE. It is FALSE if all operands are FALSE. It is missing if there is at least one missing operand and all other operands are either FALSE or missing. More succinctly, it asks, are any operands TRUE?

The AND result is FALSE if any operand is FALSE. It is TRUE if all operands are TRUE. It is missing if there is at least one missing operand and all other operands are either TRUE or missing. More succinctly, it asks, are all operands TRUE?

So how do we implement these? It is actually very cumbersome to write a Stata expression that embodies these. For those of you who are curious, the general form of a three-valued OR operation is...

    (6)  gen byte r1 = a | b if /*
           */ (a~=. & b~=.) | (a==. & b~=. & b~=0) | (b==. & a~=. & a~=0)

(AND is similar, though a bit less verbose.) That's just for 2 operands; there's no sense in trying to go to more operands.

After struggling with this a few times, it occurred to me that, rather than trying to write expressions or sequences of gen and replace, I needed a program. So I produced some programs, which subsequently evolved into the following egen functions.

    rtvor  -- row-wise three-valued OR
    rtvand -- row-wise three-valued AND
    tvor   -- column-wise three-valued OR
    tvand  -- column-wise three-valued AND
    tvnot  -- three-valued NOT

These are implemented in a set of five ado files (_grtvor, _grtvand, _gtvor, _gtvand, _gtvnot) available on the IDEAS web page under the name TRINARY.
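
The "any TRUE, all FALSE, otherwise missing" description above suggests a direct way to compute a row-wise three-valued OR over any number of operands. The following is only a sketch of that rule written with gen and replace; it is not the code of the rtvor function, and the names anytrue, anymiss, and r and the toy data are invented for illustration.

```stata
clear
input a b c
0 0 0
0 . 1
0 . 0
. . .
1 1 1
end

* scan the operands once, noting whether any is true and whether any is missing
gen byte anytrue = 0
gen byte anymiss = 0
foreach v of varlist a b c {
    * true means nonzero and nonmissing
    replace anytrue = 1 if `v' != 0 & `v' < .
    replace anymiss = 1 if `v' >= .
}

* TRUE if any operand is true; otherwise missing if any is missing; otherwise FALSE
gen byte r = cond(anytrue, 1, cond(anymiss, ., 0))

list a b c r
```

The second observation yields 1 despite its missing operand, because one true operand settles the OR. The third yields missing, because the unknown operand could still make the result true. A row-wise AND follows the same pattern with the roles of TRUE and FALSE exchanged.
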
Note that for OR and AND, there are row- and column-wise versions. The row-wise versions take zero or more variables. The column-wise versions gather values of an expression, from one or more rows (taking a -by(varlist)- option). There is only one type of NOT operation; that's all you can do with a unary operation.

They all adhere to the convention that, for input, 0 represents false, any other nonmissing number represents true, and missing is its own distinct category. For output, they yield 0, 1, and missing, and the default type is byte.

Where would you use these? I use them to form indicator variables. These can then be used in various kinds of analysis: summaries and tabulations (including two-way), or more complex procedures: regressions, probits.

I would note that some people might ordinarily do a construct like...

    (7)  gen byte q = var1 == 3 & var2 == 5

where var1 and var2 are categorical variables that might be missing or have values that signify missing. The above expression essentially converts missing to false. The mean of q is the proportion of cases that have the desired condition -- among ALL cases. This is erroneous; it is really a lower bound on the desired proportion. You could attempt to refine it with...

    (8)  gen byte v1_3 = var1==3 if var1 ~=.
         gen byte v2_5 = var2==5 if var2 ~=.
         gen byte q = v1_3 & v2_5

(If var1 or var2 has actual values that signify missing, substitute them for . in the -if- clauses.)

But this converts missing to true, which is also erroneous. Now the mean of q would be an upper bound on the desired proportion. (Had we not included the "if var1 ~=." and "if var2 ~=." clauses, it would have been equivalent to (7).)

As I mentioned earlier, you can drop all cases where any missing operands appear:

    (9)  gen byte q = v1_3 & v2_5 if v1_3 ~=. & v2_5 ~=.

or, equivalently,

         gen byte q = var1==3 & var2 == 5 if var2 ~=. & var1 ~=.

But that kills many usable cases. It is the Draconian solution that I described earlier.
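
The constructs in (7), (8), and (9) can be compared side by side on a toy dataset. This is a sketch with invented values; the names q7, q8, q9, and qtv are hypothetical, and qtv writes out the AND table from (5) with cond() purely for comparison.

```stata
clear
input var1 var2
3 5
3 .
. 5
3 4
. 4
. .
end

* (7): missing operands end up as false
gen byte q7 = var1 == 3 & var2 == 5

* (8): missing operands end up as true
gen byte v1_3 = var1 == 3 if var1 ~= .
gen byte v2_5 = var2 == 5 if var2 ~= .
gen byte q8 = v1_3 & v2_5

* (9): Draconian, missing if any operand is missing
gen byte q9 = v1_3 & v2_5 if v1_3 ~= . & v2_5 ~= .

* three-valued AND taken from table (5), for comparison
gen byte qtv = cond(v1_3 == 0 | v2_5 == 0, 0, cond(v1_3 == 1 & v2_5 == 1, 1, .))

list var1 var2 q7 q8 q9 qtv
summarize q7 q8 q9 qtv
```

In these data q7 is 1 in one of six observations and q8 is 1 in four of six, which brackets the answer from below and above. q9 keeps only the two fully observed cases. In the fifth observation var2 is known not to be 5, so the AND is false whatever var1 may be; qtv keeps that case while q9 drops it.
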
The appropriate treatment is to use three-valued logic:

    (10) egen q = rtvand(v1_3 v2_5)

(Note that to do this we needed indicator variables with appropriate cases set to missing, as spelled out in the first two lines of (8).)

Now, the mean of q has no erroneous contributions, and uses the maximal set of contributing cases.

Note, however, that this can yield surprising results. Suppose you have 100 observations; a is missing in 10 cases and b is missing in 10 cases, with 4 of those cases in common. Then, depending on how the values align, the three-valued result of an OR or an AND can have from 4 to 16 missing values. So your result might have more missing values than either of the operands. The good news is that this range has a MAXIMUM of 16. Had you used the Draconian method as in (9), then you would certainly get 16 missing values.

You may want to know what the use of the column-wise versions is. Suppose you had data on people in families. You want to know, in each family, whether there is a child (someone with age < 18).

    (11) gen byte child = age<18 if age~=.
         egen childfam = tvor(child), by(familyid)

----

Having established this, I want to mention that these operations are sometimes too restrictive. In fact I call them the conservative AND and OR operations. There are other possibly useful operations: liberal AND and OR. This scheme eliminates the annoying property that I just mentioned -- that, even though you are potentially doing better than the Draconian solution, you still can get more missings than either of your operands.

Sometimes you want a more liberal protocol. You want to take the result from all the available data, ignoring the missings. This notion makes more sense for a multitude of operands, but is formally defined for two, and extended to many. To me it makes sense for OR; I have not yet found a need for a liberal AND, but it can be defined, and therefore I have programmed it.
You are essentially using the same principles found in egen-rsum or egen-rmean: take a result from all available data; yield missing only if all inputs are missing. (Actually, if you restrict inputs to 0, 1, and ., then liberal AND is equivalent to the Stata min function, and liberal OR is equivalent to the Stata max function. There are column-wise versions, too, which, under this restriction, are equivalent to egen-min and egen-max.)

Another way to look at it is, you are summarizing all available data; AND lets FALSE take priority; OR lets TRUE take priority.

But this is a different protocol that has a different application. The conservative operations are for where the operands have specific roles in the construct. The liberal operations are good for when you have multiple instances of similar (and presumably correlated) data and you want to condense them into one.

An example of a use for conservative operations:

    (12) egen byte noramp = tvnot(hramp)
         egen byte uramp = rtvand(nramp noramp)

This is an example from my work with the American Housing Survey. Here, hramp indicates that the housing unit has a ramp. nramp indicates that the occupant needs a ramp. The result uramp indicates an unmet need for a ramp.

An example of a use for liberal operations:

    (13) egen disab = rtvor(disab82 disab83 disab84), liberal

Here, disab82 disab83 disab84 indicate disability in years 82, 83, & 84. You want to boil this down to whether there is any indication of a disability in any of those years. You use the liberal operation because you view the multitude of inputs as just more data. You also expect high correlation; if an operand is missing, its actual value is likely to be the same as that of the other operands.

Note that in a case where one operand is TRUE, you would get the same result using conservative OR. The difference is in cases where some operands are FALSE and some are missing. The conservative protocol says that you can't make a determination.
The liberal protocol says, all the known values indicate false, and in this situation, that's good enough to make a determination.

The tables for the liberal operations are...

    (14)  ORlib | 0 1 .      ANDlib | 0 1 .
          ------+------      -------+------
              0 | 0 1 0           0 | 0 0 0
              1 | 1 1 1           1 | 0 1 1
              . | 0 1 .           . | 0 1 .

Some basic facts about these operators:

    1: They are both commutative and associative.
    2: Missing is the identity element for both operations.
    3: DeMorgan's laws still hold.
    4: OR distributes over ORlib.
    5: ORlib distributes over OR.
    6: AND distributes over ANDlib.
    7: ANDlib distributes over AND.

These operators are implemented in my egen functions, using a -liberal- option as demonstrated in (13). (The versions of the programs that include this option were installed in IDEAS as of 3-16-2001.)

----

When I began this discussion, I stated that in a logical context, Stata treats missing as true, and then I developed this in the context of operations -- mostly the binary operations. I would like to take another look at this feature.

Consider this:

    (15) summ var3 if a

and suppose that a has missing values. Those cases will be included in the summary. Perhaps you meant...

    (16) summ var3 if a & a~=.

or, if all your "true" values are coded 1,

         summ var3 if a==1

This should alert you to the notion that a logical entity may divide the population into two classes or into three classes, depending on how it is used -- or it may divide a population into three classes and you ignore the third class. So you need to be careful about what you are doing.

Suppose a is a logical (0/1) variable.

    (17) tab var3 a

    (18) tab var3 if a
         tab var3 if ~a

Ostensibly, you might expect that the two tables in (18) are just like the two columns of the table in (17). But if a contains missings, then there is a discrepancy; the first table of (18) includes those missings. Of course, you can rewrite (18), but I find the discrepancy troublesome.
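
The discrepancy between (17) and (18) shows up even with three observations. The following is a small sketch with invented data that uses count in place of tab so that the number of selected observations can be checked with assert.

```stata
clear
input a var3
1 10
0 20
. 30
end

* the two-way table leaves out the observation where a is missing
tabulate var3 a

* but -if a- lets it through, because missing counts as true
count if a
assert r(N) == 2

* -if ~a- selects only the observation where a is 0
count if ~a
assert r(N) == 1

* the conditions from (16) select only the known-true observation
count if a & a ~= .
assert r(N) == 1
count if a == 1
assert r(N) == 1
```

The -if a- and -if ~a- conditions together account for all three observations, while the table in the style of (17) shows only two of them.
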
This leads into a notion that you may or may not have thought about, but you probably have understood. There are two distinct situations in which a logical entity can appear. (And this is about programming in general, not just about Stata.) I call these assigned and controlling.

An assigned logical entity is the kind we spoke about earlier, such as q in (10). You compute it and assign it to a variable for later reference or analysis.

A controlling logical entity is one that is the controlling expression in a programming construct such as -if- or -while- or the Stata -if- clause.

Of course, they are, in a sense, interchangeable; you can compute a logical value, store it and later use it in a controlling context, or you can use an -if- statement to compute a value that you then assign. But the foregoing discussion should alert you to how different they are at "usage" time. You can store any value. But what does it mean to have a missing value in a controlling context? At run time, something must happen. Stata's approach was to take it as true. They could have also made it equivalent to false, but that's no more correct than the other.

My own inclination is that it should be an error condition. I think that if you have a missing value in a controlling context, then either you made a mistake (and Stata just goes on without any mention of it), or you are playing tricks, taking advantage of a peculiar feature. (I am guilty of doing that on one occasion.) In the latter case, you could have done a better job of coding the variable.

(At the NASUG presentation, Bill Gould objected to this suggestion. He agreed that there is a problem, but preferred a different approach. I believe that his idea would be that controlling expressions should be taken as "known to be true" -- true and not missing. Accordingly,

         tab var3 if a

would catch only the true cases. If you had a three-valued NOT operation, then

         tab var3 if NOT a

would catch only the false cases.
The missings would be excluded from both.)

Other items to consider include what to do with the cond function. You could...

    (a) have it raise an error condition if the controlling expression is missing, or
    (b) have it yield missing if the controlling expression is missing, or
    (c) give it an optional fourth argument to indicate what to yield if the controlling expression is missing. If the fourth argument is not present, then you would need to implement (a) or (b) in those cases.

Another issue to consider is how to treat the assert command, though I have not prepared any suggestions for this.

Here are some other items for my wish list.

The three-valued operations should be built into Stata. You would want, at this point, separate operator symbols (such as "or", "and" and "not"), rather than changing the behavior of existing symbols. (It was mentioned at the meeting that "or" has a problem in that it is already in use in several commands, signifying Odds Ratio. One solution would be to use capital letters: OR, AND, NOT.) You might or might not want to implement the liberal versions ("orlib", "andlib"). For column-wise operations, you would still need the egen functions.

As an alternative, you could introduce a Boolean type, as in Pascal. Then you could overload the & | and ~ symbols. If their operands are all Booleans, then they would perform three-valued operations.

In either case, this would enable three-valued operations in complex expressions. Note that in the present state of Stata, for an expression such as

    (19) a & (c | d)

the three-valued version could only be built up in stages, using my egen functions. The above suggestion would enable this expression (or some equivalent, such as "a and (c or d)") to be evaluated directly -- either to be assigned or to be used as a controlling expression. You would also be able to use them in scalar assignments (in case you should have such a need). The egen functions only apply to variables.
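
Building the three-valued version of (19) in stages might look like the following sketch. It writes each stage with cond(), following the tables in (5), so that it runs without any add-on programs; it is not the egen implementation. The data and the names c_or_d, r, and native are invented for illustration, and the operands are assumed to be coded strictly 0, 1, or missing.

```stata
clear
input a c d
1 0 .
0 . .
. 1 0
1 . 1
. 0 0
end

* stage 1: three-valued c OR d
gen byte c_or_d = cond(c == 1 | d == 1, 1, cond(c == 0 & d == 0, 0, .))

* stage 2: three-valued a AND (c OR d)
gen byte r = cond(a == 0 | c_or_d == 0, 0, cond(a == 1 & c_or_d == 1, 1, .))

* Stata's built-in operators on the same expression, for contrast
gen byte native = a & (c | d)

list a c d c_or_d r native
```

In the first and third observations the built-in operators report 1, while the staged three-valued result is missing, because an unknown operand could still change the outcome. In the second and fifth observations a single known false operand is enough to settle the result at 0.
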
Finally, I would prefer that the relational operators respected missings. I suspect that it is too late to change this, as some of us may make use of the fact that missing is the maximal value in relational tests (I have done this with an "ending date" of a time period, where missing signified "no end yet"), but it may be instructive to think about it. I think it would have been more sensible to have expressions such as

    (20) 3 < .
    (21) 3 > .

yield missing. Recall

    (11) gen byte child = age<18 if age~=.

The -if- clause here would then be unnecessary.

If this were to be done, then you would need to address the issue of testing for equality with missing. That's one you might want to allow. (But what about . <= . ?) Or, to be consistent, you might want all relational operations to yield missing if any operand is missing. Then you would need a separate way to test whether something is missing. For example, a missing function:

    (22) if missing(b) ...
    (23) gen byte q = missing(b)

or maybe a different operator such as a triple equal sign:

    (24) if b===. ...

----

I wanted to include a discussion of DeMorgan's laws, but I realize that it is off the topic and there would be no time for it at the presentation. The essential idea is that AND and OR are automorphisms of each other, with NOT being the mapping between them. Furthermore, all the binary logical operations discussed here (including the Liberal and Draconian versions) are really automorphisms of the same one operation.

----

Conclusion

I imagine that the issues discussed here are simple compared to most of your concerns regarding statistical analysis and the development of Stata. But I find them to be fundamental, and they were bypassed when Stata was initially developed. I feel they demand serious attention.