The Stata listserver

RE: Missing as true - Was: Re: st: RE: Another Stata feature


From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   RE: Missing as true - Was: Re: st: RE: Another Stata feature
Date   Thu, 8 Jan 2004 11:50:55 -0000

I think Allan still misses a real issue here, which arises 
from two basic principles followed all the way through. 

1. any self-respecting statistical program must allow 
representations of missing values; 

2. programmers and users alike find two-way logic 
much easier to manage, in total, than three-way logic. 

I guess no-one at all has any trouble with 1. 
The issue is 2. The nuance to note -- crucial here -- 
is "in total". 

As a matter of history, Stata decided that numeric missing 
should have non-zero representation (well, that's 
essential); it is regarded as very 
large positive; and it is regarded as true. (To regard 
missings as _either_ large positive _or_ large negative 
is essential for at least one purpose, sorting, as 
every observation must go somewhere.)  
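
A minimal sketch of those conventions, meant to be run as a do-file. 
The variable x and its values are invented purely for illustration. 

```stata
* Toy data, invented for illustration: one missing among ordinary values
clear
input x
1
.
0
-1
end

* Missing sorts after every nonmissing value
sort x
list

* Missing counts as greater than zero, and as true (non-zero)
count if x > 0    // counts the 1 and the missing
count if x        // counts every non-zero value, missing included
```

The last two commands are where "very large positive" and "true" 
show up in everyday use. 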

You can say that "missing => true" was  
a bad design decision; "missing => false" is 
presumably no better; so consider the alternative
of a three-way logic. 

In examples like 

. list if myvar > 0 

what the user wants, almost all the time, is 
to see positive values. When that user gets 
missings as well, the result is at best an 
irrelevant extra and at worst not what was 
wanted at all, and so a bug (although people 
seem to want to blame Stata for the bug, not 
their own not-quite-careful-enough programming). 
Let's all agree: even very experienced users can 
get bitten by this, as we temporarily forget the 
principles which Stata is following rigidly and 
rigorously. Me too. 
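
The careful version adds the "less than missing" clause that Bill 
Rising mentions in the quoted text. A short sketch, with toy values 
for -myvar- that are assumed only for illustration: 

```stata
* Toy data, invented for illustration
clear
input myvar
1
2
.
-1
0
end

* Shows the positive values and the missing
list if myvar > 0

* Shows the positive values only
list if myvar > 0 & myvar < .

* Same selection, perhaps easier to read
list if myvar > 0 & !missing(myvar)
```

The second and third forms select the same observations for a numeric 
variable, since every numeric missing value is at least as large as 
the system missing value. 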

Examples like this are indeed persuasive. One 
is tempted to say that Stata should be smart
enough to divide values of -myvar- into 
true, false and irrelevant (because missing) 
and show only those which are true. 
However, examples like this are not the point 
at all. The point is to consider the 
consequences of following such three-way logic 
all the way through; or to decide on when 
Stata should use three-way logic and when 
it should use two-way logic, and how in turn 
you explain that distinction. 

Let's suppose Stata could do this. It 
would ignore false _and_ "irrelevant"
and show only true, given that -list- 
command.  

Now suppose you want two conditions, in 
some combination, 

. list if myvar > 0 & yourvar > 0 

. list if myvar > 0 | yourvar > 0 

Now please, for this only slightly 
more complicated situation, 

1. fill in truth tables 

&                 true  false irrelevant  
true               
false
irrelevant 

|                 true  false irrelevant  
true               
false
irrelevant 

2. imagine working with such combinations 
for the rest of your Stata life.  

3. imagine explaining this, repeatedly, 
to other users, given that you are, 
probably, the local expert. 

No thanks! 
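
For anyone who wants to try the exercise, here is one possible way 
to fill in the tables, sketched in Stata. It assumes the convention 
usually credited to Kleene, which SQL also follows for NULL: a result 
is irrelevant unless the other operand settles it. This is an assumed 
convention among several, not anything Stata implements. The variables 
a and b and the coding (1 true, 0 false, missing irrelevant) are 
invented for illustration. 

```stata
* All nine combinations of true (1), false (0) and irrelevant (.)
clear
input a b
1 1
1 0
1 .
0 1
0 0
0 .
. 1
. 0
. .
end

* Three-way "and": false if either operand is false,
* otherwise irrelevant if either is irrelevant, otherwise true
generate k_and = cond(a == 0 | b == 0, 0, cond(missing(a, b), ., 1))

* Three-way "or": true if either operand is true,
* otherwise irrelevant if either is irrelevant, otherwise false
generate k_or = cond(a == 1 | b == 1, 1, cond(missing(a, b), ., 0))

list, noobs
```

Even under this convention, a command like -list- still needs a rule 
for what to do with an irrelevant result, which is a two-way decision 
once more. 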

Incidentally, I don't think Allan's political excursus 
explains what was supposedly a problem with 
string variables. 

Nick 
[email protected] 

Allan Reese
> 
> On Wed, 7 Jan 2004, Bill Rising wrote [with RAR's inserts]:
> > ..., it would make Stata
> > code easier to read and less prone to error if people could code the
> > [potentially RAR] incorrect
> >
> > regress foo bar if snafu
> >
> > instead of the [intended? RAR] correct
> >
> > regress foo bar if snafu & snafu < .
> >
> > for snafu being some sort of indicator which could be missing.
> >
> > I've used Stata long enough that the latter comes natural to me. Still,
> > I'd hate to see how many analyses have been found invalid because of
> > folks forgetting the extra 'less than missing' clause.
> 
> My point exactly.  It *is* documented, thus making it a feature, but who
> reads documentation?  Is it *known* to all members of this list? to all
> Stata users?  I doubt it.  It raises anomalies, as here:
> 
> . gen m = var1>0
> . gen l = var1<0
> . list var1 m l
>      | var1   m   l |
>   1. |    1   1   0 |
>   2. |    2   1   0 |
>   3. |    3   1   0 |
>   4. |    .   1   0 |
>   5. |   -1   0   1 |
>   6. |    0   0   0 |
> 
> Within the Stata language, "missing" is a positive number, but that is not
> a natural treatment of missing data.  In the same way that "replace" by
> default reports "n values changed", I suggest it would be more sporting to
> report "missing values used in calculation - check answers".
> 
> Since Nick insists I spell out the joke (?), we were told that the basis
> for invading Iraq was that wmd was definitely TRUE.  It subsequently turns
> out that the data were incomplete or inconclusive.  But if wmd>0 computes
> as TRUE for missing data, they can justify any political or management
> decision.
> 
> I have had similar exchanges on the discussion list devoted to spreadsheet
> use.  The techies say, "It's a documented feature, so everyone knows", and
> the managers say, "We got the answer from the computer, so it must be
> correct."  There is a wonderful area of computer science devoted to
> *proving* programs are correct; I've never seen evidence of an automated
> procedure that is capable of checking that the correct variable was named
> in an expression or that the correct operator was used.
> 
> History demonstrates that it is only after a sequence of disasters that
> "management" accept that systems should be self-checking and error
> avoiding.  Relying on people to "do the right thing" in all circumstances
> is a proven recipe for disasters.  WRT software, why can't we abridge the
> historic process?
