Statalist



Re: st: My last word on strange world


From   [email protected] (William Gould, StataCorp LP)
To   [email protected]
Subject   Re: st: My last word on strange world
Date   Fri, 11 Jan 2008 10:35:49 -0600

I apologize in advance for dragging on the conversation about missing values,
but my last post was too philosophical and I want to show a concrete 
example illustrating what I think the problem is with the other proposals, 
which is not to say there is no merit in the other proposals.

In response to my reference to the law of the excluded middle -- P or not-P is
true -- Jeph Herrin <[email protected]> replied, 

> Not at all. You are presuming that
> 
>     ![42 > (undefined)] == [42 !> (undefined)]

Yes I am.  Most programmers and computer users would assume the statements
!(x>y) and (x<=y) are equivalent.  If it is difficult to explain to users that
(income>10000) includes missing values, try explaining that !(x>y) and (x<=y)
are not the same statement when missing values are involved.

Svend Juul <[email protected]> wrote, 

> The decision made is perfectly logical, but the following alternative is
> equally logical and much more in line with the expectations of ordinary
> users:
>
> [...]
>  - let (x>100)  evaluate to false if x is missing
>  - let (x==100) evaluate to false if x is missing
>  - let (x<100)  evaluate to false if x is missing

It therefore follows

     x>=100      evaluates to false if x is missing

and thus the statements (x>=100) and (x<100) are both false when x is
missing.  As I asked about Jeph's comment, is that really in line with the
expectations of ordinary users?
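
The contrast can be sketched in a few lines of Python. This is only an
illustration, not Stata code: missing is represented by None, Stata's rule is
simplified to "a missing value behaves like +infinity" (as described in this
post), and the proposed rule is assumed to make any comparison with a missing
operand false. All function names are hypothetical.

```python
MISSING = None  # stand-in for Stata's missing value (an assumption of this sketch)

def _as_stata(x):
    """Simplified model of Stata's rule: missing sorts above every number."""
    return float("inf") if x is MISSING else x

def stata_ge(x, y):
    return _as_stata(x) >= _as_stata(y)

def stata_lt(x, y):
    return _as_stata(x) < _as_stata(y)

def proposed_ge(x, y):
    """Assumed reading of the proposal: false whenever an operand is missing."""
    if x is MISSING or y is MISSING:
        return False
    return x >= y

def proposed_lt(x, y):
    if x is MISSING or y is MISSING:
        return False
    return x < y

x = MISSING

# Under the infinity rule exactly one of the two statements is true.
stata_pair = (stata_ge(x, 100), stata_lt(x, 100))

# Under the proposal both statements are false at the same time.
proposed_pair = (proposed_ge(x, 100), proposed_lt(x, 100))
```

With the infinity rule, (x>=100) and (x<100) remain opposites even when x is
missing; under the assumed proposal they do not, so negating one no longer
gives the other.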

Let's look at an example.  Consider

         . keep if income>=baseincome & age>=minage 

versus 

         . drop if income<baseincome | age<minage 

Under the proposed change, different observations will appear in the resulting
data.  They differ because statements like income<baseincome and its
opposite income>=baseincome simultaneously evaluate to FALSE in the presence
of missing values, and -keep- and -drop- perform the opposite actions.

To appreciate how difficult this logic is to understand, try explaining 
to yourself exactly how the two datasets will differ.  Try answering
the following question:  Under what conditions will the second dataset
have more observations than the first?  I predict you will need to pull 
out pencil and paper, write down the 2^4 = 16 missing-value patterns, 
and think carefully about how each of the expressions would 
evaluate.  To do that, you need to know one more thing:  Under Svend's
proposal, x==. would evaluate to TRUE if x is missing, so assume x>=. would
evaluate to TRUE as well.
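
The pencil-and-paper exercise can also be done mechanically. The following
Python sketch enumerates the 16 missing-value patterns under one assumed
reading of the proposal: a comparison with exactly one missing operand is
false, and >= is true when both operands are missing (extending the x>=.
assumption above). The proposal does not spell out what happens when only the
right-hand side is missing, so that part is a guess. The nonmissing values
are arbitrary ones that satisfy both conditions, so any disagreement comes
from missingness alone. None stands in for missing and all names are
hypothetical.

```python
from itertools import product

MISSING = None  # stand-in for Stata's missing value (an assumption of this sketch)

def ge(x, y):
    """x >= y under the assumed proposal."""
    if x is MISSING and y is MISSING:
        return True   # assumed: missing >= missing is true, like x==. for missing x
    if x is MISSING or y is MISSING:
        return False  # assumed: exactly one missing operand makes it false
    return x >= y

def lt(x, y):
    """x < y under the assumed proposal: false if either operand is missing."""
    if x is MISSING or y is MISSING:
        return False
    return x < y

def kept_by_keep(income, baseincome, age, minage):
    # . keep if income>=baseincome & age>=minage
    return ge(income, baseincome) and ge(age, minage)

def kept_by_drop(income, baseincome, age, minage):
    # . drop if income<baseincome | age<minage
    return not (lt(income, baseincome) or lt(age, minage))

# Arbitrary nonmissing values for (income, baseincome, age, minage).
VALUES = (20000, 10000, 30, 18)

def patterns():
    """Yield all 2^4 = 16 combinations of missing/nonmissing for the 4 variables."""
    for mask in product((False, True), repeat=4):
        yield tuple(MISSING if m else v for m, v in zip(mask, VALUES))

only_keep = [p for p in patterns() if kept_by_keep(*p) and not kept_by_drop(*p)]
only_drop = [p for p in patterns() if kept_by_drop(*p) and not kept_by_keep(*p)]
```

Under these assumptions only_keep is empty and only_drop holds 12 patterns:
every pattern in which exactly one side of either comparison is missing
survives -drop- but not -keep-. A different reading of the proposal for the
one-sided cases could change these counts.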

I repeat what Joseph Coveney <[email protected]> already said:  each
implementation results in its own gotcha.  I admit that how Stata treats
missing values -- as if they were infinity -- has a gotcha, and I know both
Jeph and Svend understand that other proposals have a gotcha, too.  They
believe that the gotchas in other proposals will be easier for "ordinary"
users to understand.  The above is a counterexample.

The last time we at StataCorp thought carefully about this problem, we 
learned that it is not enough to write down the statements for which the
proposal better meets expectations.  You have to go searching for the
statements that do the unexpected and then ask, "How easy will that be to
explain to users?"

-- Bill
[email protected]


