Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: st: limitations of "generate" with missing data

From	Nick Cox <[email protected]>
To	[email protected]
Subject	Re: st: limitations of "generate" with missing data
Date	Tue, 12 Apr 2011 22:44:09 +0100

See also, or rather instead

FAQ     . . . . . . . . . . . . . . . . Logical expressions and missing values
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . W. Gould
        2/03    Why is x > 1000 true when x contains missing value?
                http://www.stata.com/support/faqs/data/values.html
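
A minimal sketch of the behaviour that FAQ addresses, assuming a toy variable x whose name and values are invented here for illustration: numeric missing counts as larger than any non-missing number, so x > 1000 evaluates to 1 when x is missing unless missing values are excluded explicitly.

```stata
* Toy data: two observed values and one missing (x is a made-up name)
clear
set obs 3
generate x = 500 in 1
replace x = 2000 in 2
* observation 3 is left as numeric missing (.)

* Naive indicator: missing is treated as larger than 1000
generate byte big = x > 1000

* Guarded indicator: left missing wherever x is missing
generate byte big2 = x > 1000 if !missing(x)

list x big big2
* big  is 0, 1, 1
* big2 is 0, 1, .
```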

On Tue, Apr 12, 2011 at 9:54 AM, Nick Cox <[email protected]> wrote:
> If we generalise to
>
> gen result = a > b
>
> and focus on -a-, -b- numeric (the comparison makes sense for strings
> too) then in a way it's reasonable to expect three possible answers,
> say 0 for false, 1 for true and ? for "can't tell; at least one
> argument is missing". I use ? for the sake of argument to detach the
> argument slightly from what Stata does at present.
>
> I've been at two users' meetings when talks have proposed three-way
> logic of this kind. The talk is sure-fire guaranteed to generate
> discussion about as long as the talk and to split the audience three
> ways, namely
>
> 1. Stata's two-way logic often bites -- most commonly perhaps in this
> case -- but you get used to it, mostly; it is too late to "fix" it,
> and no other solution is better.
>
> 2. There is a good case for a three-way logic but definitely not that
> proposed by the speaker, which is quite illogical.
>
> 3. The speaker is right and Stata is fundamentally flawed and should
> change its ways and documentation forthwith.
>
> I think the crunch is that although a different rule may make sense
> for at least some problems, the bigger difficulty is being consistent
> and having as few rules as possible and not introducing problems that
> are worse, and more difficult to understand. For example, and a very
> long article could be written about this, although I don't intend to
> do it:
>
> * Once ? is allowed as a logical result, then the truth tables need to
> be expanded for 0 & ?, 1 & ?, 0 | ?, etc., etc.
>
> * Once ? is allowed as a logical result, you need a rule on where it
> goes on sorting. (That need not be that ? is just numeric missing.)
>
> * Once ? is allowed as a logical result, what about ? + a, ? - a, ...
> log(?). Those are probably all easy but that won't stop users being
> puzzled by the results.
>
> * What about .a ... .z ???
>
> * Do you need new functions and operators?
>
> * If you change Stata, quite what is allowed under version control?
>
> I know that no-one is necessarily proposing _any_ of this: I am just
> showing one way or the other how many threads are tangled together
> when you start wanting something different.
>
> Anyway, note that the interpretation of
>
> gen result = a > b if !missing(a, b)
>
> is that you don't know what the result should be if either argument is
> missing, not that Stata can't tell. But you get missings from missings
> either way.
>
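
The two-variable case above can be sketched as follows. The variables a and b and their values are invented for illustration; the only functions relied on are -generate- and missing(), which returns 1 if any of its arguments is missing.

```stata
* Five cases: both observed (twice), a missing, b missing, both missing
clear
input a b
1 2
3 2
. 2
3 .
. .
end

* Two-way logic as Stata applies it: missing sorts above every number
generate byte naive = a > b

* Result left missing whenever either argument is missing
generate byte guarded = a > b if !missing(a, b)

list a b naive guarded
* naive   is 0, 1, 1, 0, 0
* guarded is 0, 1, ., ., .
```

Note that the naive result depends on which side is missing: missing a gives 1, missing b gives 0, which is part of why it bites.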
> Nick
>
> On Tue, Apr 12, 2011 at 5:02 AM, Steven Samuels <[email protected]> wrote:
>> Michael, lest you think this problem is unique to Stata, I would add that SAS sorts missing values before, not after, non-missing ones. SPSS will sort some missing values ("user-defined"), but not others ("system missing").
>>
>> Steve
>> [email protected]
>>
>> On Apr 11, 2011, at 6:15 PM, Nick Cox wrote:
>>
>> The underlying problem can be illustrated by sorting. Suppose we
>> -sort- a variable, which contains missings, in numeric order. Where do
>> the missings go? We need a decision: either missing is regarded as
>> larger than any non-missing, or smaller than any non-missing. Stata
>> made the first decision.
>>
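
A small sketch of that sorting decision, using an invented variable x: after -sort-, system missing (.) and extended missing (.a) both land after every non-missing value.

```stata
* x is a made-up variable holding three numbers and two kinds of missing
clear
input x
3
.
1
.a
2
end

sort x
list x
* order after sorting: 1, 2, 3, ., .a

* the last two observations are the missing ones
assert missing(x) in 4/5
```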
>> Anyway, here are some solutions:
>>
>> gen myvar1 = (gread_comp_score_pcnt > .79) if gread_comp_score_pcnt < .
>>
>> gen myvar2 = (gread_comp_score_pcnt > .79) if !missing(gread_comp_score_pcnt)
>>
>> gen myvar3 = cond(missing(gread_comp_score_pcnt), ., (gread_comp_score_pcnt > .79))
>>
>> gen myvar4 = (gread_comp_score_pcnt > .79) / (!missing(gread_comp_score_pcnt))
>>
>> (5. don't throw away information by turning a measure into an indicator!)
>>
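
As a check, the four one-line forms can be run on data rebuilt from the frequency table in the original question. This is only a sketch: the short names score, freq and v1-v4 are stand-ins introduced here, not names from the thread, and the data are reconstructed from the reported counts rather than taken from the actual dataset.

```stata
* Rebuild 2,000 observations from the reported frequencies
clear
input score freq
0   150
.2   85
.4   97
.6   82
.8   72
1    15
.  1499
end
expand freq
drop freq

* The four one-liners; each leaves the result missing when score is missing
generate byte v1 = (score > .79) if score < .
generate byte v2 = (score > .79) if !missing(score)
generate byte v3 = cond(missing(score), ., score > .79)
* dividing by 0 yields missing in Stata, which is what v4 relies on
generate byte v4 = (score > .79) / (!missing(score))

* All four agree (missing == missing is true in Stata)
assert v1 == v2 & v2 == v3 & v3 == v4

tabulate v1, missing
* 414 zeros, 87 ones, 1,499 missing
```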
>> Nick
>>
>> On Mon, Apr 11, 2011 at 11:01 PM, Michael Costello
>> <[email protected]> wrote:
>>> Statalisters,
>>>
>>> I recently ran into a problem with the following dataset:
>>>
>>> . tab  gread_comp_score_pcnt, m
>>> gread_comp_ |
>>>  score_pcnt |      Freq.     Percent        Cum.
>>> ------------+-----------------------------------
>>>          0 |        150        7.50        7.50
>>>         .2 |         85        4.25       11.75
>>>         .4 |         97        4.85       16.60
>>>         .6 |         82        4.10       20.70
>>>         .8 |         72        3.60       24.30
>>>          1 |         15        0.75       25.05
>>>          . |      1,499       74.95      100.00
>>> ------------+-----------------------------------
>>>      Total |      2,000      100.00
>>>
>>> The high number of "missing" is by design, a by-product of a
>>> horizontally structured dataset that I have yet to rectify.
>>>
>>> When I run the command:
>>> gen gread_comp_score_pcnt80= (gread_comp_score_pcnt>.79)
>>> I am left with
>>>
>>> . tab  gread_comp_score_pcnt80, m
>>> gread_comp_ |
>>> score_pcnt8 |
>>>          0 |      Freq.     Percent        Cum.
>>> ------------+-----------------------------------
>>>          0 |        414       20.70       20.70
>>>          1 |      1,586       79.30      100.00
>>> ------------+-----------------------------------
>>>      Total |      2,000      100.00
>>>
>>> As you can see, the 87 values above .79 were set to 1, but so were all
>>> the missing values!!  I have toyed with the code a bit, trying
>>> variations such as
>>> . gen gread_comp_score_pcnt80= (gread_comp_score_pcnt>.79 &
>>> gread_comp_score_pcnt!=.)
>>> but that converts all the missing to 0's, which is only marginally better.
>>>
>>> So the question is, is there some way to use a single, precise line of
>>> code to create eighty-seven 1's, four hundred fourteen 0's and 1499
>>> missing values in one dummy variable?  I know I can do it with several
>>> lines of code, but I'm looking for something more concise, as it needs
>>> to run many hundreds of times.
>
