

From: Nick Cox <njcoxstata@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: limitations of "generate" with missing data
Date: Tue, 12 Apr 2011 09:54:26 +0100

If we generalise to

    gen result = a > b

and focus on -a-, -b- numeric (the comparison makes sense for strings
too) then in a way it's reasonable to expect three possible answers,
say 0 for false, 1 for true and ? for "can't tell; at least one
argument is missing". I use ? for the sake of argument to detach the
argument slightly from what Stata does at present.

I've been at two users' meetings when talks have proposed three-way
logic of this kind. The talk is guaranteed to generate discussion
about as long as the talk and to split the audience three ways, namely

1. Stata's two-way logic often bites -- most commonly perhaps in this
case -- but you get used to it, mostly, it is too late to "fix" it,
and no other solution is better.

2. There is a good case for a three-way logic but definitely not that
proposed by the speaker, which is quite illogical.

3. The speaker is right and Stata is fundamentally flawed and should
change its ways and documentation forthwith.

I think the crunch is that although a different rule may make sense
for at least some problems, the bigger difficulty is being consistent
and having as few rules as possible and not introducing problems that
are worse, and more difficult to understand.

For example, and a very long article could be written about this,
although I don't intend to do it:

* Once ? is allowed as a logical result, then the truth tables need to
be expanded for 0 & ?, 1 & ?, 0 | ?, etc., etc.

* Once ? is allowed as a logical result, you need a rule on where it
goes on sorting. (That need not be that ? is just numeric missing.)

* Once ? is allowed as a logical result, what about ? + a, ? - a, ...
log(?). Those are probably all easy but that won't stop users being
puzzled by the results.

* What about .a ... .z ???

* Do you need new functions and operators?

* If you change Stata, quite what is allowed under version control?
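To make the first bullet concrete, here is a minimal sketch of what expanded truth tables could look like. It assumes Kleene-style rules (0 & ? is 0, 1 | ? is 1, otherwise ? propagates) and uses numeric missing to stand in for ?. Both choices are assumptions made purely for illustration, not what Stata does, and the variable names p, q, and2, and3 and or3 are hypothetical.

```stata
clear
* All nine combinations of false (0), true (1) and "can't tell" (.)
input p q
0 0
0 1
0 .
1 0
1 1
1 .
. 0
. 1
. .
end

* Stata's existing two-way logic: any nonzero value, including
* missing, counts as true, so the result is always 0 or 1
gen and2 = p & q

* Assumed three-way AND: false if either argument is false,
* otherwise "can't tell" if either is missing, otherwise true
gen and3 = cond(p == 0 | q == 0, 0, cond(missing(p, q), ., 1))

* Assumed three-way OR: true if either argument is true,
* otherwise "can't tell" if either is missing, otherwise false
gen or3 = cond(p == 1 | q == 1, 1, cond(missing(p, q), ., 0))

list, noobs sep(3)
```

Comparing and2 with and3 shows where the two rules part company: for 1 & . the existing operator returns 1, while the assumed three-way rule returns missing.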
I know that no-one is necessarily proposing _any_ of this: I am just
showing one way or the other how many threads are tangled together
when you start wanting something different.

Anyway, note that the interpretation of

    gen result = a > b if !missing(a, b)

is that you don't know what the result should be if either argument is
missing, not that Stata can't tell. But you get missings from missings
either way.

Nick

On Tue, Apr 12, 2011 at 5:02 AM, Steven Samuels <sjsamuels@gmail.com> wrote:
> Michael, lest you think this problem is unique to Stata, I would add that SAS sorts missing values before, not after, non-missing ones. SPSS will sort some missing values ("user-defined"), but not others ("system missing").
>
> Steve
> sjsamuels@gmail.com
>
> On Apr 11, 2011, at 6:15 PM, Nick Cox wrote:
>
> The underlying problem can be illustrated by sorting. Suppose we
> -sort- a variable, which contains missings, in numeric order. Where do
> the missings go? We need a decision: either missing is regarded as
> larger than any non-missing, or smaller than any non-missing. Stata
> made the first decision.
>
> Anyway, here are some solutions:
>
> gen myvar1 = (gread_comp_score_pcnt>.79) if gread_comp_score_pcnt < .
>
> gen myvar2 = (gread_comp_score_pcnt>.79) if !missing(gread_comp_score_pcnt)
>
> gen myvar3 = cond(missing(gread_comp_score_pcnt), ., (gread_comp_score_pcnt > .79))
>
> gen myvar4 = (gread_comp_score_pcnt > .79) / (!missing(gread_comp_score_pcnt))
>
> (5. don't throw away information by turning a measure into an indicator!)
>
> Nick
>
> On Mon, Apr 11, 2011 at 11:01 PM, Michael Costello
> <michaelavcostello@gmail.com> wrote:
>> Statalisters,
>>
>> I recently ran into a problem with the following dataset:
>>
>> . tab gread_comp_score_pcnt, m
>>
>> gread_comp_ |
>>  score_pcnt |      Freq.     Percent        Cum.
>> ------------+-----------------------------------
>>           0 |        150        7.50        7.50
>>          .2 |         85        4.25       11.75
>>          .4 |         97        4.85       16.60
>>          .6 |         82        4.10       20.70
>>          .8 |         72        3.60       24.30
>>           1 |         15        0.75       25.05
>>           . |      1,499       74.95      100.00
>> ------------+-----------------------------------
>>       Total |      2,000      100.00
>>
>> The high number of "missing" is by design, a by-product of a
>> horizontally structured dataset that I have yet to rectify.
>>
>> When I run the command:
>> gen gread_comp_score_pcnt80= (gread_comp_score_pcnt>.79)
>> I am left with
>>
>> . tab gread_comp_score_pcnt80, m
>>
>> gread_comp_ |
>> score_pcnt8 |
>>           0 |      Freq.     Percent        Cum.
>> ------------+-----------------------------------
>>           0 |        414       20.70       20.70
>>           1 |      1,586       79.30      100.00
>> ------------+-----------------------------------
>>       Total |      2,000      100.00
>>
>> As you can see, the 87 values above .79 were set to 1, but so were all
>> the missing values!! I have toyed with the code a bit, trying
>> variations such as
>> . gen gread_comp_score_pcnt80= (gread_comp_score_pcnt>.79 &
>> gread_comp_score_pcnt!=.)
>> but that converts all the missing to 0's, which is only marginally better.
>>
>> So the question is, is there some way to use a single, precise line of
>> code to create eighty-seven 1's, four hundred fourteen 0's and 1499
>> Missing values in one dummy variable? I know I can do it with several
>> lines of code, but I'm looking for something more concise, as it needs
>> to run many hundreds of times.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
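The one-line solutions quoted in the thread can be checked on a toy dataset. The sketch below rebuilds the frequencies from the first table in Michael's message and applies the naive command plus three of the suggested forms. The short variable names score, freq, naive, ind_if, ind_cond and ind_div are hypothetical stand-ins for gread_comp_score_pcnt and its derivatives.

```stata
clear
* Rebuild the tabulated data: one row per distinct value, with its frequency
input score freq
0    150
.2   85
.4   97
.6   82
.8   72
1    15
.    1499
end
expand freq          // replicate each row freq times: 2,000 observations
drop freq

* Naive indicator: missing counts as larger than .79, so missings become 1
gen byte naive = (score > .79)

* -if- qualifier: observations with missing score are left missing
gen byte ind_if = (score > .79) if !missing(score)

* cond(): return missing explicitly when score is missing
gen byte ind_cond = cond(missing(score), ., score > .79)

* Division trick: dividing by 0 yields missing in Stata
gen byte ind_div = (score > .79) / (!missing(score))

tab naive, missing   // 414 zeros, 1,586 ones
tab ind_if, missing  // 414 zeros, 87 ones, 1,499 missing
```

The three guarded forms agree with each other observation by observation. Which to prefer is a matter of taste, although the -if- qualifier arguably states the intent most directly.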

