Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down at the end of May, and its replacement, **statalist.org**, is already up and running.


From: Nick Cox <njcoxstata@gmail.com>

To: statalist@hsphsun2.harvard.edu

Subject: Re: st: limitations of "generate" with missing data

Date: Tue, 12 Apr 2011 22:44:09 +0100

See also, or rather instead:

FAQ: Logical expressions and missing values (W. Gould, 2/03)
Why is x > 1000 true when x contains missing value?
http://www.stata.com/support/faqs/data/values.html

On Tue, Apr 12, 2011 at 9:54 AM, Nick Cox <njcoxstata@gmail.com> wrote:
> If we generalise to
>
> gen result = a > b
>
> and focus on -a-, -b- numeric (the comparison makes sense for strings
> too) then in a way it's reasonable to expect three possible answers,
> say 0 for false, 1 for true and ? for "can't tell; at least one
> argument is missing". I use ? for the sake of argument to detach the
> argument slightly from what Stata does at present.
>
> I've been at two users' meetings when talks have proposed three-way
> logic of this kind. The talk is sure-fire guaranteed to generate
> discussion about as long as the talk and to split the audience three
> ways, namely
>
> 1. Stata's two-way logic often bites -- most commonly perhaps in this
> case -- but you get used to it, mostly, it is too late to "fix" it,
> and no other solution is better.
>
> 2. There is a good case for a three-way logic but definitely not that
> proposed by the speaker, which is quite illogical.
>
> 3. The speaker is right and Stata is fundamentally flawed and should
> change its ways and documentation forthwith.
>
> I think the crunch is that although a different rule may make sense
> for at least some problems, the bigger difficulty is being consistent
> and having as few rules as possible and not introducing problems that
> are worse, and more difficult to understand. For example, and a very
> long article could be written about this, although I don't intend to
> do it:
>
> * Once ? is allowed as a logical result, then the truth tables need to
> be expanded for 0 & ?, 1 & ?, 0 | ?, etc., etc.
>
> * Once ? is allowed as a logical result, you need a rule on where it
> goes on sorting. (That need not be that ? is just numeric missing.)
>
> * Once ? is allowed as a logical result, what about ? + a, ? - a, ...
> log(?). Those are probably all easy but that won't stop users being
> puzzled by the results.
>
> * What about .a ... .z ???
>
> * Do you need new functions and operators?
>
> * If you change Stata, quite what is allowed under version control?
>
> I know that no-one is necessarily proposing _any_ of this: I am just
> showing one way or the other how many threads are tangled together
> when you start wanting something different.
>
> Anyway, note that the interpretation of
>
> gen result = a > b if !missing(a, b)
>
> is that you don't know what the result should be if either argument is
> missing, not that Stata can't tell. But you get missings from missings
> either way.
>
> Nick
>
> On Tue, Apr 12, 2011 at 5:02 AM, Steven Samuels <sjsamuels@gmail.com> wrote:
>> Michael, lest you think this problem is unique to Stata, I would add
>> that SAS sorts missing values before, not after, non-missing ones.
>> SPSS will sort some missing values ("user-defined"), but not others
>> ("system missing").
>>
>> Steve
>> sjsamuels@gmail.com
>>
>> On Apr 11, 2011, at 6:15 PM, Nick Cox wrote:
>>
>> The underlying problem can be illustrated by sorting. Suppose we
>> -sort- a variable, which contains missings, in numeric order. Where do
>> the missings go? We need a decision: either missing is regarded as
>> larger than any non-missing, or smaller than any non-missing. Stata
>> made the first decision.
>>
>> Anyway, here are some solutions:
>>
>> gen myvar1 = (gread_comp_score_pcnt > .79) if gread_comp_score_pcnt < .
>>
>> gen myvar2 = (gread_comp_score_pcnt > .79) if !missing(gread_comp_score_pcnt)
>>
>> gen myvar3 = cond(missing(gread_comp_score_pcnt), ., (gread_comp_score_pcnt > .79))
>>
>> gen myvar4 = (gread_comp_score_pcnt > .79) / (!missing(gread_comp_score_pcnt))
>>
>> (5. don't throw away information by turning a measure into an indicator!)
>>
>> Nick
>>
>> On Mon, Apr 11, 2011 at 11:01 PM, Michael Costello
>> <michaelavcostello@gmail.com> wrote:
>>> Statalisters,
>>>
>>> I recently ran into a problem with the following dataset:
>>>
>>> . tab gread_comp_score_pcnt, m
>>>
>>> gread_comp_ |
>>>  score_pcnt |      Freq.     Percent        Cum.
>>> ------------+-----------------------------------
>>>           0 |        150        7.50        7.50
>>>          .2 |         85        4.25       11.75
>>>          .4 |         97        4.85       16.60
>>>          .6 |         82        4.10       20.70
>>>          .8 |         72        3.60       24.30
>>>           1 |         15        0.75       25.05
>>>           . |      1,499       74.95      100.00
>>> ------------+-----------------------------------
>>>       Total |      2,000      100.00
>>>
>>> The high number of "missing" is by design, a by-product of a
>>> horizontally structured dataset that I have yet to rectify.
>>>
>>> When I run the command:
>>> gen gread_comp_score_pcnt80= (gread_comp_score_pcnt>.79)
>>> I am left with
>>>
>>> . tab gread_comp_score_pcnt80, m
>>>
>>> gread_comp_ |
>>> score_pcnt8 |
>>>           0 |      Freq.     Percent        Cum.
>>> ------------+-----------------------------------
>>>           0 |        414       20.70       20.70
>>>           1 |      1,586       79.30      100.00
>>> ------------+-----------------------------------
>>>       Total |      2,000      100.00
>>>
>>> As you can see, the 87 values above .79 were set to 1, but so were all
>>> the missing values!! I have toyed with the code a bit, trying
>>> variations such as
>>> . gen gread_comp_score_pcnt80= (gread_comp_score_pcnt>.79 & gread_comp_score_pcnt!=.)
>>> but that converts all the missing to 0's, which is only marginally better.
>>>
>>> So the question is, is there some way to use a single, precise line of
>>> code to create eighty-seven 1's, four hundred fourteen 0's and 1499
>>> Missing values in one dummy variable? I know I can do it with several
>>> lines of code, but I'm looking for something more concise, as it needs
>>> to run many hundreds of times.
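For readers who want to try the behaviour discussed in this thread, here is a minimal sketch on a made-up five-observation dataset. The variable names `score`, `naive` and `ind` and the toy values are assumptions for illustration only, not Michael's data. It contrasts the plain comparison, where numeric missing counts as larger than any number, with the `if !missing()` form suggested above.

```stata
* Toy data (hypothetical): four observed scores and one missing value
clear
input float score
0
.2
.8
1
.
end

* Plain comparison: missing is treated as larger than .79, so it yields 1
generate byte naive = (score > .79)

* Restricting to non-missing observations leaves the excluded row missing
generate byte ind = (score > .79) if !missing(score)

list, clean
```

In the listing, the last observation should show 1 for `naive` and missing for `ind`, while the other four observations agree. The `cond()` and division forms quoted above should give the same indicator; which one reads best is a matter of taste.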

