Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down on April 23, and its replacement, **statalist.org**, is already up and running.


From: Steven Samuels <sjsamuels@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: limitations of "generate" with missing data
Date: Tue, 12 Apr 2011 00:02:53 -0400

Michael, lest you think this problem is unique to Stata, I would add that SAS sorts missing values before, not after, non-missing ones. SPSS will sort some missing values ("user-defined"), but not others ("system missing").

Steve
sjsamuels@gmail.com

On Apr 11, 2011, at 6:15 PM, Nick Cox wrote:

The underlying problem can be illustrated by sorting. Suppose we -sort- a variable, which contains missings, in numeric order. Where do the missings go? We need a decision: either missing is regarded as larger than any non-missing, or smaller than any non-missing. Stata made the first decision.

Anyway, here are some solutions:

    gen myvar1 = (gread_comp_score_pcnt > .79) if gread_comp_score_pcnt < .

    gen myvar2 = (gread_comp_score_pcnt > .79) if !missing(gread_comp_score_pcnt)

    gen myvar3 = cond(missing(gread_comp_score_pcnt), ., (gread_comp_score_pcnt > .79))

    gen myvar4 = (gread_comp_score_pcnt > .79) / (!missing(gread_comp_score_pcnt))

(5. don't throw away information by turning a measure into an indicator!)

Nick

On Mon, Apr 11, 2011 at 11:01 PM, Michael Costello <michaelavcostello@gmail.com> wrote:
> Statalisters,
>
> I recently ran into a problem with the following dataset:
>
> . tab gread_comp_score_pcnt, m
> gread_comp_ |
>  score_pcnt |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>           0 |        150        7.50        7.50
>          .2 |         85        4.25       11.75
>          .4 |         97        4.85       16.60
>          .6 |         82        4.10       20.70
>          .8 |         72        3.60       24.30
>           1 |         15        0.75       25.05
>           . |      1,499       74.95      100.00
> ------------+-----------------------------------
>       Total |      2,000      100.00
>
> The high number of "missing" is by design, a by-product of a
> horizontally structured dataset that I have yet to rectify.
>
> When I run the command:
> gen gread_comp_score_pcnt80= (gread_comp_score_pcnt>.79)
> I am left with
>
> . tab gread_comp_score_pcnt80, m
> gread_comp_ |
> score_pcnt8 |
>           0 |      Freq.     Percent        Cum.
> ------------+-----------------------------------
>           0 |        414       20.70       20.70
>           1 |      1,586       79.30      100.00
> ------------+-----------------------------------
>       Total |      2,000      100.00
>
> As you can see, the 87 values above .79 were set to 1, but so were all
> the missing values!! I have toyed with the code a bit, trying
> variations such as
> . gen gread_comp_score_pcnt80= (gread_comp_score_pcnt>.79 &
> gread_comp_score_pcnt!=.)
> but that converts all the missing to 0's, which is only marginally better.
>
> So the question is, is there some way to use a single, precise line of
> code to create eighty-seven 1's, four hundred fourteen 0's and 1499
> missing values in one dummy variable? I know I can do it with several
> lines of code, but I'm looking for something more concise, as it needs
> to run many hundreds of times.
>

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
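Editor's note: the behaviour discussed in this thread can be reproduced on a small toy dataset. The sketch below is only an illustration; the variable names `score`, `naive`, and `flag` and the five data values are made up and are not from the original poster's data. It contrasts the unguarded comparison with the `if !missing()` form suggested in the thread.

```stata
* Toy data (hypothetical values, not the poster's dataset)
clear
input score
0
.2
.8
1
.
end

* Unguarded comparison: numeric missing (.) is treated as larger than
* any nonmissing number, so (. > .79) evaluates to 1
generate byte naive = (score > .79)

* Guarded comparison: observations excluded by the if qualifier are
* left missing in the new variable
generate byte flag = (score > .79) if !missing(score)

list, noobs

* Checks: naive flags the missing observation, flag does not
count if naive == 1
assert r(N) == 3
count if flag == 1
assert r(N) == 2
count if missing(flag)
assert r(N) == 1
```

The `if` qualifier keeps the whole operation in a single `generate` statement, which seems to be what the original question was after.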

**Follow-Ups**:
- **Re: st: limitations of "generate" with missing data** *From:* Nick Cox <njcoxstata@gmail.com>

**References**:
- **st: limitations of "generate" with missing data** *From:* Michael Costello <michaelavcostello@gmail.com>
- **Re: st: limitations of "generate" with missing data** *From:* Nick Cox <njcoxstata@gmail.com>
