Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From: Maarten Buis <maartenlbuis@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: _N in by-groups
Date: Fri, 19 Aug 2011 17:14:21 +0200

On Fri, Aug 19, 2011 at 4:40 PM, <mcross@exemail.com.au> wrote:
> I too am confused regarding when _N is or isn't influenced
> by the -by:- prefix.
>
> I would like to remove a single outlier from each group within
> the following data set...
>
> input group var1
>
> 1 4
> 1 5
> 1 81
> 2 2
> 2 3
> 2 3
> 2 72
>
> end
>
> I would then like to calculate the mean for each group (with the outliers
> gone).
>
> I assumed that the following code would do the trick...
>
> by group (var1), sort: egen average = mean(var1) if var1 != var1[_N]

Your if condition is wrong: _N is the number of observations within
your group, and var1[_N] gives you the _Nth value in your entire
dataset. So in your second group _N = 4, and var1[_N] refers to the
fourth value of var1 in your entire dataset, i.e. 2 instead of 72,
which is obviously not what you want. Instead, your if condition
should have been if _n != _N.

This is still not very stable, as it is susceptible to missing
values. Better is:

gen mis = missing(var1)
by group mis (var1), sort: egen average = mean(var1) if _n != _N & mis == 0

However, there seems to be a bug in _gmean (the program that -egen-
calls to compute the means) in the way it handles such selection
criteria. So you'll need to do a bit more work:

*---------------- begin example ---------------
clear
input group var1
1 4
1 5
1 81
2 2
2 3
2 3
2 72
end

tempvar touse mis
quietly {
    * flag missing values so that they form their own by-group
    gen byte `mis' = missing(var1)
    * mark every non-missing observation except the largest one in
    * each group; (var1) makes sure the largest value is sorted last
    bys group `mis' (var1): gen byte `touse' = 1 if _n != _N & `mis' == 0
    sort `touse' group
    * running mean within each group of selected observations
    by `touse' group: gen double average = sum(var1)/sum((var1)<.) if `touse'==1
    * copy the final running mean to all observations in that group
    by `touse' group: replace average = average[_N]
}
*----------------------- end example -------------------------
(For more on examples I sent to the Statalist see:
http://www.maartenbuis.nl/example_faq )

Notice, however, that from a scientific viewpoint such automatic
procedures of dropping the most informative observations in your data
are obviously completely and utterly wrong, see e.g.:
<http://www.stata.com/statalist/archive/2011-08/msg00398.html>

Hope this helps,
Maarten

--------------------------
Maarten L. Buis
Institut fuer Soziologie
Universitaet Tuebingen
Wilhelmstrasse 36
72074 Tuebingen
Germany

http://www.maartenbuis.nl
--------------------------
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
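
The group-wise logic discussed in this post can also be sketched with plain -generate- under -by:-, which avoids passing an if condition through -egen- altogether. This is only a minimal sketch, not the method from the post: the variable names groupsize, groupmax, and runsum are made up for illustration, and it assumes there are no missing values in var1 and at least two observations per group. Unlike the example in the post, it also fills in the mean for the dropped observation itself.

```stata
clear
input group var1
1 4
1 5
1 81
2 2
2 3
2 3
2 72
end

* Sort by var1 within group. With -generate- under -by:-, _n, _N and
* explicit [ ] subscripts are evaluated within each by-group.
bysort group (var1): gen long groupsize = _N
by group: gen double groupmax = var1[_N]

* Running sum within group; its last value is the group total.
by group: gen double runsum = sum(var1)

* Mean of all but the largest value in each group
* (assumption: no missing values, at least two observations per group).
by group: gen double average = (runsum[_N] - var1[_N]) / (_N - 1)

* Check against hand arithmetic: (4+5)/2 and (2+3+3)/3.
assert average == 4.5 if group == 1
assert abs(average - 8/3) < 1e-12 if group == 2

list group var1 groupsize groupmax average, sepby(group)
```

Listing groupsize and groupmax next to var1 makes it easy to see what _N and var1[_N] resolve to in each group before trusting the computed mean.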