Re: st: _N in by-groups


From   Maarten Buis <maartenlbuis@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: _N in by-groups
Date   Fri, 19 Aug 2011 17:14:21 +0200

On Fri, Aug 19, 2011 at 4:40 PM,  <mcross@exemail.com.au> wrote:
> I too am confused regarding when _N is or isn't influenced
> by the -by:- prefix.
>
> I would like to remove a single outlier from each group within
> the following data set...
>
> input group var1
>
> 1      4
> 1      5
> 1     81
> 2      2
> 2      3
> 2      3
> 2     72
>
> end
>
> I would then like to calculate the mean for each group (with the outliers
> gone).
>
> I assumed that the following code would do the trick…
>
>
> by group (var1), sort: egen average = mean(var1) if var1 != var1[_N]

Your if condition does not do what you want. Under by:, -generate-
evaluates _n, _N and subscripts such as var1[_N] within each group,
but -egen- does not appear to evaluate the if condition group by
group, so var1[_N] need not be the largest value of var1 in the
current group. What you want to express is if _n != _N, i.e.
everything except the last observation in each group after sorting on
var1. That is still not very robust, as missing values sort last and
would be the observations dropped. So ideally you would write:

gen mis = missing(var1)
bysort group mis (var1): egen average = mean(var1) if _n != _N & mis == 0
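
As a quick check of what _n, _N and var1[_N] refer to under by: with
-generate-, here is a minimal sketch using the data from the question.
The variable names n_in_group, group_size and last_value are made up
for illustration only.

```stata
* Sketch only: n_in_group, group_size and last_value are made-up names
clear
input group var1
1      4
1      5
1     81
2      2
2      3
2      3
2     72
end

* sort on var1 within group; _n then runs 1, 2, ... within each group
bysort group (var1): gen n_in_group = _n
* _N is the number of observations in the current group
by group: gen group_size = _N
* a subscript is relative to the group too: var1[_N] is its last value
by group: gen last_value = var1[_N]

list, sepby(group)
```

With -generate- every one of these is evaluated separately for each
value of group, which is the behaviour the if condition above relies on.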

However, _gmean (the program that -egen- calls to compute the means)
does not handle such selection criteria the way you would hope; the
-egen- documentation itself advises against explicit subscripting
(using _n and _N). So you'll need to do a bit more work:

*---------------- begin example ---------------
clear
input group var1
1      4
1      5
1     81
2      2
2      3
2      3
2     72
end

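tempvar touse mis
quietly {
    * flag missing values so they end up in their own subgroup
    gen byte `mis' = missing(var1)
    * sort on var1 within group and mark all but the largest nonmissing value
    bys group `mis' (var1): gen byte `touse' = 1 if _n != _N & `mis' == 0
    sort `touse' group
    * running mean over the marked observations ...
    by `touse' group: gen double average = sum(var1)/sum((var1)<.) if `touse'==1
    * ... whose last value within the group is the group mean
    by `touse' group: replace average = average[_N]
}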
*----------------------- end example -------------------------
(For more on examples I sent to the Statalist see:
http://www.maartenbuis.nl/example_faq )
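
An alternative that keeps -egen- but avoids _n and _N inside the -egen-
call: build the selection marker with -generate- under by: first, and
then give -egen- an ordinary indicator variable in its if condition.
This is only a sketch under the same assumptions as the example above;
the names mis, keep and average2 are made up for illustration.

```stata
* Sketch only: mis, keep and average2 are made-up names
clear
input group var1
1      4
1      5
1     81
2      2
2      3
2      3
2     72
end

gen byte mis = missing(var1)
* mark every nonmissing observation except the largest one in its group
bysort group mis (var1): gen byte keep = (_n != _N) & !mis
* no _n or _N in the -egen- call itself, only a plain indicator variable
egen double average2 = mean(var1) if keep, by(group)

list, sepby(group)
```

As in the example above, the excluded observations end up with a
missing value for the mean.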

Note, however, that from a scientific viewpoint such an automatic
procedure of dropping the most informative observations in your data
is obviously completely and utterly wrong; see e.g.:
<http://www.stata.com/statalist/archive/2011-08/msg00398.html>

Hope this helps,
Maarten

--------------------------
Maarten L. Buis
Institut fuer Soziologie
Universitaet Tuebingen
Wilhelmstrasse 36
72074 Tuebingen
Germany


http://www.maartenbuis.nl
--------------------------
