



Re: st: Keep/Drop Observations for Top/Bottom X%


From   Nick Cox <njcoxstata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Keep/Drop Observations for Top/Bottom X%
Date   Thu, 11 Oct 2012 11:07:27 +0100

That's undoubtedly correct. If you keep observations in memory that
you don't use, then indeed every analysis command needs an -if-
qualifier. It's best to generate an indicator, say

. gen thisuse = inrange(mpg, 29, .)

and follow with commands -if thisuse-.
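
A minimal sketch of that indicator approach, taking the cutoff from -_pctile- rather than typing 29 by hand. The regression of price on weight is only an illustrative assumption; any analysis command could take its place.

```stata
sysuse auto, clear

* 90th percentile of mpg, left behind in r(r1)
_pctile mpg, p(90)

* 1 for observations at or above the cutoff, 0 otherwise;
* inrange() with . as the upper limit returns 0 when mpg is missing
gen byte thisuse = inrange(mpg, r(r1), .)

* illustrative model only: restrict to the flagged subset
regress price weight if thisuse
```

Ties at the cutoff are all flagged, so the subset need not be exactly 10% of the observations.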

When people want to do this, in my experience they want to play with
focusing on different subsets, which would usually mean reading the
whole dataset back in again on the -drop- strategy. Also, with the
-drop- strategy you can't compare those -drop-ped with those not
-drop-ped.
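
A sketch of how that might look on the auto data: indicators for two different subsets sit side by side in memory, so switching focus or comparing flagged with unflagged observations needs no re-reading of the dataset. The names top10 and bottom10, and the use of price and weight, are assumptions for illustration.

```stata
sysuse auto, clear

* 10th and 90th percentiles of mpg; copy them to locals straight away
_pctile mpg, p(10 90)
local lo = r(r1)
local hi = r(r2)

* missing mpg counts as larger than any number, so <= never flags it
gen byte bottom10 = mpg <= `lo'
gen byte top10 = inrange(mpg, `hi', .)

* compare the top group with everything else
tabstat price weight, by(top10) statistics(mean sd n)

* switch subsets freely; note that !top10 would also pick up
* observations with missing mpg, if there were any
regress price weight if bottom10
regress price weight if !top10
```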

I do -drop- inconvenient observations all the time, but for
problems like Lisa's I would lean marginally towards what I suggest.

There is at least a small down-side to every way of doing this.

Nick

On Thu, Oct 11, 2012 at 10:54 AM, Justina Fischer <JAVFischer@gmx.de> wrote:
> Hi Nick,
>
> in principle you might be right.
>
> However, for practical reasons it is sometimes advisable in subset analysis simply to load the full data and drop part of it, rather than working with an 'if' restriction throughout all regressions.
>
> HTH
>
> Justina
>
>
> -------- Original Message --------
>> Date: Thu, 11 Oct 2012 10:46:02 +0100
>> From: Nick Cox <njcoxstata@gmail.com>
>> To: statalist@hsphsun2.harvard.edu
>> Subject: Re: st: Keep/Drop Observations for Top/Bottom X%
>
>> You need not -keep- or -drop- to do this; in fact -keep- or -drop-
>> here is usually a bad idea.
>>
>> (Furthermore, regressions of this kind are often more problematic than
>> they seem, but I'll let others expand on that if they wish.)
>>
>> For full flexibility here, skip -summarize- and go straight to -_pctile-.
>>
>> For example,
>>
>> . sysuse auto
>> (1978 Automobile Data)
>>
>> . _pctile mpg, p(10 90)
>>
>> . ret li
>>
>> scalars:
>>                  r(r1) =  14
>>                  r(r2) =  29
>>
>> So you can follow up with
>>
>> ... if mpg >= 29
>>
>> Warnings:
>>
>> 1. Watch out for ties.
>>
>> 2. Watch out for missing values at the top end.
>>
>> ... if mpg >= 29
>>
>> would include missings on -mpg- (if there were any).  -if inrange(mpg,
>> 29, .)- excludes the missings.
>>
>> Nick
>>
>> On Thu, Oct 11, 2012 at 10:34 AM, Lisa Wang <lhwang0925@gmail.com> wrote:
>>
>> > I am unsure as to how I would go about keeping or dropping the
>> > top/bottom X% of observations of a variable. I would like to do this
>> > for further analysis on a subset of my data. For instance, I want to
>> > do some further regressions for the top 10% of my observations based
>> > on 'distance from home' and not the whole data set.