



Re: st: Dropping 1% observations, but numbers do not match


From   Nick Cox <njcoxstata@gmail.com>
To   "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject   Re: st: Dropping 1% observations, but numbers do not match
Date   Wed, 10 Apr 2013 17:16:01 +0100

Nick
njcoxstata@gmail.com

Now I think you are asking for

su rcon1410a, detail

scalar p1 = r(p1)

bysort entity (rcon1410a): drop if  rcon1410a[1] <= p1


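After sorting within entity, rcon1410a[1] is the minimum and rcon1410a[_N]
the maximum (assuming no missing values).

If it is necessary that all values are <= x, test whether the max is <= x.

If it is sufficient that one value is <= x, test whether the min is <= x.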
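
The difference between testing [_N] and [1] under -by:- can be seen on a toy panel. What follows is only a minimal sketch: the data, the variable x and the scalar threshold are made up for illustration and are not from the original poster's data set.

```stata
* Toy panel (hypothetical data): 3 entities, 3 periods each
clear
input entity period x
1 1 5.0
1 2 0.5
1 3 7.0
2 1 6.0
2 2 8.0
2 3 9.0
3 1 0.2
3 2 0.3
3 3 0.4
end

scalar threshold = 1

* After sorting on x within entity, x[_N] is the group maximum.
* Only entity 3 has ALL values <= threshold, so 3 observations go.
preserve
bysort entity (x): drop if x[_N] <= threshold
assert _N == 6
restore

* x[1] is the group minimum.
* Entities 1 and 3 have AT LEAST ONE value <= threshold, so 6 observations go.
bysort entity (x): drop if x[1] <= threshold
assert _N == 3
```

Missing values sort last, so with missings present x[_N] would be missing and the maximum test would never be true for that entity; the sketch assumes there are none.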

Nick

On 10 April 2013 16:58, Miguel Angel Duran <maduran@uma.es> wrote:

> Thanks again, Nick. I have read your paper: very clear and helpful. What
> you suggest (simplifying: drop if Var <= Number) is a very straightforward
> way to get rid of values, but what I want to do is to eliminate all the
> subjects that, in a panel data set, have a value lower than a threshold in
> at least one period.
>
> Miguel.
>
> -----Original message-----
> From: owner-statalist@hsphsun2.harvard.edu
> [mailto:owner-statalist@hsphsun2.harvard.edu] On behalf of Nick Cox
> Sent: Wednesday, 10 April 2013 13:15
> To: statalist@hsphsun2.harvard.edu
> Subject: Re: st: Dropping 1% observations, but numbers do not match
>
> It's not a good idea to use code you don't understand!
>
> I understand you as indicating that you are unclear about what [_N] implies
> under -by:-. My numbered point #3 put it in words. I wrote a tutorial which
> is easily accessible (there's a .pdf online, as below), so I won't add to
> what I have written.
>
> SJ-2-1  pr0004  . . . . . . . . Speaking Stata:  How to move step by: step
>         . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
>         Q1/02   SJ 2(1):86--102                             (no commands)
>         explains the use of the by varlist : construct to tackle
>         a variety of problems with group structure, ranging from
>         simple calculations for each of several groups to more
>         advanced manipulations that use the built-in _n and _N
>
> http://www.stata-journal.com/sjpdf.html?articlenum=pr0004
>
> Technique for what you are asking is exemplified by
>
> sysuse auto
> su mpg, detail
> scalar p1 = r(p1)
> count if mpg <= p1
> drop if mpg <= p1
>
> but I can't write it down without flagging that I don't recommend -drop-ping
> like this.
>
> Nick
> njcoxstata@gmail.com
>
> On 10 April 2013 12:01, Miguel Angel Duran <maduran@uma.es> wrote:
>> Thank you very much, Nick, for your quick answer. Just one additional
>> question, if you don't mind. How would I drop unconditionally on the
>> identifier? And in relation to this (given your answer, just to be
>> sure I got it right), when an expression like "var[_N]" is used, what
>> exactly does it mean?
>
> Nick Cox
>
>> Numerous problems here, at least potentially.
>>
>> 0. Dropping outliers defined by an arbitrary threshold is not
>> everyone's idea of good data analysis practice. If you want comments
>> on what is "right", this needs defending.
>>
>> 1. Just because 0.0388193 is reported as the 1% point does not mean
>> that exactly 1% of observations have that value or less, even in a
>> situation where 1% of the number of observations is an integer. There
>> could be ties.
>>
>> 2. Precision. 0.0388193 can't be held exactly as a binary number.
>> Perhaps what is reported as that is really something else, e.g.
>>
>> . di %21x 0.0388193
>> +1.3e01f8fe83ff0X-005
>>
>> . di %21x 0.03881931
>> +1.3e01fe5ce7b79X-005
>>
>> . di %21x 0.03881929
>> +1.3e01f3a020467X-005
>>
>> The number of decimal places you see does not correspond to what Stata
>> holds in storage.
>>
>> 3. You are dropping if and only if _all_ values for each identifier
>> are less than or equal to your threshold. But that would leave in the
>> data any such values if there were a greater value for the same
>> identifier. That is, you are dropping conditionally on the identifier,
>> not unconditionally.
>>
>> Nick
>> njcoxstata@gmail.com
>>
>> On 10 April 2013 11:22, Miguel Angel Duran <maduran@uma.es> wrote:
>>
>>> Could you please help me check whether what I am doing is right? To
>>> eliminate outliers, I am trying to drop the 1% of observations with
>>> the lowest values.
>>> To do so I use 'bysort entity (rcon1410a): drop if  rcon1410a[_N] <=
>>> 0.0388193'. Note that 'entity' is the id, 'rcon1410a' is the relevant
>>> variable, and 1% of the observations have a value that is lower than
>>> 0.0388193 (this value is obtained from 'sum rcon1410a, detail').
>>> Since I have 415,000 observations, I should be dropping 1%*415,000=4,150.
>>> Nevertheless, Stata informs me that using the abovementioned command
>>> I have dropped 400 observations. Is this all right?
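
Points 1 and 2 in the reply quoted above (ties and precision) can be checked directly. The sketch below uses the auto data shipped with Stata, not the original poster's data; mpg and gear_ratio merely stand in for rcon1410a.

```stata
sysuse auto, clear

* Ties: the number of observations at or below the 1% point
* need not equal 1% of the number of observations
summarize mpg, detail
scalar p1 = r(p1)
count if mpg <= p1
display "1% of N = " 0.01 * _N

* Precision: the default display of r(p1) is rounded, while %21x
* shows the value actually held, in hexadecimal
summarize gear_ratio, detail
display r(p1)
display %21x r(p1)

* A typed decimal is evaluated as a double and generally differs
* from the same decimal held in a float variable
display 0.1 == float(0.1)    // displays 0
```

Comparing against the saved result r(p1), copied into a scalar, rather than against a retyped decimal such as 0.0388193 sidesteps the precision problem, although ties can still make the count differ from 1%.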
