RE: st: Dropping 1% observations, but numbers do not match

From   "Miguel Angel Duran" <>
To   <>
Subject   RE: st: Dropping 1% observations, but numbers do not match
Date   Wed, 10 Apr 2013 17:58:46 +0200

Thanks again, Nick. I have read your paper: very clear and helpful. What
you suggest (simplifying: drop if Var <= Number) is a very straightforward
way to get rid of values, but what I want to do is to eliminate all the
subjects that, in a panel data set, have a value lower than a threshold in
at least one period.
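
One way this could be sketched, under assumptions: the variable names `entity` and `rcon1410a` and the threshold 0.0388193 come from earlier in the thread, while the toy data, the `period` variable and the scalar name `threshold` are invented for illustration. After sorting within each entity, `rcon1410a[1]` is that entity's smallest value, so testing it drops the whole entity whenever any period is at or below the threshold. Use `<` in place of `<=` for a strict cutoff.

```stata
* Toy panel (hypothetical data): entities 1 and 3 each have one low value
clear
input entity period rcon1410a
1 1 0.50
1 2 0.01
2 1 0.40
2 2 0.45
3 1 0.03
3 2 0.90
end

* Threshold quoted in the thread; with real data it would come from r(p1)
scalar threshold = 0.0388193

* Within each entity, sort ascending so that [1] is the minimum.
* The condition is evaluated for every observation of the entity,
* so all of its observations are dropped together.
bysort entity (rcon1410a): drop if rcon1410a[1] <= threshold

list, sepby(entity)
```

Missing values sort last in Stata, so `rcon1410a[1]` is missing only when every value for that entity is missing, and such entities are kept by this condition.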


-----Original Message-----
[] On behalf of Nick Cox
Sent: Wednesday, 10 April 2013 13:15
Subject: Re: st: Dropping 1% observations, but numbers do not match

It's not a good idea to use code you don't understand!

I understand you as indicating that you are unclear about what [_N] implies
under -by:-. My numbered point #3 put it in words. I wrote a tutorial which
is easily accessible (there's a .pdf online, as below), so I won't add to
what I have written.

SJ-2-1  pr0004  . . . . . . . . . . Speaking Stata:  How to move step by:
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J.
        Q1/02   SJ 2(1):86--102                                  (no
        explains the use of the by varlist : construct to tackle
        a variety of problems with group structure, ranging from
        simple calculations for each of several groups to more
        advanced manipulations that use the built-in _n and _N

Technique for what you are asking is exemplified by

sysuse auto
su mpg, detail
scalar p1 = r(p1)
count if mpg <= p1
drop if mpg <= p1

but I can't write it down without flagging that I don't recommend -drop-ping
like this.
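
A non-destructive variant, offered only as a sketch and not as something proposed in this thread: instead of dropping, mark the low observations with an indicator so they can be excluded or examined later. The variable name `low_mpg` is made up for the example. As in the code above, the cutoff is taken from `r(p1)` and held in a scalar rather than retyped from the displayed output.

```stata
sysuse auto, clear

* Store the 1st percentile at full precision instead of retyping it
summarize mpg, detail
scalar p1 = r(p1)

* Flag, rather than drop, observations at or below the 1st percentile
generate byte low_mpg = mpg <= p1 if !missing(mpg)

* How many observations are flagged?
tabulate low_mpg

* Later commands can exclude them without altering the data, e.g.
summarize mpg if !low_mpg
```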


On 10 April 2013 12:01, Miguel Angel Duran <> wrote:
> Thank you very much, Nick, for your quick answer. Just one additional 
> question, if you don't mind. How would I drop unconditionally on the 
> identifier? And in relation to this (given your answer, just to be 
> sure I got it right), when an expression like "var[_N]" is used, what 
> does it exactly mean?

Nick Cox

> Numerous problems here, at least potentially.
> 0. Dropping outliers defined by an arbitrary threshold is not 
> everyone's idea of good data analysis practice. If you want comments 
> on what is "right", this needs defending.
> 1. Just because 0.0388193 is reported as the 1% point does not mean 
> that exactly 1% of observations have that value or less, even in a 
> situation where 1% of the number of observations is an integer. There
> could be ties.
> 2. Precision. 0.0388193 can't be held exactly as a binary number.
> Perhaps what is reported as that is really something else, e.g.
> . di %21x 0.0388193
> +1.3e01f8fe83ff0X-005
> . di %21x 0.03881931
> +1.3e01fe5ce7b79X-005
> . di %21x 0.03881929
> +1.3e01f3a020467X-005
> The number of decimal places you see does not correspond to what Stata 
> holds in storage.
> 3. You are dropping if and only if _all_ values for each identifier 
> are less than or equal to your threshold. But that would leave in the 
> data any such values if there were a greater value for the same 
> identifier. That is, you are dropping conditionally on the identifier, not
> Nick
> On 10 April 2013 11:22, Miguel Angel Duran <> wrote:
>> Could you please help me check whether what I am doing is right? To 
>> eliminate outliers, I am trying to drop 1% of the observations with 
>> the lowest values.
>> To do so I use 'bysort entity (rcon1410a): drop if  rcon1410a[_N] <= 
>> 0.0388193'. Note that 'entity' is id, 'rcon1410a' is the relevant 
>> variable, and 1% of the observations has a value that is lower than
>> 0.0388193 (this value is obtained from 'sum rcon1410a, detail'). 
>> Since I have 415,000 observations, I should be dropping 1%*415,000=4,150.
>> Nevertheless, Stata informs me that using the abovementioned command 
>> I have dropped 400 observations. Is this all right?
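
The mismatch described in the original question can be reproduced on a small scale. This is a sketch on invented data: the names `x`, `cut` and `all_low` and the values are hypothetical. It contrasts the number of observations at or below a cutoff with the number that a condition on `x[_N]` under `by:` would select.

```stata
* Toy panel (hypothetical data)
clear
input entity period x
1 1 0.01
1 2 0.02
2 1 0.01
2 2 0.50
3 1 0.40
3 2 0.60
end

scalar cut = 0.03

* Observations at or below the cutoff, ignoring the panel structure:
* three here (two in entity 1, one in entity 2)
count if x <= cut

* Under by:, after sorting on x, x[_N] is the largest value in each entity.
* The condition is therefore true only where ALL values are low:
* just the two observations of entity 1
bysort entity (x): generate byte all_low = x[_N] <= cut
count if all_low
```

Entity 2 has one low value, but its largest value is above the cutoff, so a command conditioned on `x[_N]` leaves that low value in the data.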