



RE: st: Dropping 1% observations, but numbers do not match


From   "Miguel Angel Duran" <maduran@uma.es>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: Dropping 1% observations, but numbers do not match
Date   Wed, 10 Apr 2013 13:01:42 +0200

Thank you very much, Nick, for your quick answer. Just one additional
question, if you don't mind: how would I drop unconditionally on the
identifier? And, related to this (just to be sure I understood your
answer), what exactly does an expression like "var[_N]" mean?

-----Original message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On behalf of Nick Cox
Sent: Wednesday, 10 April 2013 12:47
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Dropping 1% observations, but numbers do not match

Numerous problems here, at least potentially.

0. Dropping outliers defined by an arbitrary threshold is not everyone's
idea of good data analysis practice. If you want comments on what is
"right", this needs defending.

1. Just because 0.0388193 is reported as the 1% point does not mean that
exactly 1% of observations have that value or less, even in a situation
where 1% of the number of observations is an integer. There could be ties.
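A minimal sketch of point 1, in Python rather than Stata and with made-up data: when many observations are tied at the 1% point, the share at or below that value can be well above 1%. The data and the nearest-rank helper `percentile_nearest_rank` are illustrative assumptions; Stata's own percentile rule differs in detail.

```python
def percentile_nearest_rank(values, p):
    """Return the p-th percentile (integer p) by the nearest-rank rule.

    This rule is an assumption made for illustration only.
    """
    ordered = sorted(values)
    n = len(ordered)
    rank = max(1, (p * n + 99) // 100)  # integer ceiling of p*n/100
    return ordered[rank - 1]

# Made-up data: 1,000 observations, 50 of them tied at the lowest value.
values = [0.0] * 50 + [float(i) for i in range(1, 951)]

p1 = percentile_nearest_rank(values, 1)
n_at_or_below = sum(v <= p1 for v in values)
share = n_at_or_below / len(values)

print(p1, n_at_or_below, share)
```

Here the 1% point is the tied value itself, so dropping everything at or below it removes all 50 tied observations, not 10.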

2. Precision. 0.0388193 can't be held exactly as a binary number.
Perhaps what is reported as that is really something else, e.g.

. di %21x 0.0388193
+1.3e01f8fe83ff0X-005

. di %21x 0.03881931
+1.3e01fe5ce7b79X-005

. di %21x 0.03881929
+1.3e01f3a020467X-005

The number of decimal places you see does not correspond to what Stata holds
in storage.
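The same point can be sketched outside Stata. Python floats are IEEE 754 doubles, so `float.hex()` shows the same bits as the `%21x` display above. The second half assumes, purely for illustration, that the variable is stored in single precision (Stata's default `float` type); the helper `to_float32` is hypothetical.

```python
import struct

threshold = 0.0388193  # the 1% point quoted in the original post

# Hexadecimal form of the double nearest to the typed decimal value.
print(threshold.hex())

def to_float32(x):
    """Round a double to single precision and back (assumed storage type)."""
    return struct.unpack("f", struct.pack("f", x))[0]

stored = to_float32(threshold)

# The single-precision value has fewer significant bits, so it is a
# different number from the double that the typed literal becomes.
print(stored.hex())
print(stored == threshold)
```

If the data are held in single precision, a comparison against a typed decimal constant can therefore include or exclude borderline observations unexpectedly.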

3. You are dropping if and only if _all_ values for each identifier are less
than or equal to your threshold. But that would leave in the data any such
values if there were a greater value for the same identifier. That is, you
are dropping conditionally on the identifier, not unconditionally.
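A minimal sketch of the difference, in Python rather than Stata and with made-up data. Under `bysort entity (rcon1410a):`, `rcon1410a[_N]` is the last observation within each entity, which after sorting is that entity's largest value, so the test is applied per entity. The unconditional version tests each observation on its own value; in Stata that would presumably be a plain `drop if` on the variable without the `bysort` prefix.

```python
from collections import defaultdict

THRESHOLD = 0.0388193

# Made-up (entity, value) observations; not the poster's data.
data = [
    ("A", 0.01), ("A", 0.02),  # every value at or below the threshold
    ("B", 0.01), ("B", 0.50),  # one low value, one high value
    ("C", 0.30), ("C", 0.40),  # no low values
]

# Conditional on the identifier: find each entity's largest value and drop
# the whole entity only if that largest value is at or below the threshold.
largest = defaultdict(lambda: float("-inf"))
for entity, value in data:
    largest[entity] = max(largest[entity], value)
kept_by_entity = [(e, v) for e, v in data if largest[e] > THRESHOLD]

# Unconditional: keep or drop each observation on its own value.
kept_by_obs = [(e, v) for e, v in data if v > THRESHOLD]

print(len(data) - len(kept_by_entity))  # observations dropped per entity
print(len(data) - len(kept_by_obs))     # observations dropped one by one
```

The low value for entity B survives the per-entity rule because B also has a value above the threshold. That is why far fewer observations were dropped than the expected 1%.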

Nick
njcoxstata@gmail.com

On 10 April 2013 11:22, Miguel Angel Duran <maduran@uma.es> wrote:

> Will you please help me to know that what I am doing is right? To
> eliminate outliers, I am trying to drop 1% of the observations with the
> lowest values.
> To do so I use 'bysort entity (rcon1410a): drop if  rcon1410a[_N] <= 
> 0.0388193'. Note that 'entity' is id, 'rcon1410a' is the relevant 
> variable, and 1% of the observations has a value that is lower than 
> 0.0388193 (this value is obtained from 'sum rcon1410a, detail'). Since 
> I have 415,000 observations, I should be dropping 1%*415,000=4,150. 
> Nevertheless, Stata informs me that using the abovementioned command I 
> have dropped 400 observations. Is this all right? Thanks in advance.
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/



