Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down on April 23, and its replacement, **statalist.org** is already up and running.


From: Nick Cox <njcoxstata@gmail.com>
To: "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject: Re: st: Dopping 1% observations, but numbers do not match
Date: Wed, 10 Apr 2013 11:47:25 +0100

Numerous problems here, at least potentially.

0. Dropping outliers defined by an arbitrary threshold is not everyone's idea of good data analysis practice. If you want comments on what is "right", this needs defending.

1. Just because 0.0388193 is reported as the 1% point does not mean that exactly 1% of observations have that value or less, even in a situation where 1% of the number of observations is an integer. There could be ties.

2. Precision. 0.0388193 can't be held exactly as a binary number. Perhaps what is reported as that is really something else, e.g.

    . di %21x 0.0388193
    +1.3e01f8fe83ff0X-005

    . di %21x 0.03881931
    +1.3e01fe5ce7b79X-005

    . di %21x 0.03881929
    +1.3e01f3a020467X-005

The number of decimal places you see does not correspond to what Stata holds in storage.

3. You are dropping if and only if _all_ values for each identifier are less than or equal to your threshold. But that would leave in the data any such values if there were a greater value for the same identifier. That is, you are dropping conditionally on the identifier, not unconditionally.

Nick
njcoxstata@gmail.com

On 10 April 2013 11:22, Miguel Angel Duran <maduran@uma.es> wrote:

> Will you please help me to know that what I am doing is right? To eliminate
> outliers, I am trying to drop 1% of the observations with the lowest values.
> To do so I use 'bysort entity (rcon1410a): drop if rcon1410a[_N] <=
> 0.0388193'. Note that 'entity' is id, 'rcon1410a' is the relevant variable,
> and 1% of the observations has a value that is lower than 0.0388193 (this
> value is obtained from 'sum rcon1410a, detail'). Since I have 415,000
> observations, I should be dropping 1%*415,000=4,150. Nevertheless, Stata
> informs me that using the abovementioned command I have dropped 400
> observations. Is this all right? Thanks in advance.
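Points 2 and 3 of the reply can be made concrete with a short Stata sketch. It is only an illustration, not a recommendation to trim: it assumes that `rcon1410a` is numeric and that dropping the lowest 1% is defensible at all (point 0). The scalar name `p1_cut` is hypothetical. Taking the threshold from `r(p1)` instead of retyping the displayed 0.0388193 avoids comparing against a rounded value, and leaving out the `by:` prefix tests each observation on its own value rather than on the largest value for its `entity`.

```stata
* Sketch only: assumes rcon1410a is numeric and that trimming is justified.
* p1_cut is a hypothetical scalar name chosen for this example.
quietly summarize rcon1410a, detail

* Keep the 1% point at full precision, not the rounded value shown on screen
scalar p1_cut = r(p1)

* Show the value actually held, in hexadecimal
display %21x p1_cut

* Count candidates first; the two counts differ if there are ties at the cut
count if rcon1410a <= p1_cut
count if rcon1410a <  p1_cut

* Unconditional drop: no by: prefix, so each observation is judged alone.
* Missing values compare as larger than any number and are therefore kept.
drop if rcon1410a < p1_cut
```

Whether the number dropped comes to exactly 4,150 still depends on ties at the threshold (point 1), which is why the two `count` commands are run before anything is dropped.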

