

From: Nick Cox <njcoxstata@gmail.com>
To: "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject: Re: st: Dropping 1% observations, but numbers do not match
Date: Wed, 10 Apr 2013 12:15:05 +0100

It's not a good idea to use code you don't understand! I understand you as indicating that you are unclear about what [_N] implies under -by:-. My numbered point #3 put it in words. I wrote a tutorial which is easily accessible (there's a .pdf online, as below), so I won't add to what I have written.

SJ-2-1  pr0004  . . . . . . . . . .  Speaking Stata: How to move step by: step
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  N. J. Cox
        Q1/02   SJ 2(1):86--102                                 (no commands)
        explains the use of the by varlist : construct to tackle
        a variety of problems with group structure, ranging from
        simple calculations for each of several groups to more
        advanced manipulations that use the built-in _n and _N

http://www.stata-journal.com/sjpdf.html?articlenum=pr0004

Technique for what you are asking is exemplified by

    sysuse auto
    su mpg, detail
    scalar p1 = r(p1)
    count if mpg <= p1
    drop if mpg <= p1

but I can't write it down without flagging that I don't recommend -drop-ping like this.

Nick
njcoxstata@gmail.com

On 10 April 2013 12:01, Miguel Angel Duran <maduran@uma.es> wrote:

> Thank you very much, Nick, for your quick answer. Just one additional
> question, if you don't mind. How would I drop unconditionally on the
> identifier? And in relation to this (given your answer, just to be sure I
> got it right), when an expression like "var[_N]" is used, what exactly
> does it mean?
>
> Nick Cox
>
> Numerous problems here, at least potentially.
>
> 0. Dropping outliers defined by an arbitrary threshold is not everyone's
> idea of good data analysis practice. If you want comments on what is
> "right", this needs defending.
>
> 1. Just because 0.0388193 is reported as the 1% point does not mean that
> exactly 1% of observations have that value or less, even in a situation
> where 1% of the number of observations is an integer. There could be ties.
>
> 2. Precision. 0.0388193 can't be held exactly as a binary number.
> Perhaps what is reported as that is really something else, e.g.
>
> . di %21x 0.0388193
> +1.3e01f8fe83ff0X-005
>
> . di %21x 0.03881931
> +1.3e01fe5ce7b79X-005
>
> . di %21x 0.03881929
> +1.3e01f3a020467X-005
>
> The number of decimal places you see does not correspond to what Stata
> holds in storage.
>
> 3. You are dropping if and only if _all_ values for each identifier are
> less than or equal to your threshold. But that would leave in the data any
> such values if there were a greater value for the same identifier. That
> is, you are dropping conditionally on the identifier, not unconditionally.
>
> Nick
> njcoxstata@gmail.com
>
> On 10 April 2013 11:22, Miguel Angel Duran <maduran@uma.es> wrote:
>
>> Will you please help me to know whether what I am doing is right? To
>> eliminate outliers, I am trying to drop the 1% of the observations with
>> the lowest values. To do so I use 'bysort entity (rcon1410a): drop if
>> rcon1410a[_N] <= 0.0388193'. Note that 'entity' is the id, 'rcon1410a'
>> is the relevant variable, and 1% of the observations have a value that
>> is lower than 0.0388193 (this value is obtained from 'sum rcon1410a,
>> detail'). Since I have 415,000 observations, I should be dropping
>> 1%*415,000=4,150. Nevertheless, Stata informs me that using the
>> abovementioned command I have dropped 400 observations. Is this all
>> right?
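The difference described in point 3, between dropping conditionally on the identifier and dropping unconditionally, can be seen on a small example. What follows is a minimal sketch and not part of the original exchange: it assumes the auto dataset shipped with Stata, uses rep78 as a stand-in for the identifier, takes 20 as an arbitrary cutoff in place of the 1% point, and all_low is a hypothetical variable name. It only counts observations, so nothing is dropped.

```stata
* Assumptions (not from the thread): auto data, rep78 as a stand-in id,
* 20 as an arbitrary cutoff, all_low as a hypothetical flag variable.
sysuse auto, clear

* Within each rep78 group, sorted on mpg, mpg[_N] is the last and hence
* the largest mpg in that group.
bysort rep78 (mpg): generate byte all_low = mpg[_N] <= 20

* Conditional on the group: counts observations only in groups whose
* largest mpg is at or below the cutoff.
count if all_low

* Unconditional: counts every observation at or below the cutoff,
* whatever group it belongs to.
count if mpg <= 20
```

Since all_low is 1 only when the largest mpg in a group is at or below the cutoff, the first count can never exceed the second. That is consistent with the -bysort- command in the question dropping far fewer observations than expected.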
