Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Nick Cox <njcoxstata@gmail.com> |
To | "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu> |
Subject | Re: st: Dropping 1% observations, but numbers do not match |
Date | Wed, 10 Apr 2013 17:16:01 +0100 |
Now I think you are asking for

    su rcon1410a, detail
    scalar p1 = r(p1)
    bysort entity (rcon1410a): drop if rcon1410a[1] <= p1

Necessary that all are <= x implies max <= x
Sufficient that one is <= x implies min <= x.

Nick
njcoxstata@gmail.com

On 10 April 2013 16:58, Miguel Angel Duran <maduran@uma.es> wrote:
> Thanks again, Nick. I have read your paper: very clear and helpful.
> What you suggest (simplifying: drop if Var <= Number) is a very
> straightforward way to get rid of values, but what I want to do is to
> eliminate all the subjects that, in a panel data set, have a value
> lower than a threshold in at least one period.
>
> Miguel.
>
> -----Original Message-----
> From: owner-statalist@hsphsun2.harvard.edu
> [mailto:owner-statalist@hsphsun2.harvard.edu] On behalf of Nick Cox
> Sent: Wednesday, 10 April 2013 13:15
> To: statalist@hsphsun2.harvard.edu
> Subject: Re: st: Dropping 1% observations, but numbers do not match
>
> It's not a good idea to use code you don't understand!
>
> I understand you as indicating that you are unclear about what [_N]
> implies under -by:-. My numbered point #3 put it in words. I wrote a
> tutorial which is easily accessible (there's a .pdf online, as below),
> so I won't add to what I have written.
>
> SJ-2-1  pr0004  . . . . . . . Speaking Stata: How to move step by: step
>         . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
>         Q1/02   SJ 2(1):86--102                          (no commands)
>         explains the use of the by varlist : construct to tackle
>         a variety of problems with group structure, ranging from
>         simple calculations for each of several groups to more
>         advanced manipulations that use the built-in _n and _N
>
> http://www.stata-journal.com/sjpdf.html?articlenum=pr0004
>
> Technique for what you are asking is exemplified by
>
> sysuse auto
> su mpg, detail
> scalar p1 = r(p1)
> count if mpg <= p1
> drop if mpg <= p1
>
> but I can't write it down without flagging that I don't recommend
> -drop-ping like this.
>
> Nick
> njcoxstata@gmail.com
>
> On 10 April 2013 12:01, Miguel Angel Duran <maduran@uma.es> wrote:
>> Thank you very much, Nick, for your quick answer. Just one additional
>> question, if you don't mind. How would I drop unconditionally on the
>> identifier? And in relation to this (given your answer, just to be
>> sure I got it right), when an expression like "var[_N]" is used, what
>> exactly does it mean?
>
> Nick Cox
>
>> Numerous problems here, at least potentially.
>>
>> 0. Dropping outliers defined by an arbitrary threshold is not
>> everyone's idea of good data analysis practice. If you want comments
>> on what is "right", this needs defending.
>>
>> 1. Just because 0.0388193 is reported as the 1% point does not mean
>> that exactly 1% of observations have that value or less, even in a
>> situation where 1% of the number of observations is an integer.
>> There could be ties.
>>
>> 2. Precision. 0.0388193 can't be held exactly as a binary number.
>> Perhaps what is reported as that is really something else, e.g.
>>
>> . di %21x 0.0388193
>> +1.3e01f8fe83ff0X-005
>>
>> . di %21x 0.03881931
>> +1.3e01fe5ce7b79X-005
>>
>> . di %21x 0.03881929
>> +1.3e01f3a020467X-005
>>
>> The number of decimal places you see does not correspond to what Stata
>> holds in storage.
>>
>> 3. You are dropping if and only if _all_ values for each identifier
>> are less than or equal to your threshold. But that would leave in the
>> data any such values if there were a greater value for the same
>> identifier. That is, you are dropping conditionally on the
>> identifier, not unconditionally.
>>
>> Nick
>> njcoxstata@gmail.com
>>
>> On 10 April 2013 11:22, Miguel Angel Duran <maduran@uma.es> wrote:
>>
>>> Could you please help me check whether what I am doing is right? To
>>> eliminate outliers, I am trying to drop the 1% of the observations
>>> with the lowest values.
>>> To do so I use 'bysort entity (rcon1410a): drop if rcon1410a[_N] <=
>>> 0.0388193'. Note that 'entity' is id, 'rcon1410a' is the relevant
>>> variable, and 1% of the observations have a value that is lower than
>>> 0.0388193 (this value is obtained from 'sum rcon1410a, detail').
>>> Since I have 415,000 observations, I should be dropping
>>> 1%*415,000=4,150. Nevertheless, Stata informs me that using the
>>> abovementioned command I have dropped 400 observations. Is this all
>>> right?
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/
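
The min/max logic in the most recent reply of this thread (sort within entity, then test the first value rather than the last) can be tried on a small made-up panel. This is only a minimal sketch under stated assumptions: the data are simulated, and the names x, anylow and xmin are hypothetical, not from the original posts. It flags every entity with at least one value at or below the 1% point and cross-checks the flag against the per-entity minimum from -egen- before dropping.

```stata
* Toy panel (assumption: simulated data, 5 entities x 4 periods)
clear
set seed 2013
set obs 20
gen entity = ceil(_n / 4)
gen double x = runiform()

* 1% point of x over all observations, kept at full precision
summarize x, detail
scalar p1 = r(p1)

* Within each entity, sorted on x, x[1] is the smallest value.
* If even the smallest is <= p1, at least one value is <= p1.
bysort entity (x): gen byte anylow = x[1] <= p1

* Cross-check: the flag should agree with the per-entity minimum
egen double xmin = min(x), by(entity)
assert anylow == (xmin <= p1)

* Drop every observation of any flagged entity
drop if anylow
```

Testing x[_N] instead of x[1] would compare the per-entity maximum with the threshold, so an entity would be dropped only when all of its values are at or below p1. Using the scalar p1 rather than a typed-in number such as 0.0388193 also avoids the precision problem raised in point 2.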