Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


From: Steve Samuels <sjsamuels@gmail.com>

To: statalist@hsphsun2.harvard.edu

Subject: Re: st: using -drop if- with weights

Date: Mon, 6 Sep 2010 18:03:32 -0400

--

One correction: In the -mcd- command the e() option designates the maximum proportion of expected outliers, not the percentage; "e(20)" should have been "e(0.2)", but the program did not complain.

My response didn't directly address Luis's concern about the size of his data set. I think that he can still do robust outlier detection. Graphical methods such as the frequency-weighted histogram will help. So will reducing his data set to the single variable he wants to investigate and converting or compressing it to the minimum possible size. In many cases, for example, decimal places will be irrelevant and little information will be lost by transforming long numbers to integer type, e.g. 14,320,283 -> 14,320. Obviously it is only necessary to identify the smallest outlier in order to know which of the original observations to drop.

I agree with Maarten that automatic rejection of outliers is poor practice.

I have a fairly slow computer (1.5 GHz) and only 2 GB of memory, but I was able to run -mcd- on 8 million observations in about 30 minutes with the code below. (-mcd- takes time because it repeatedly samples the data.) I changed the expected maximum proportion of outliers to a more realistic 10%; a higher percentage only reduces the run time.

Steve

*****************************
clear
clear matrix
set memory 400m
sysuse auto
keep mpg
compress
replace mpg = 50 in 1/5 // 5 new outliers
expand 109000
mcd mpg, e(0.10) gen(outlier dist) setseed(5000)
tab outlier
keep if outlier==1
sum mpg
********************************

On Mon, Sep 6, 2010 at 6:30 AM, Steve Samuels <sjsamuels@gmail.com> wrote:
> --
> Luis must mean "standard deviation", not "standard error", and the SD
> is the statistic that Maarten used. Standard errors are functions of
> sample size, and can be very small, so that almost all observations
> would be dropped. But even with this correction, the process is a very
> bad idea in my opinion. (See section 1.4, p.56 of FR Hampel, et al.
> Robust Statistics: the Approach Based on Influence Functions, Wiley,
> 1986). The standard deviation will be distorted by outliers, making
> detection more difficult, and multiple outliers will mask one another.
> Repeating the process can find and reject new "outliers" at each
> stage, leading to a very unrepresentative sample. Better to use a
> program like the user-written -mcd- or the 20-year-old -iqr- to
> detect outliers (-findit-), even though neither accepts weights.
>
> *****************************
> set more off
> sysuse auto, clear
> sum mpg
> list if abs(mpg-r(mean))>3*r(sd)
> replace mpg = 50 in 1/5 // 5 new outliers
> sum mpg
> list if abs(mpg-r(mean))>3*r(sd) // gone!
>
> capture which mcd
> if _rc net install st0173_1.pkg
>
> mcd mpg, e(20) gen(outlier rdist) setseed(5000)
> list mpg if outlier // found!
> ********************************
> Steve
>
> Steven J. Samuels
> sjsamuels@gmail.com
> 18 Cantine's Island
> Saugerties NY 12477
> USA
> Voice: 845-246-0774
> Fax: 206-202-4783
>
> On Mon, Sep 6, 2010 at 5:15 AM, Maarten buis <maartenbuis@yahoo.co.uk> wrote:
>> --- Luis Armando Galvis writes:
>>> I have a question I am stuck with. I need to drop
>>> observations that are beyond 3 standard errors from the mean
>>> of one of the variables. The problem is that using -drop if-
>>> will eliminate observations without taking into account the
>>> weights and will eliminate more observations than needed. I
>>> cannot expand the dataset to 8 million records because of
>>> memory issues. My question is if there is a way to do this
>>> procedure in a more manageable way.
>>
>> The command -drop- doesn't know about weights, or allow for weights.
>> It doesn't know the mean or standard deviation either, so the
>> problem is not with -drop- but with what you typed before.
>> Since you did not tell us what you typed before, it is hard for
>> us to comment. Also you did not tell us why you think that your
>> command drops too many observations. This can be crucial
>> information, as the rules of thumb about how many observations
>> should be dropped with such a rule are often based on the normal
>> distribution, but if your variable is severely skewed or has a
>> spike then all bets are off when it comes to predicting how many
>> observations will be dropped with such a rule.
>>
>> On a more fundamental note: such automatic deletion of observations
>> is almost always very very very wrong. Almost always it is the
>> exceptions that contain the most information, so we do not want
>> to throw them away. Think about it from a policy point of view, it
>> is usually the exceptions that we want to attain or prevent: We
>> want the population to live long and healthy and be rich, and want
>> to prevent early deaths, illness, and poverty. It is the extremes
>> that contain information on these events, not the "normal"
>> observations.
>>
>> However, technically this is how you can do it:
>>
>> sum var [fw=w]
>> drop if var < r(mean) - 3*r(sd) | var > r(mean) + 3*r(sd)
>>
>> (assuming that your variable is called var and your weight
>> is called w)
>>
>> Hope this helps,
>> Maarten
>>
>> --------------------------
>> Maarten L. Buis
>> Institut fuer Soziologie
>> Universitaet Tuebingen
>> Wilhelmstrasse 36
>> 72074 Tuebingen
>> Germany
>>
>> http://www.maartenbuis.nl
>> --------------------------

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
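The advice in this thread (keep only the variable of interest, use the weights instead of -expand-, look at a frequency-weighted histogram, and prefer a robust rule to the 3-SD rule) can be sketched in a few lines. This is an illustration only, not code from any of the posters: the quartile fence (1.5 times the interquartile range) is swapped in here as one simple robust rule that works with frequency weights, and the variable nrep and the scalars q1, q3, fence_lo and fence_hi are hypothetical names. It assumes each row carries an integer frequency weight.

```stata
* Sketch only: a frequency-weighted quartile fence, not a method from the thread.
* nrep is a hypothetical frequency weight standing in for the real weight variable.
sysuse auto, clear
keep mpg
replace mpg = 50 in 1/5            // plant 5 outliers, as in the thread
generate long nrep = 109000        // assumed: each row stands for 109,000 cases

* Inspect the weighted distribution before deciding anything
histogram mpg [fw=nrep], discrete

* Weighted quartiles, computed without -expand-
summarize mpg [fw=nrep], detail
scalar q1 = r(p25)
scalar q3 = r(p75)
scalar fence_lo = scalar(q1) - 1.5*(scalar(q3) - scalar(q1))
scalar fence_hi = scalar(q3) + 1.5*(scalar(q3) - scalar(q1))

* Flag, rather than drop, the values outside the fence
generate byte flag = (mpg < scalar(fence_lo) | mpg > scalar(fence_hi)) if !missing(mpg)
tabulate flag [fw=nrep]
list mpg nrep if flag == 1
```

Flagging instead of dropping keeps the decision in the analyst's hands, in line with Maarten's and Steve's warnings about automatic rejection; the flagged rows can then be examined one by one.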

**References**:

- **st: using -drop if- with weights** *From:* Nick Cox <n.j.cox@durham.ac.uk>
- **Re: st: using -drop if- with weights** *From:* Maarten buis <maartenbuis@yahoo.co.uk>
- **Re: st: using -drop if- with weights** *From:* Steve Samuels <sjsamuels@gmail.com>
