Notice: On April 23, 2014, Statalist moved from an email list to a forum.

Re: st: using -drop if- with weights

From   Steve Samuels
Subject   Re: st: using -drop if- with weights
Date   Mon, 6 Sep 2010 06:30:35 -0400

Luis must mean "standard deviation", not "standard error", and the SD
is the statistic that Maarten used. Standard errors shrink as the
sample size grows and can be very small, so a standard-error cutoff
would drop almost all observations. But even with this correction, the
process is a very bad idea in my opinion. (See section 1.4, p. 56 of
F. R. Hampel et al., Robust Statistics: The Approach Based on Influence
Functions, Wiley, 1986.) The standard deviation will be distorted by
outliers, making detection more difficult, and multiple outliers will
mask one another. Repeating the process can find and reject new
"outliers" at each stage, leading to a very unrepresentative sample.
Better to use a program like the user-written -mcd- or the 20-year-old
-iqr- to detect outliers (find both with -findit-), even though
neither accepts weights.

set more off
sysuse auto, clear
sum mpg
* flag values more than 3 SDs from the mean
list if abs(mpg-r(mean)) > 3*r(sd)
replace mpg = 50 in 1/5 // 5 new outliers
sum mpg
list if abs(mpg-r(mean)) > 3*r(sd) // gone!

capture which mcd
if _rc net install st0173_1.pkg

mcd mpg, e(20) gen(outlier rdist) setseed(5000)
list mpg if outlier //found!
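
Since neither -mcd- nor -iqr- accepts weights, here is a minimal sketch of one weighted alternative, offered as an illustration and not as a replacement for either program. It uses Tukey-style fences (quartiles plus or minus 1.5 times the interquartile range) computed from -summarize, detail-, which does accept frequency weights. The weight variable fwt and the scalar names lofence and hifence are invented for this example, and the 1.5 multiplier is a convention, not something taken from the discussion above. The sketch flags observations for inspection; it does not drop them.

```stata
* Sketch only: weighted quartile fences as a robust screen for outliers.
* fwt is a made-up frequency weight so that the example runs on auto.dta.
sysuse auto, clear
generate int fwt = 1 + mod(_n, 3)     // hypothetical weights: 1, 2, or 3
replace mpg = 50 in 1/5               // same planted outliers as above
summarize mpg [fw=fwt], detail        // weighted quartiles in r(p25), r(p75)
scalar lofence = r(p25) - 1.5*(r(p75) - r(p25))
scalar hifence = r(p75) + 1.5*(r(p75) - r(p25))
* !missing() keeps missing values (which sort above any number) unflagged
generate byte flag = (mpg < scalar(lofence) | mpg > scalar(hifence)) & !missing(mpg)
list make mpg fwt if flag
```

Because quartiles move very little when a few extreme values are added, the fences are much less prone to the masking seen with the mean and SD.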

Steven J. Samuels
18 Cantine's Island
Saugerties NY 12477
Voice: 845-246-0774
Fax:    206-202-4783

On Mon, Sep 6, 2010 at 5:15 AM, Maarten Buis wrote:
> ---  Luis Armando Galvis writes:
>> I have a question I am stuck with. I need to drop
>> observations that are beyond 3 standard errors from the mean
>> of one of the variables. The problem is that using -drop if-
>> will eliminate observations without taking into account the
>> weights and will eliminate more observations than needed. I
>> cannot expand the dataset to 8 million records because of
>> memory issues. My question is if there is a way to do this
>> procedure in a more manageable way.
> The command -drop- doesn't know about weights, nor does it allow them.
> It doesn't know the mean or standard deviation either, so the
> problem is not with -drop- but with what you typed before.
> Since you did not tell us what you typed before, it is hard for
> us to comment. Also you did not tell us why you think that your
> command drops too many observations. This can be crucial
> information, as the rules of thumb about how many observations
> should be dropped with such a rule are often based on the normal
> distribution, but if your variable is severely skewed or has a
> spike then all bets are off when it comes to predicting how many
> observations will be dropped with such a rule.
> On a more fundamental note: such automatic deletion of observations
> is almost always very very very wrong. Almost always it is the
> exceptions that contain the most information, so we do not want
> to throw them away. Think about it from a policy point of view, it
> is usually the exceptions that we want to attain or prevent: We
> want the population to live long and healthy and be rich, and want
> to prevent early deaths, illness, and poverty. It is the extremes
> that contain information on these events, not the "normal"
> observations.
> However, technically this is how you can do it:
> sum var [fw=w]
> drop if var < r(mean) - 3*r(sd) | var > r(mean) + 3*r(sd)
> (assuming that your variable is called var and your weight
> is called w)
> Hope this helps,
> Maarten
> --------------------------
> Maarten L. Buis
> Institut fuer Soziologie
> Universitaet Tuebingen
> Wilhelmstrasse 36
> 72074 Tuebingen
> Germany
> --------------------------
