Re: st: using -drop if- with weights
Steve Samuels <email@example.com>
Mon, 6 Sep 2010 18:03:32 -0400
One correction: In the -mcd- command the e() option designates the
maximum proportion of expected outliers, not the percentage; "e(20)"
should have been "e(0.2)", but the program did not complain.
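As a minimal sketch, the corrected call for the auto-data example quoted further down would look like the following. It assumes -mcd- is already installed, and it changes only the e() value; every other option is copied from the original code.

```stata
* Corrected call: e() takes a proportion, so 20% is written as 0.2
sysuse auto, clear
replace mpg = 50 in 1/5          // 5 new outliers, as in the quoted example
mcd mpg, e(0.2) gen(outlier rdist) setseed(5000)
list mpg if outlier              // show the flagged observations
```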
My response didn't directly address Luis's concern about the size of
his data set. I think that he can still do robust outlier detection.
Graphical methods such as a frequency-weighted histogram will
help. So will reducing his data set to the single variable he wants to
investigate and converting or compressing it to the minimum possible
size. In many cases, for example, decimal places will be irrelevant
and little information will be lost by transforming long numbers to
integer type, e.g. 14,320,283 -> 14,320. Obviously it is only
necessary to identify the smallest outlier in order to know which of
the original observations to drop. I agree with Maarten that automatic
rejection of outliers is poor practice.
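The data reduction described above might be sketched as follows. The file name mydata.dta, the variable name income, and the frequency-weight name w are hypothetical placeholders, not names from Luis's data.

```stata
* Hypothetical names: mydata.dta, income (variable of interest), w (frequency weight)
use income w using mydata.dta, clear   // load only the two variables needed
replace income = round(income/1000)    // e.g. 14,320,283 -> 14,320
compress                               // store each variable in the smallest type that fits
histogram income [fw=w], frequency     // frequency-weighted histogram
```

Rescaling before -compress- is what lets a long or double variable drop to a smaller integer type; whether the lost digits matter depends on the variable.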
I have a fairly slow computer (1.5 GHz) and only 2G memory, but I was
able to run -mcd- on 8 million observations in about 30 minutes with
the code below. (-mcd- takes time because it repeatedly samples the
data.) I changed the expected maximum proportion of outliers to a more
realistic 10%; a higher percentage only reduces the run time.
set memory 400m
replace mpg = 50 in 1/5 //5 new outliers
mcd mpg, e(0.10) gen(outlier dist) setseed(5000)
keep if outlier==1
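Once only the flagged observations remain, the smallest flagged value can be read off and applied to the original data. The sketch below assumes, as in this example, that all outliers lie in the upper tail; the file name original.dta is hypothetical. It lists rather than drops, so the flagged cases can be inspected first.

```stata
* Run after -keep if outlier==1-, when only flagged observations remain
summarize mpg
scalar cutoff = r(min)                  // smallest flagged value
display "smallest flagged mpg = " scalar(cutoff)

* Hypothetical follow-up on the original data (file name is an assumption)
use original.dta, clear
list if mpg >= scalar(cutoff) & !missing(mpg)
```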
On Mon, Sep 6, 2010 at 6:30 AM, Steve Samuels <firstname.lastname@example.org> wrote:
> Luis must mean "standard deviation", not "standard error", and the SD
> is the statistic that Maarten used. Standard errors are functions of
> sample size, and can be very small, so that almost all observations
> would be dropped. But even with this correction, the process is a very
> bad idea in my opinion. (See section 1.4, p.56 of FR Hampel, et al.
> Robust Statistics: the Approach Based on Influence Functions, Wiley,
> 1986). The standard deviation will be distorted by outliers, making
> detection more difficult, and multiple outliers will mask one another.
> Repeating the process can find and reject new "outliers" at each
> stage, leading to a very unrepresentative sample. Better to use a
> program like the user-written -mcd- or the 20-year-old -iqr- to
> detect outliers (-findit-), even though neither accepts weights.
> set more off
> sysuse auto, clear
> sum mpg
> list if abs(mpg-r(mean))>3*r(sd)
> replace mpg = 50 in 1/5 // 5 new outliers
> sum mpg
> list if abs(mpg-r(mean))>3*r(sd) // gone!
> capture which mcd
> if _rc net install st0173_1.pkg
> mcd mpg, e(20) gen(outlier rdist) setseed(5000)
> list mpg if outlier //found!
> Steven J. Samuels
> 18 Cantine's Island
> Saugerties NY 12477
> Voice: 845-246-0774
> Fax: 206-202-4783
> On Mon, Sep 6, 2010 at 5:15 AM, Maarten buis <email@example.com> wrote:
>> --- Luis Armando Galvis writes:
>>> I have a question I am stuck with. I need to drop
>>> observations that are beyond 3 standard errors from the mean
>>> of one of the variables. The problem is that using -drop if-
>>> will eliminate observations without taking into account the
>>> weights and will eliminate more observations than needed. I
>>> cannot expand the dataset to 8 million records because of
>>> memory issues. My question is if there is a way to do this
>>> procedure in a more manageable way.
>> The command -drop- doesn't know about weights or allow for them.
>> It doesn't know the mean or standard deviation either, so the
>> problem is not with -drop- but with what you typed before.
>> Since you did not tell us what you typed before, it is hard for
>> us to comment. Also you did not tell us why you think that your
>> command drops too many observations. This can be crucial
>> information, as the rules of thumb about how many observations
>> should be dropped with such a rule are often based on the normal
>> distribution, but if your variable is severely skewed or has a
>> spike then all bets are off when it comes to predicting how many
>> observations will be dropped with such a rule.
>> On a more fundamental note: such automatic deletion of observations
>> is almost always very very very wrong. Almost always it is the
>> exceptions that contain the most information, so we do not want
>> to throw them away. Think about it from a policy point of view: it
>> is usually the exceptions that we want to attain or prevent. We
>> want the population to live long and healthy and be rich, and want
>> to prevent early deaths, illness, and poverty. It is the extremes
>> that contain information on these events, not the "normal"
>> However, technically this is how you can do it:
>> sum var [fw=w]
>> drop if var < r(mean) - 3*r(sd) | var > r(mean) + 3*r(sd)
>> (assuming that your variable is called var and your weight
>> is called w)
>> Hope this helps,
>> Maarten L. Buis
>> Institut fuer Soziologie
>> Universitaet Tuebingen
>> Wilhelmstrasse 36
>> 72074 Tuebingen