Re: st: Determining h or p in winsor command

From   Nick Cox <[email protected]>
To   [email protected]
Subject   Re: st: Determining h or p in winsor command
Date   Fri, 30 Sep 2011 00:10:40 +0100

As you say, I wrote this (on SSC), but I don't use it. So, I suspect
that someone asked for a program and it seemed an easy thing to
provide. Download data suggest that it is one of my most popular
packages, but it doesn't generate much correspondence either privately
or onlist, so I have no real idea how it is used. My vague impression
is that Winsorizing is most used by economists.

I find the whole idea of automating it fairly dubious. If anything,
the most responsible way to use it would be to try different amounts
of Winsorizing to see how sensitive results are to that choice. If it
turned out that they weren't, that would end the story. If they were,
it would be a matter of thinking where to go next.
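
As a rough illustration of that kind of sensitivity check, here is a
minimal sketch in Python rather than Stata. It is not the -winsor-
code itself: the function name winsorize, the toy data, and the choice
to round p * n down to get h are all assumptions made for the example.
The definition used is the usual one, in which the h lowest values are
replaced by the next lowest value and the h highest by the next
highest.

```python
import math
from statistics import mean


def winsorize(values, h=None, p=None):
    """Return a copy of values with h observations Winsorized in each tail.

    Exactly one of h (a count per tail) or p (a fraction per tail) is given.
    Assumption for this sketch: p is converted to a count with floor(p * n).
    """
    if (h is None) == (p is None):
        raise ValueError("specify exactly one of h or p")
    n = len(values)
    if h is None:
        h = math.floor(p * n)
    if h < 0 or 2 * h >= n:
        raise ValueError("h must satisfy 0 <= h < n / 2")
    ordered = sorted(values)
    low = ordered[h]           # smallest value that is kept as is
    high = ordered[n - h - 1]  # largest value that is kept as is
    # Pull every value outside [low, high] in to the nearer limit.
    return [min(max(v, low), high) for v in values]


# Toy data (made up): one very large value in the upper tail.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

# Try several amounts of Winsorizing and watch how a summary changes.
for h in range(4):
    print(h, mean(winsorize(data, h=h)))
```

With these made-up numbers the mean changes sharply between h = 0 and
h = 1 and then stays put, which is the pattern to look for: once the
result stops moving, the exact choice of h or p matters little.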

The approach Amanda mentions is to Winsorize using a convention
commonly used to identify "outliers" on boxplots (although, to start a
different topic, I now tend to prefer identifying points beyond the
1/99 or 5/95 percentiles on the odd occasion I draw boxplots). That
seems fairly drastic to me.
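
For concreteness, the boxplot-style rule described in the quoted
question can be sketched as follows, again in Python and under stated
assumptions: hinges are computed as the medians of the lower and upper
halves of the sorted data (sharing the middle value when n is odd),
and the names tukey_fences and clamp_to_fences are hypothetical. The
sketch also counts how many values fall beyond each fence. Those
counts need not be equal in the two tails, whereas a single h or p
applies to both tails at once.

```python
from statistics import median


def tukey_fences(values):
    """Return (lower, upper) fences: hinges -/+ 1.5 * spread between hinges."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2  # the middle value is shared when n is odd
    lower_hinge = median(ordered[:half])
    upper_hinge = median(ordered[n - half:])
    spread = upper_hinge - lower_hinge
    return lower_hinge - 1.5 * spread, upper_hinge + 1.5 * spread


def clamp_to_fences(values):
    """Set every value beyond a fence to that fence."""
    lower, upper = tukey_fences(values)
    return [min(max(v, lower), upper) for v in values]


# Toy data (made up): one very large value in the upper tail.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

lower, upper = tukey_fences(data)
n_below = sum(v < lower for v in data)
n_above = sum(v > upper for v in data)
print(lower, upper, n_below, n_above)
print(clamp_to_fences(data))
```

Note that this rule replaces extreme values with the fence itself,
which is generally not an observed value, whereas Winsorizing by h or
p replaces them with the nearest value that is retained.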

If outliers are a concern, and appear genuine, I would usually prefer
to think of a non-identity link function (for responses) or
transformation (for predictors), which usually seems natural on other
grounds anyway.


On Thu, Sep 29, 2011 at 8:17 PM, Amanda Balzer
<[email protected]> wrote:
> I am using Nick Cox's -winsor- command to clean outliers in my data and wondered what the rule of thumb is in determining either the p (fraction of observations) or h (number of observations) to enter in the command to be winsorized. Do you simply view scatterplots and count observations? This seems problematic with large datasets.
> When calculating the values for winsorizing by hand (which I was doing before this command), I would simply set all values greater/less than the upper and lower Tukey's hinges +/- 1.5*spread to the said value. The -winsor- command does a similar computation but doesn't automatically set the too high and too low values to the determined minimum and maximum. How does one determine p or h?
