The Stata listserver

st: RE: winsorization and normality

From   "Nick Cox"
Subject   st: RE: winsorization and normality
Date   Wed, 23 Jun 2004 00:33:13 +0100

Dear Gary: 

By accident or design you reply to my reply, but 
you don't focus on the kind of issue it raises. 

As I understand it, you can reduce your problem of 
non-normality by attacking the parts of the 
data you find least convenient and changing 
them! The ancient myth of the hotelier Procrustes 
who chopped and stretched his unfortunate guests to fit
the beds on offer springs to mind. What's uppermost 
here, jumping through hoops to attain respectable 
P-values, or trying to promote statistical science? 

Put in more conventional and less histrionic terms, 
what precisely is the non-normality "problem" you have? 
A simple example, nothing to do with residuals or time 
series, but illustrative of the key difficulty,  
is provided by the auto data. If you go 

foreach v of var price-gear_ratio { 
	swilk `v' 
}

you will see that various variables qualify as non-normal 
according to conventional significance levels. But
this means mostly that the sample size is large enough 
to detect some non-normality, not that the non-normality
is large enough to be problematic for any purpose
of data analysis. (In other words, the results exemplify 
a standard limitation of significance tests.) In fact, 
to pick up one example, a careful look at -gear_ratio- 
by e.g. 

qnorm gear_ratio 

shows that despite the P-value of 0.01525 this 
variable has a distribution which in practice 
would not be problematic if it were a distribution 
of residuals. (The P-value I put down partly to some 
granularity, certainly not outliers or fat tails.) 
And the n = 74 of the auto data is pretty modest 
by most people's standards: the issue will be 
compounded in larger datasets. My guess is 
that with your kind of data you have a much 
larger n.
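
As a rough illustration of how much the verdict of such a test depends on n, here is a minimal sketch that is not part of the original exchange. It assumes a recent Stata in which -rt()- is available; the t distribution with 20 degrees of freedom, the seed and the two sample sizes are arbitrary choices made purely for illustration.

```stata
* Hypothetical simulation: mildly long-tailed data, tested at two sample sizes
clear
set seed 2004
set obs 2000                      // swilk accepts at most 2000 observations
generate double y = rt(20)        // t with 20 df: close to, but not exactly, normal

swilk y in 1/74                   // same n as the auto data
swilk y                           // all 2000 observations

qnorm y                           // judge the size of the departure graphically
```

Setting the two P-values beside the -qnorm- plot is one way to see whether a rejection reflects a departure large enough to matter, or mostly a sample large enough to detect it.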

Incidentally, chopping according to 
a multiple of the SD is not Winsorization, 
as I pointed out on Sunday in reply 
to a previous posting of yours. More 
importantly, replacing a distribution 
longer-tailed than normal with one 
shorter-tailed than normal may well lead 
to rejections of normality too, depending
precisely on what test you are using... 
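
To make the distinction concrete, here is a minimal sketch using the auto data; it is an illustration only, not code from the thread. Winsorizing proper replaces a fixed number of values in each tail with the next value counting inwards, whereas the other procedure clips at mean +/- some multiple of the SD. The count of 4 values per tail, the multiplier of 2 and the variable names price_w and price_c are all assumptions chosen for the example.

```stata
sysuse auto, clear

* Winsorizing: replace the k lowest values by the (k+1)th lowest,
* and the k highest values by the (k+1)th highest
local k = 4                               // hypothetical: about 5% of n = 74 per tail
sort price
generate double price_w = price
replace price_w = price[`k' + 1] in 1/`k'
local lo = _N - `k' + 1
replace price_w = price[_N - `k'] in `lo'/L

* Clipping at mean +/- 2 SD, which is NOT Winsorizing
* (the multiplier 2 is an assumption for illustration)
summarize price
local lower = r(mean) - 2 * r(sd)
local upper = r(mean) + 2 * r(sd)
generate double price_c = min(max(price, `lower'), `upper')

* Compare how each version fares on the same test
swilk price price_w price_c
```

Neither modification is offered here as a recommendation; the sketch only shows that the two procedures are different operations and that their effect on a normality test can be checked directly.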


gary tian
> Further to John's question regarding trimming, I would like to raise
> the following question to seek your help.
> I am testing cointegration and causality for daily returns of share
> index time series (first log differences), based on a VAR model.
> Whatever lag I use for each variable, a residual test still finds
> non-normality. I applied a sort of winsorization in which returns
> outside the range [mean +/- standard deviations] are replaced with
> these boundary values. The non-normality problem was largely reduced
> but still existed. The second method, which I found more effective,
> is using monthly and quarterly data; the problem is losing the
> original meaning of integration in a precise number of days. Are
> these standard ways to treat the problem, or is there any other
> better way?

Nick Cox
> I guess there's a literature on this somewhere,
> but it doesn't seem that trimming of tails
> before regression ever caught on as standard practice
> (unless there's a subdiscipline that does it all the
> time, as a living refutation of this guess).
> The key question to me is what is your underlying
> problem? Worrying about long tails is often
> best met by quantile or robust regression or using
> transformations or non-identity link functions.
> Far simpler and better supported than tinkering
> with the tails...

Rijo John
> >  I have a data set with quite a few outliers. Suppose I am trimming
> > my dependent variable 1% each from top and bottom using the 1st and
> > 99th percentiles. And I have the regression estimates before and
> > after trimming. Let us also suppose that some of the variables that
> > were significant before trimming turned out to be insignificant
> > after trimming and/or vice versa.
> >
> >  Is there a standard way by which one can decide what percentage of
> > the data should be trimmed? Is a Chow test for the equality of
> > coefficients enough for this? I mean, trim up to the point where the
> > changes in coefficients become insignificant? Or is there any other
> > standard way to do this?
