# Re: st: transformations for highly skewed dependent variable

From: Austin Nichols
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: transformations for highly skewed dependent variable
Date: Thu, 10 Sep 2009 20:00:51 -0400

```
Michael Crain <michaelcrain@hotmail.com>:
First, the original poster asked about transforming y to change its
distribution, not the distribution of errors or residuals.  If X is a
set of highly skewed lognormal variables, and y=Xb+e, with some
elements of b negative and some positive, it may well be the case that
y has a high peak near zero and long tails, even with e distributed
standard normal, e.g.

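* draw four standard normal regressors and a standard normal error
drawnorm x1 x2 x3 x4 e, clear seed(1) n(1000)
* exponentiate so that each x is a highly skewed lognormal
foreach v of var x* {
    replace `v'=exp(3*`v')
}
g y=x1+x2-x3-x4+e
* density of y itself: high peak near zero, long tails
tw kdensity y, name(y)
qui reg y x*
predict r, res
* density of the residuals from the correctly specified model
tw kdensity r, name(res)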

But you are asking about transforming y after a regression, where you
have looked at the residuals, I think. This still presents problems in
the general case. Note that the transformation is chosen to make the
residuals, not the unobserved errors, look more normal.  So y is
transformed after fitting a model, and that model is not driven by
strong theory (if it were, no transformation would be considered).
This form of specification search
can introduce bias, e.g. if the target is estimating the mean marginal
effect of some X on y.  There are many possible violations of
normality for errors, and only a fraction call for transforming y. If
the model is misspecified, or a regressor is omitted, it's natural to
think residuals need not be normal even if the errors in the true
model are. If errors are non-normal because they are a mixture of
normals, then perhaps heteroskedasticity is the issue, and
transforming y may not produce the desired result at all.  Implicit in
my code snippet in the post you quoted was some kind of categorical
heteroskedasticity (thinking of firms of different sizes earning
average returns with very different distributions from the same
family).

The question "Does the economics field look past some of the GLM
assumptions?" seems to imply some slight on the economics field, as if
econometricians ignore assumptions, when in fact they are if anything
too focused on them.  Or perhaps you had something else in mind
altogether?  And what is off topic?

On Thu, Sep 10, 2009 at 7:17 PM, Michael Crain <michaelcrain@hotmail.com> wrote:
>> Well, sure, there are a lot of possible transformations e.g.
>> arctangent or cube root, but what is the purpose of the
>> transformation?  Are you regressing y on X and thinking the errors
>> won't be normal?  In that case, you may not want to transform y.
>> Also, have you considered that the y~=0 obs might be somehow
>> qualitatively different?  Note that the sd of return should be
>> conditioned on size of investment, at least...
>
> This is a bit off topic. I believe you are suggesting that transforming
> variables to address non-normal errors is not so important in this
> case of an economic data set. Can you explain why? Does the
> economics field look past some of the GLM assumptions?

```
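
A note on the mixture-of-normals point in the reply above: a short worked calculation may help show why group-wise heteroskedasticity on its own can produce a peaked, long-tailed error distribution, even when the errors are normal within every group. This is only a sketch under assumptions that do not appear in the thread: the groups g, their shares p_g, and the standard deviations sigma_g (including the illustrative values 1 and 10) are hypothetical.

```latex
% Hypothetical setup: group g occurs with probability p_g, and within
% group g the error is normal with mean 0 and standard deviation \sigma_g.
\begin{align*}
e \mid g &\sim N(0, \sigma_g^2), \qquad \Pr(g) = p_g \\
E[e^2] &= \sum_g p_g \sigma_g^2 \\
E[e^4] &= 3 \sum_g p_g \sigma_g^4 \\
\operatorname{kurt}(e) &= \frac{E[e^4]}{\left(E[e^2]\right)^2}
  = \frac{3 \sum_g p_g \sigma_g^4}{\left(\sum_g p_g \sigma_g^2\right)^2} \ge 3
\end{align*}
% The inequality is Jensen's; equality holds only if every \sigma_g is the same.
% Illustration: two equally likely groups with \sigma_1 = 1 and \sigma_2 = 10.
\[
\operatorname{kurt}(e)
  = \frac{3 \cdot \tfrac{1}{2}\left(1 + 10^4\right)}{\left(\tfrac{1}{2}\left(1 + 10^2\right)\right)^2}
  = \frac{15001.5}{2550.25} \approx 5.88
\]
```

If something like this is the mechanism, the pooled errors have kurtosis well above the normal value of 3 although nothing is non-normal within a group. In that case, modelling the variance by group would seem to address the problem more directly than transforming y.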