Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# Re: st: OLS assumptions not met: transformation, gls, or glm as solutions?

- From: David Hoaglin
- To: statalist@hsphsun2.harvard.edu
- Subject: Re: st: OLS assumptions not met: transformation, gls, or glm as solutions?
- Date: Mon, 17 Dec 2012 07:33:13 -0500

```
Laura,

When you plotted the dependent variable against the predictor
variables, what patterns of curvature (if any) did you see?  You
didn't mention the number of observations.  If it is large, you may
want to use LOWESS to trace smooth curves through those plots.

You can also look for curvature in the plots of the studentized
residuals against the individual predictor variables, and a plot of
those residuals against the predicted values will give you information
on the pattern of heteroskedasticity.

Often, transforming the dependent variable helps to straighten the
relations between the dependent variable and the predictors, AND it
also stabilizes the variability in the dependent variable.  It is
likely that the variability in the number of minutes spent on the
activity increases as the expected number of minutes increases.

Two other transformations to consider are the square root and the
reciprocal.  (The reciprocal would transform slowness into fastness.)
If the logarithm is the most reasonable choice, it is not necessary to
make interpretation more difficult by using the natural log.  Use logs
base 10 instead.  With either base, interpretation is in terms of
ratios, which is often not difficult.

After a suitable transformation you may have fewer outliers (or none).
You should be cautious in excluding outliers and, especially,
influential observations.

If you included the zeros and used a tobit model, you would still have
to do something about curvature and heteroskedasticity.

David Hoaglin

On Mon, Dec 17, 2012 at 5:43 AM, Laura R. <laura.roh@googlemail.com> wrote:
> Dear Stata users,
>
> I estimated an OLS model with the number of minutes (1-1440) spent on
> an activity on a day as dependent variable. At first sight, the model
> works fine. I receive some interesting results which are robust across
> model specifications. I would like to keep it as it is, but:
>
> - The regression diagnostics show that the error terms are not
> normally distributed, but right skewed.
>
> - In addition, there is heteroskedasticity.
>
> Excluding outliers and influential cases does not help. Now I can
> think about 4 solutions, but I am not sure when it is justified to
> decide on one of these:
>
> 1. Keep the model and the variables as they are (but maybe use robust
> standard errors) - is this possible under certain conditions, even if
> I have heteroskedasticity and non-normality of residuals, and when is
> this justified?
>
> 2. Transform the dependent variable. If I take the ln of the dependent
> variable, the residuals get closer to a normal distribution, and it
> gets closer to homoskedasticity. But then there is the problem of
> interpreting the results.
>
> 3. Generalised least square model (gls): Use this instead. This is a
> solution to heteroskedasticity, but do the residuals have to be
> normally distributed in gls as well? What other new assumptions of gls
> might cause new problems (pros/cons gls vs. OLS)? And how can I do
> this in Stata? (Somehow with calculating a weight, I think...)
>
> 4. Generalised linear model (glm): In some sources I read that this
> also accounts for heteroskedasticity, in other sources not. Again,
> what about the normal distribution of residuals here? I heard that glm
> is better than OLS for non-negative dependent variables, is that
> correct? What are other assumptions of glm that could make me still
> prefer OLS? If I used it, and if my dependent variable is
> non-negative, and residuals are right skewed, do I have to "tell" that
> Stata when estimating the model, or can I run it as it is?
>
> (I quickly ran -glm- already, without any special specifications, and
> the results are the same as from the OLS model.)
>
> In sum, I need some decision-making support. What is the best thing to
> do in this case?
> One thing that would help is a comparison of assumptions of OLS, gls,
> glm. I am aware of the assumptions of OLS models, but for gls and glm
> I did not find comprehensive lists and explanations.
>
> It would be great if you could give me hints on what would be a good
> solution. Maybe you know a source explaining when to use which
> solution if OLS assumptions of normality and homoskedasticity are not
> met.
>
> Laura
>
>
>
> PS: I am aware of the fact that many used Tobit for similar dependent
> variables, including the zeros. My case is different, and for some
> reason I do not want to do this, and I excluded the zeros.
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/
```
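
The diagnostic steps suggested in the reply above could be sketched in Stata roughly as follows. This is only an illustration, not the poster's actual analysis: it assumes Stata's shipped `auto` dataset as a stand-in, with `price` playing the role of the positive, right-skewed dependent variable and `weight` and `mpg` as predictors. All generated variable names, graph names, and scalar names are hypothetical.

```stata
* Illustrative sketch only: the auto dataset stands in for the time-use
* data; price is the stand-in outcome, weight and mpg the predictors.
sysuse auto, clear

* 1. Look for curvature: scatterplots with LOWESS smooths
lowess price weight, name(lw_weight, replace)
lowess price mpg, name(lw_mpg, replace)

* 2. Fit by OLS, then plot studentized residuals against each
*    predictor (curvature) and against fitted values (spread)
regress price weight mpg
predict double rstu, rstudent
predict double yhat, xb
scatter rstu weight, yline(0) name(r_weight, replace)
scatter rstu mpg, yline(0) name(r_mpg, replace)
scatter rstu yhat, yline(0) name(r_fit, replace)

* 3. Candidate transformations of the dependent variable
generate double ln_price    = ln(price)
generate double log10_price = log10(price)
generate double sqrt_price  = sqrt(price)
generate double recip_price = 1/price

* 4. Fit the same model in both log bases; the slopes differ only by
*    the constant factor ln(10), so the ratio interpretation is the same
regress ln_price weight mpg
scalar b_ln = _b[weight]
regress log10_price weight mpg
scalar b_log10 = _b[weight]
display "slope ratio (ln / log10) = " b_ln / b_log10
display "ln(10)                   = " ln(10)
display "multiplicative change in price per +1000 lb = " 10^(1000*b_log10)

* 5. Re-check the residual plot on the transformed scale
predict double rstu_log, rstudent
predict double yhat_log, xb
scatter rstu_log yhat_log, yline(0) name(r_fit_log, replace)
```

Because log10(y) = ln(y)/ln(10), step 4 gives the same multiplicative effect whichever base is used; the choice of base affects only how easily the coefficients are read. The same residual plots can be repeated after regressing `sqrt_price` or `recip_price` to compare the transformations.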