There is no assumption regarding the distribution of the
actual data in linear regression; however, for the p-values
and confidence intervals to be meaningful, the residuals
of the regression should be approximately normally distributed.
If you have estimated a model assuming linearity
of continuous terms and no interactions (i.e.,
assuming additivity), then the distribution of the
residuals often mimics the distribution of the
left-hand-side variable (here, los). However, if
you include non-linear terms (e.g., polynomials or
splines) or any interactions, then
the distribution of the residuals can be quite
different from the distribution of the left-hand-side
variable.
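To make the residuals-versus-outcome distinction concrete, here is a small simulation sketch. It is written in Python purely for illustration (it is not Stata output), and the data, coefficients, and variable names are all made up. The outcome is right-skewed only because the single predictor is skewed, and the errors are normal by construction, so this is one case where the residuals do not mimic the outcome.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable


def skewness(values):
    """Moment-based sample skewness (0 for a symmetric distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical data: a right-skewed predictor and normal errors.
n = 5000
x = [random.expovariate(1.0) for _ in range(n)]
y = [2.0 + 3.0 * xi + random.gauss(0.0, 1.0) for xi in x]

# Ordinary least squares with one predictor, computed by hand.
x_bar = sum(x) / n
y_bar = sum(y) / n
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

skew_outcome = skewness(y)
skew_residuals = skewness(residuals)
print("skewness of outcome:  ", round(skew_outcome, 2))
print("skewness of residuals:", round(skew_residuals, 2))
```

With real data the errors are not known to be normal, so the residuals themselves are the thing to inspect after fitting.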
If you decide to log-transform the left-hand-side variable,
there are user-written procedures to help with the
interpretation; use -search- to find them.
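As a rough guide to reading coefficients from a regression on log(los), the arithmetic can be sketched as below. This is Python used as a calculator, not Stata output, and the coefficient and standard error are hypothetical numbers chosen for illustration. The point is that exponentiating turns a coefficient into a ratio (a multiplicative change in the geometric mean of los), so the "no effect" value for the exponentiated interval is 1, not 0.

```python
import math

# Hypothetical estimates from a regression of log(los) on a 0/1 predictor.
b = 0.10    # coefficient on the log scale (assumed value)
se = 0.06   # its standard error (assumed value)
z = 1.96    # normal approximation for a 95% interval

# Interval on the log scale: "no effect" corresponds to 0.
lo_log, hi_log = b - z * se, b + z * se

# Exponentiate the estimate and both interval endpoints:
# "no effect" now corresponds to a ratio of 1.
ratio = math.exp(b)
lo_ratio, hi_ratio = math.exp(lo_log), math.exp(hi_log)
pct_change = 100 * (ratio - 1)

covers_zero_log = lo_log <= 0 <= hi_log
covers_one_ratio = lo_ratio <= 1 <= hi_ratio

print("log scale:   %.3f (%.3f, %.3f)" % (b, lo_log, hi_log))
print("ratio scale: %.3f (%.3f, %.3f)" % (ratio, lo_ratio, hi_ratio))
print("approx. percent change in los: %.1f" % pct_change)
print("interval covers 0 on log scale:", covers_zero_log)
print("interval covers 1 on ratio scale:", covers_one_ratio)
```

An exponentiated interval can never contain zero, because exp() is always positive. A p-value above 0.05 together with an exponentiated interval that excludes zero is therefore not a contradiction.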
Another alternative is to leave los untransformed and use -glm-
with a log link.
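The two approaches target different quantities, which is one reason back-transformed results from a log(los) regression need not match a model fitted to los directly. Regressing log(los) models the mean of the logs, while a log link models the log of the mean. A minimal sketch in Python (illustration only, with made-up lengths of stay) shows the gap in the simplest case with no predictors:

```python
import math

# Made-up, right-skewed lengths of stay in days.
los = [1, 2, 2, 3, 3, 4, 5, 7, 10, 21]

arithmetic_mean = sum(los) / len(los)

# What a regression on log(los) estimates: the mean of the logs.
mean_of_logs = sum(math.log(v) for v in los) / len(los)
# Exponentiating it gives the geometric mean, not the arithmetic mean.
geometric_mean = math.exp(mean_of_logs)

# What a log link works with: the log of the arithmetic mean.
log_of_mean = math.log(arithmetic_mean)

print("arithmetic mean of los:", round(arithmetic_mean, 2))
print("geometric mean of los: ", round(geometric_mean, 2))
print("mean of log(los):      ", round(mean_of_logs, 3))
print("log of mean(los):      ", round(log_of_mean, 3))
```

For skewed positive data the geometric mean falls below the arithmetic mean, so exponentiated predictions from the log(los) model describe a different summary of los than the log-link model does.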
hope this helps,
Rich
Ashwin Ananthakrishnan wrote:
Hi,
I have a model where the outcome is length of stay
(los). This variable has some right skew and is not
perfectly 'normal'.
Is it valid for me to run linear regression of other
predictors on length of stay if the los is not
normally distributed?
If it is not valid, then log(los) is a normally
distributed variable. But how do I interpret the
coefficients of the log(los)? I find that
exponentiating log(los) coefficient doesn't seem to be
appropriate as it doesn't yield valid results. For
example p>0.05, but the 95% CI don't overlap 'zero'
which is what I would expect in linear regression.
Also exp(log(los)) doesn't give a similar estimate as
the coefficients if I run the regression on los
directly.
I apologize in advance if my question is either too
basic or difficult to understand.
Thank you.