Paul:
I have two remarks:
First, the distribution of the dependent variable isn't relevant; the
distribution of the errors is. So the fact that the dependent variable
is skewed is no reason to transform that variable; only inspection of
the residuals can give you one. Have a look at -help regress
postestimation- for lots of useful commands.
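To illustrate the first remark, here is a minimal sketch in Python (not
Stata) using simulated data. Everything in it is an assumption made up
for illustration: the skewed explanatory variable, the coefficients, the
seed, and the helper names skewness() and ols_residuals(). It shows a
dependent variable that is strongly skewed only because the explanatory
variable is, while the errors themselves are symmetric.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    m = statistics.fmean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def ols_residuals(x, y):
    """Residuals from a simple regression of y on x with an intercept."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


# Hypothetical data: a right-skewed explanatory variable, normal errors.
rng = random.Random(12345)
x = [rng.expovariate(1.0) for _ in range(5000)]
y = [100 + 50 * a + rng.gauss(0, 10) for a in x]

resid = ols_residuals(x, y)
skew_y = skewness(y)
skew_resid = skewness(resid)

print(f"skewness of y:         {skew_y:.2f}")
print(f"skewness of residuals: {skew_resid:.2f}")
```

In this setup the skewness of y is large while that of the residuals is
close to zero, so the skew in y alone would not be a reason to
transform it.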
Second, if you want to search automatically for some transformation of
the dependent variable, -boxcox- is already preprogrammed, which saves
you some trouble. However, I don't like these automated search
techniques: they tend to make people stop thinking for themselves. For
instance, if you keep the dependent variable as is, you think that a
unit change in your explanatory variable causes a given number of
dollars change in out-of-pocket spending. However, if you log-transform
the dependent variable, you think that out-of-pocket spending changes
by some given percentage for a unit change in the explanatory variable.
Choosing between these two on substantive grounds would be more
satisfactory for me.
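The difference between the two interpretations can be made concrete
with a small arithmetic sketch in Python. The coefficients (25 dollars
in the untransformed model, 0.05 in the log model) and the baseline
spending levels are hypothetical numbers chosen only for illustration,
as are the function names.

```python
import math


def change_linear(baseline, b, dx=1.0):
    """Dollar change in spending when y itself is the dependent variable.

    The change is b * dx regardless of the baseline level.
    """
    return b * dx


def change_log(baseline, b, dx=1.0):
    """Dollar change in spending when log(y) is the dependent variable.

    Spending is multiplied by exp(b * dx), so the dollar change
    scales with the baseline level.
    """
    return baseline * (math.exp(b * dx) - 1.0)


def percent_change_log(b, dx=1.0):
    """Percentage change in spending implied by the log model."""
    return 100.0 * (math.exp(b * dx) - 1.0)


# Hypothetical coefficients and baseline spending levels.
b_linear = 25.0
b_log = 0.05

for baseline in (200.0, 2000.0):
    print(
        f"baseline {baseline:7.0f}: "
        f"linear model {change_linear(baseline, b_linear):7.2f} dollars, "
        f"log model {change_log(baseline, b_log):7.2f} dollars "
        f"({percent_change_log(b_log):.2f}%)"
    )
```

The untransformed model implies the same dollar change at every
spending level, while the log model implies the same percentage change,
so the dollar change grows with the baseline.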
HTH,
Maarten
--- paul d jacobs wrote:
> I am working with health data (MEPS) where
> out-of-pocket medical expenses (OOP) are a dependent
> variable in an OLS regression. Because of the
> positive skewness of such a variable, I would like to
> use a normalizing transformation, i.e. the log of OOP.
> However, because of the many zero observations for
> OOP, the options are to either add a constant to OOP,
> (some have used $1 arbitrarily), or to model the data
> separately for the zeroes and the positive values,
> which I'd rather not do. (I have also considered the
> square root transformation, etc., but would like to
> test out the results using a log-constant).
>
> My question is: do you know of a method for searching
> for the optimal constant to add to a variable so that
> a log-transformation produces the optimal result? Deb
> et al. (2005), suggest a 'grid search' for this value
> (see link below for document). I know that grid
> searches are used in the context of maximum
> likelihood; is this a similar process? Would running
> the model with different values and comparing R2s and
> standard errors be more appropriate?
-----------------------------------------
Maarten L. Buis
Department of Social Research Methodology
Vrije Universiteit Amsterdam
Boelelaan 1081
1081 HV Amsterdam
The Netherlands
visiting address:
Buitenveldertselaan 3 (Metropolitan), room Z434
+31 20 5986715
http://home.fsw.vu.nl/m.buis/
-----------------------------------------