# Re: st: getting realistic fitted values from a regression

From: Steven Samuels
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: getting realistic fitted values from a regression
Date: Fri, 23 Jul 2010 12:28:31 -0400

The original poster said nothing about testing the fit of his model, even for the original log response, for example with -linktest-, diagnostic plots, interactions, or substitution of fractional polynomials or splines for linear terms. Perhaps a poor model accounts for some of the difficulties.
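
A minimal sketch of such checks in Stata follows. The variable names (-lndist- for log travel distance, -x1- and -x2- for covariates) are hypothetical and do not come from the original post, and the single knot at the median is only one of many possible spline choices.

```stata
* Hypothetical names: lndist = log of travel distance, x1 x2 = covariates
regress lndist x1 x2

* Link test: a significant _hatsq term suggests misspecification
linktest

* Refit (linktest replaces the stored estimates), then plot
* residuals against fitted values to look for curvature or fanning
regress lndist x1 x2
rvfplot, yline(0)

* Allow an interaction between the two continuous covariates
regress lndist c.x1##c.x2

* Replace the linear term in x1 with a linear spline, knot at the median
summarize x1, detail
local knot = r(p50)
mkspline x1_lo `knot' x1_hi = x1
regress lndist x1_lo x1_hi x2
```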

Steve

On Jul 23, 2010, at 11:53 AM, Austin Nichols wrote:

Nick, Kit, et al.--
The other fixes can work really badly in the presence of non-lognormal
errors and/or heteroskedasticity, but -glm- or -poisson- still works
well, as pointed out in:
http://www.stata.com/meeting/boston10/boston10_nichols.pdf

In fact, I think the claim in the -levpredict- package is too strong:
"These predictions avoid the retransformation bias that arises when
predictions of the log dependent variable are exponentiated.  See
Cameron and Trivedi, MUS, 2009, 3.6.3."

Note that even MUS claims only "a weaker assumption is to assume that
u_i is i.i.d., in which case we can consistently estimate E[exp(u)] by
the sample average of exp(\hat{u}); see Duan(1983)" which is quite
distinct from avoiding retransformation bias in a non-iid setting, and
furthermore makes no claim about minimizing root mean square
prediction error, or RMSE of marginal effects, which presumably is the
goal of Woolton Lee.

Consistent estimation of the exponentiated error gets your mean
prediction closer to the mean of the outcome in levels, but still not
as close as -poisson- or -glm-, and does not guarantee that
predictions in levels for individual cases are particularly good.
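
For reference, the smearing idea quoted from MUS can be written out as below. The notation is assumed here rather than taken from the thread: x_i for the covariates, \hat\beta for the OLS coefficients from the log regression, and \hat u_i for its residuals.

```latex
% Exact relation for the mean in levels when log y = x'\beta + u
E[y \mid x] = \exp(x'\beta)\, E[\exp(u) \mid x]

% Smearing estimator: one constant factor estimated from the residuals
\widehat{E}[y \mid x] = \exp(x'\hat\beta)\cdot \frac{1}{n}\sum_{i=1}^{n} \exp(\hat u_i)
```

With i.i.d. errors, E[exp(u) | x] does not depend on x, so a single factor is enough. With heteroskedastic errors it varies with x, so a single factor can match the overall mean and still be wrong for individual cases.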

On Fri, Jul 23, 2010 at 11:06 AM, Nick Cox <n.j.cox@durham.ac.uk> wrote:

Thanks for the commendation.

It is easy enough to try the -glm- approach _and_ other fixes and to
compare results.

I have found that they give very similar answers in practice. What all
can agree on is that some kind of fix is needed when your real interest
is predicting on the original scale and a log scale -- or indeed any
other nonlinear transform or link -- was used for the response in
modelling.
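
A side-by-side comparison along these lines might look like the sketch below. The variable names (-dist-, -x1-, -x2-) are hypothetical, and the gamma family for -glm- is an assumption made for illustration, not a choice stated in the thread.

```stata
* Hypothetical names: dist = travel distance (> 0), x1 x2 = covariates
generate double lndist = ln(dist)

* (1) OLS on the log scale, then naive and smearing retransformations
regress lndist x1 x2
predict double xb, xb
predict double uhat, residuals
generate double yhat_naive = exp(xb)
generate double exp_uhat = exp(uhat)
summarize exp_uhat, meanonly
generate double yhat_smear = exp(xb) * r(mean)

* (2) GLM with log link; the gamma family is an assumption
glm dist x1 x2, family(gamma) link(log) vce(robust)
predict double yhat_glm, mu

* (3) Poisson regression with robust standard errors
poisson dist x1 x2, vce(robust)
predict double yhat_pois, n

* Compare the observed outcome with each set of predictions
summarize dist yhat_naive yhat_smear yhat_glm yhat_pois
```

Comparing the means and maxima in the final table shows how far each approach drifts from the observed outcome on the original scale.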

Nick
n.j.cox@durham.ac.uk

David Jacobs

Maarten states the received wisdom on this issue, but in the
econometrics text authored by Jeffrey Wooldridge (Introductory
Econometrics, Thomson South-Western, 2003), on pp. 208-9, Wooldridge
suggests a way to obtain unlogged predictions from a regression in
which the regressand is in log form (there have been subsequent
editions of this book but the page numbers I give will be close in
those newer editions).  If one of the statistical experts on this
list is familiar with this approach or is willing to look it up, I'd
be interested in their reaction.
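
For readers without the book at hand, the adjustments usually associated with that section are, as far as I recall, (a) scaling exp(fitted log y) by exp(sigma^2/2) under normal errors and (b) regressing y on exp(fitted log y) through the origin and using the slope as the scale factor. The sketch below assumes that reading, which should be checked against the text, and uses hypothetical variable names.

```stata
* Hypothetical names: dist = travel distance (> 0), x1 x2 = covariates
generate double lndist = ln(dist)
regress lndist x1 x2
predict double xb, xb
generate double m = exp(xb)

* (a) Normal-errors adjustment: scale by exp(sigma^2 / 2),
*     using the root MSE stored by -regress- as the estimate of sigma
generate double yhat_norm = m * exp(e(rmse)^2 / 2)

* (b) Regression through the origin of y on exp(fitted log y);
*     the slope serves as the scale factor
regress dist m, noconstant
generate double yhat_origin = _b[m] * m

summarize dist yhat_norm yhat_origin
```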

That said, I wholeheartedly agree with Maarten's recommendation.  I
found the article he suggests by Cox et al. to be extremely useful
and I'm grateful to him for suggesting it on another occasion.

David Jacobs

At 03:08 AM 7/22/2010, you wrote:

--- On Wed, 21/7/10, Woolton Lee wrote:

I have estimated a regression (OLS) using log of patient
travel distance to a hospital predicted by patient, hospital
and area characteristics.  I am going to report the results
as marginal effects that I've computed by obtaining
predictions from my estimated regression computed by fixing
some variables and keeping others at their original values.
However after I compute the predictions, I am getting
unrealistically large numbers.  When I examined the regression
residuals it looks as though the obs with unrealistic fitted
values have larger residuals.  Is there a way to adjust the
regression to better account for this problem?


If you want to predict the travel distance you should use
-glm- with -link(log)- option rather than use -regress- on
a log transformed dependent variable. The difference is that
with the former you are modeling log(E(y)), while in the latter
you are modeling E(log(y)). If you want to backtransform your
predictions using the antilog transformation you will get
exp(log(E(y))) = E(y) for the -glm- command, while after -regress-
you get exp(E(log(y))) != E(y). A nice discussion on this issue
can be found in:


Nicholas J. Cox, Jeff Warburton, Alona Armstrong, Victoria J. Holliday
(2007) "Fitting concentration and load rating curves with generalized
linear models" Earth Surface Processes and Landforms, 33(1):25--39.
<http://www3.interscience.wiley.com/journal/114281617/abstract>
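
The direction of the inequality exp(E(log(y))) != E(y) follows from Jensen's inequality, because exp is convex. A lognormal outcome, assumed here purely for illustration, shows the size of the gap:

```latex
% Jensen's inequality for the convex function exp
\exp\bigl(E[\log y]\bigr) \le E[y]

% Illustration: if \log y \sim N(\mu, \sigma^2), then
\exp\bigl(E[\log y]\bigr) = e^{\mu}, \qquad E[y] = e^{\mu + \sigma^2/2}
```

In that case the naive backtransformation returns the median of y, and the shortfall relative to the mean grows with the error variance.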

There exist approximations you can use after -regress- to fix
this problem, but why try to fix a problem if you can easily prevent
it?
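
A minimal sketch of the -glm- route, including predictions with one covariate fixed, is given below. The variable names and the gamma family are assumptions for illustration; the thread specifies only the log link.

```stata
* Hypothetical names: dist = travel distance (> 0), x1 x2 = covariates
glm dist x1 x2, family(gamma) link(log) vce(robust)

* Fitted means on the original scale; no retransformation step is needed
predict double dist_hat, mu

* Average predicted distance with x1 fixed at 1, other covariates as observed
margins, at(x1=1)

* Average marginal effect of x2 on expected distance
margins, dydx(x2)
```

Because -margins- works from the predicted mean after -glm-, the reported effects are already in the units of the outcome.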


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
