
RE: st: Correction for bias in regression estimates after log transformation

From   "Nick Cox" <>
To   <>
Subject   RE: st: Correction for bias in regression estimates after log transformation
Date   Wed, 17 Dec 2008 13:06:19 -0000

The issue as I understand it for response y arises because the mean of
log(y) differs from the log of mean(y). What you do to the predictors is
immaterial. The problem is generic to any nonlinear transformation. 
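
A toy illustration of that first point, using three made-up values of y (chosen only because the arithmetic is easy): back-transforming the mean of log(y) gives the geometric mean, not the arithmetic mean.

```stata
* Toy data, invented for illustration only
clear
input y
1
10
100
end

generate double logy = log(y)

* exp(mean of log y) is the geometric mean of y
summarize logy, meanonly
display "exp(mean of log y) = " exp(r(mean))

* the arithmetic mean of y is larger
summarize y, meanonly
display "mean of y          = " r(mean)
```

Here the geometric mean is about 10 while the arithmetic mean is 37, so exponentiating a prediction of log(y) understates mean(y).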

I see there being two main relatively simple ways of tackling this
problem. (There are other more complicated methods; my experience, such
as it is, indicates that they don't give very different results except
when results are highly dubious anyway.) 

1. Avoid it altogether by using -glm- with appropriate link. 

2. Use smearing. 

Richard Goldstein implemented -predlog- in 1996, which includes
smearing:

STB-29  sg48  . Predictions in the original metric for log-transformed models
        (help predlog if installed) . . . . . . . . . . . . . R. Goldstein
        1/96    pp.27--29; STB Reprints Vol 5, pp.145--147
        calculates three different retransformations, which allow
        obtaining predictions in the original metric

Both the software and the original article are accessible to all. 

You can almost do smearing by hand, but here is a slightly more polished
version of doing it by hand. 

*! NJC 2.1.0 8 January 2005 
* NJC 1.0.0 13 September 2002 
program smear, rclass  
	version 8.0
	syntax [if] [in] [, Generate(str) OUTofsample ]  

	if "`generate'" != "" { 
		capture confirm new variable `generate' 
		if _rc {
			di as err "option syntax is generate(newvar)" 
			exit _rc 
		}
	}

	marksample touse 
	qui count if `touse' 
	if r(N) == 0 error 2000
	tempvar resid yhatraw
	tempname rmse cf 

	qui { 
		* will exit with error message if no estimates 
		scalar `rmse' = e(rmse)
		if "`outofsample'" != "" predict double `yhatraw' 
		else predict double `yhatraw' if e(sample) 
		predict double `resid', res
		replace `resid' = exp(`resid') 
		su `resid', meanonly 
		scalar `cf' = r(mean) 

		if "`generate'" != "" { 
			gen double `generate' = exp(`yhatraw') * `cf' if `touse'
			la var `generate' "smeared retransformation"
		}
	}

	di as res scalar(`cf') 
	return scalar smearcf = `cf' 
end
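
For comparison, here is a minimal sketch of the unpolished by-hand version. The auto dataset and the names lnprice, lnhat, res, expres, cf and price_hat are assumptions chosen only for illustration; none of them come from the original question.

```stata
* Example data shipped with Stata; any model for log(y) would do
sysuse auto, clear
generate double lnprice = ln(price)
regress lnprice weight foreign

* Linear prediction and residuals on the log scale
predict double lnhat if e(sample), xb
predict double res if e(sample), residuals

* Smearing factor: the mean of the exponentiated residuals
generate double expres = exp(res)
summarize expres, meanonly
scalar cf = r(mean)
display "smearing factor = " scalar(cf)

* Prediction on the original scale
generate double price_hat = exp(lnhat) * scalar(cf)
label variable price_hat "smeared retransformation"
```

If the program above is saved as smear.ado somewhere on the adopath, typing -smear, generate(price_smear)- straight after the -regress- should give the same factor and the same predictions.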

There is more discussion in 

N.J. Cox, J. Warburton, A. Armstrong and V.J. Holliday. 2008. Fitting
concentration and load rating curves with generalised linear models.
Earth Surface Processes and Landforms 33: 25-39 (doi: 10.1002/esp.1523)

which may be accessible to you. 


Maarten Buis

--- "Loncar, Dejan" <> wrote:
> I have transformed the variables using log function before
> regression.
> Do you know by any chance which function in Stata or some ado file
> can perform antilog transformation after regression with correction
> for bias in regression estimates? 

Bias means nothing else than that your estimates don't mean what you
think they mean. So there are two ways of addressing bias: Either you
change interpretation of the results so that the interpretation
corresponds to the estimate, or you change your estimate so that it
measures what you think it does. Another consequence of this is that
there is no such thing as a biased estimate per se: you always need to
specify what the estimate is a biased estimate of. Trivially all
estimates are biased estimates of most concepts (e.g. the annual tea
consumption of Burundi is a biased estimate of the number of ants per
square inch in Amsterdam), and at the same time all estimates are
unbiased estimates of the thing that they measure (but the thing they
measure may not be of interest).

The distinction between changing the interpretation and changing the
estimate is nicely illustrated by looking at a log transformed
dependent variable. If you first transform the dependent variable and
then perform a regular regression you can interpret the exponentiated
coefficients as ratios of geometric means, but not as ratios of
arithmetic means. You can get estimates in terms of ratios of
arithmetic means when you use -glm- on the untransformed dependent
variable with the -link(log)- option. So if you are interested in the
effect on the geometric mean, then -glm- will provide you with biased
estimates. You can solve this either by changing your interpretation of
the results to the effect in terms of the arithmetic mean or by
estimating your model with -regress-. 
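
A minimal sketch of the two estimates side by side. The auto dataset, the single 0/1 predictor foreign and the name lnprice are assumptions made for illustration. With one binary predictor both models are saturated, so the exponentiated coefficient is exactly a ratio of group means of the relevant kind.

```stata
sysuse auto, clear
generate double lnprice = ln(price)

* Exponentiated coefficient: ratio of geometric means of price
regress lnprice foreign, eform("GM ratio")

* Exponentiated coefficient: ratio of arithmetic means of price
glm price foreign, link(log) eform
```

The two ratios will generally differ, and neither is wrong: each answers a different question about price.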

I have discussed a detailed example of this issue here:

Also see:
Roger Newson (2003) Stata tip 1: The eform() option of regress. The
Stata Journal 3(4): 445.
