
From: Austin Nichols <austinnichols@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Handling 0 values when using logs of a dependent variable
Date: Wed, 15 Jul 2009 13:48:37 -0400

Rich Steinberg<rsteinbe@iupui.edu> :

Not so--for many purposes, adding a constant to all observations is worse, as demonstrated by the simulation below. Making the missings (logged zeros) a number smaller than all others and running -tobit- can work quite well if you pick the right number--but unless you know the right answer already, you cannot be assured of picking the right number, as the same simulation shows, and you will often wind up worse off!

On the other hand, the -poisson- approach, or equivalent -glm- call, works more generally, in the sense of giving a distribution of estimated coefficients centered near the true value and having a size of test (rejection rate) better than the alternatives (closer to 5%).

clear all
prog simlny, rclass
 * true model: E[y|x] is proportional to exp(x), so the true coef on x is 1
 drawnorm x e, clear n(1000)
 * rounding to the nearest .5 turns values below .25 into zeros
 g y=round(exp(x+e),.5)
 count if y==0
 return scalar n0=r(N)
 * (1) Poisson regression of y on x with robust SEs
 poisson y x, r
 return scalar pois=_b[x]
 test x=1
 return scalar p_pois=r(p)
 return scalar r_pois=r(p)<.05
 * (2) OLS on ln(y+1)
 g lny=ln(y+1)
 reg lny x, r
 return scalar reg=_b[x]
 test x=1
 return scalar p_reg=r(p)
 return scalar r_reg=r(p)<.05
 * (3) tobit on ln(y+1), censored at its observed minimum
 tobit lny x, ll r
 return scalar tobit=_b[x]
 test x=1
 return scalar p_tobit=r(p)
 return scalar r_tobit=r(p)<.05
 * (4) tobit on ln(y) with zeros set to ln(.25)
 g lny2=max(ln(y),ln(.25))
 tobit lny2 x, ll r
 return scalar t2=_b[x]
 test x=1
 return scalar p_t2=r(p)
 return scalar r_t2=r(p)<.05
 * (5) tobit on ln(y) with zeros set to ln(.025)
 g lny3=max(ln(y),ln(.025))
 tobit lny3 x, ll r
 return scalar t3=_b[x]
 test x=1
 return scalar p_t3=r(p)
 return scalar r_t3=r(p)<.05
 eret clear
end
simul,rep(100) seed(1): simlny
* compare the sampling distributions of the estimated coefs
tw kdensity pois||kdensity reg||kdensity tobit
tw kdensity pois||kdensity t2||kdensity t3,name(tx)
* squared error of each estimate relative to the true coef of 1
foreach v of varlist pois reg t* {
 g mse_`v'=(`v'-1)^2
 }
* rejection rates (nominal size 5%) and mean squared errors
su r_*
su mse_*

For those who have read down this far: increase the number of reps to at least 10,000 for reasonable estimates, and try out different coefs for x and the constant, and different imputations for the lower limit, to see that the results are robust.
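The "equivalent -glm- call" mentioned above is not spelled out in the simulation. A minimal sketch of what it might look like follows, assuming the same data-generating process as the simulation and a single draw; the seed value is an arbitrary choice for illustration only.

```stata
* Sketch only: one draw from the same process as the simulation above
clear
set seed 1                       // arbitrary seed, for reproducibility only
drawnorm x e, n(1000)
generate y = round(exp(x + e), .5)

* Poisson regression of y on x with robust standard errors
poisson y x, vce(robust)

* The same model written as a GLM: Poisson family, log link, robust SEs
glm y x, family(poisson) link(log) vce(robust)
```

Both commands should report the same coefficient on x. Stata will note that y is not a count, since y takes values in multiples of .5; this does not prevent estimation.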
Also note that the effect of x on ln(y+1) is not the same as the effect of x on ln(y) so of course the "reg" results are not directly comparable, but many authors treat them as if they are, and Rich mentions no post-estimation adjustment below, so I have ignored that point in keeping with standard practice.

On Wed, Jul 15, 2009 at 12:38 PM, Rich Steinberg<rsteinbe@iupui.edu> wrote:
> If, instead of just replacing the zeros, you add a small number (1 or 10) to
> every left hand variable b4 taking logs, you are not doing anything except
> slightly translating the origin of the log log curve. (of course, adjust the
> lower limit of the tobit if you do this).
> Since you probably have no reason to believe that the log-log curve ought to
> be centered at any particular point, this is ok -- you could even try
> different values of the constant to see which gives you the best fit.
>
> Whether you want to do this or the other ideas suggested by Austin depends
> on what the residuals look like for each. It is a combination of a
> functional form and error distribution question.
>
> Austin Nichols wrote:
>>
>> Dana Chandler<dchandler@gmail.com> :
>> The log of zero is missing for a reason, as the quantity is undefined.
>> You should ignore the advice of anyone who suggests replacing the
>> zeros with ones before taking logs, which is demonstrably wrong, as is
>> the sometimes used strategy of replacing the zeros with a value (call
>> it u) smaller than any observed positive value, taking logs, then
>> applying -tobit- with a lower limit at ln(u). OTOH, -poisson- or -glm-
>> with a log link regressing y on x will give you results comparable to
>> regressing ln(y) on x and includes the y=0 cases in a natural way.
>> Make sure you use robust SEs and see also the help file for -ivpois-
>> on SSC.
>>
>> On Wed, Jul 15, 2009 at 9:17 AM, Dana Chandler<dchandler@gmail.com> wrote:
>>
>>>
>>> I would like to run a regression of a given variable's log on another
>>> set of variables. How should I handle the 0 values?
>>>
>>> I have searched for an answer and saw some people say that "you cannot
>>> run a regression with logs on values of zero... those values should be
>>> considered 'missing' in the regression." Another suggestion was to
>>> replace the 0s with 1s.
>>>
>>> Any thoughts?
>>>
>>> Thanks in advance,
>>> Dana
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:
- **Re: st: Handling 0 values when using logs of a dependent variable** *From:* Rich Steinberg <rsteinbe@iupui.edu>

**References**:
- **st: Handling 0 values when using logs of a dependent variable** *From:* Dana Chandler <dchandler@gmail.com>
- **Re: st: Handling 0 values when using logs of a dependent variable** *From:* Austin Nichols <austinnichols@gmail.com>
- **Re: st: Handling 0 values when using logs of a dependent variable** *From:* Rich Steinberg <rsteinbe@iupui.edu>
