
From: Dana Chandler <dchandler@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Handling 0 values when using logs of a dependent variable
Date: Thu, 16 Jul 2009 02:04:25 -0500

Many thanks to everyone who took the time to respond to this. Thanks
especially Rich for taking time to explain how to run a simulation and
see how other popularly recommended substitution methods fall short.

Best,
Dana

On Wed, Jul 15, 2009 at 2:00 PM, Rich Steinberg<rsteinbe@iupui.edu> wrote:
> Thanks for the education, which I won't have time to absorb now (I haven't
> learned how to program in stata) but will look back on. But here is some
> more advice for Dana.
>
> 1) it surely matters whether those zeros are true zeros or whether the
> missing data was coded as zero. Tobit treats both kinds of zeros the same
> -- it assumes that there is a latent variable that can take negative values
> but those negative values are censored. You might prefer to say that the
> zeros are real, not a code for censored values of the latent variable, and
> then Austin is right regardless because tobit's underlying assumption is
> logically inconsistent with the process generating the data. (In such a
> situation, tobit may or may not work well as an approximation -- that is
> part of what the simulation is about -- but logical consistency is worth
> something).
> 2) Tobit has other problems that may make it a very bad choice in Dana's
> situation -- it is not consistent if the error term is non-normal, and it
> imposes proportionality between the marginal effect of the independent
> variable on the probability the depvar will be nonzero and on the increment
> to depvar when it starts out positive. So while adding a constant before
> logging is still necessary, CLAD, a user-provided stata routine, is another
> alternative you should think about. CLAD is semi-parametric -- essentially
> quantile regression with censored depvar.
>
> 3) I don't think the point Austin raises at the bottom is very important, as
> a practical matter, if you choose a small constant (not just smaller than
> the lowest observed value, but much smaller). The coefficient of log X on
> log (Y+ k) is the elasticity of Y+k with respect to X., which is going to be
> very close in value to the ordinary elasticity around the mean of your
> sample. Now, if you absolutely knew that the true (ordinary) elasticity was
> constant across the range of your sample, when you estimate a "Y+k"
> elasticity that is constant, the implied "Y" elasticity will be variable.
> But we almost never have a priori reason to believe in a true functional
> form that is constant elasticity, so I see no problem with this.
> 4) Regardless, in most tobit cases, you want to do mfx using ystar and
> report these results rather than the tobit coefficients. I assume you know
> about that, but if not, the classic reference is McDonald and Moffitt and it
> is documented many places.
>
>
>
> Austin Nichols wrote:
>>
>> Rich Steinberg<rsteinbe@iupui.edu> :
>> Not so--for many purposes, adding a constant to all observations is
>> worse, as demonstrated by the attached simulation. Making the
>> missings (logged zeros) a number smaller than all others and running
>> -tobit- can work quite well if you pick the right number--but unless
>> you know the right answer already, you cannot be assured of picking
>> the right number, also demonstrated in the simulation below, and you
>> will often wind up worse off! On the other hand, the -poisson-
>> approach, or equivalent -glm- call, works more generally, in the sense
>> of giving a distribution of estimated coefficients centered near the
>> true value and having a size of test (rejection rate) better than the
>> alternatives (closer to 5%).
>>
>> clear all
>> prog simlny, rclass
>> drawnorm x e, clear n(1000)
>> g y=round(exp(x+e),.5)
>> count if y==0
>> return scalar n0=r(N)
>> poisson y x, r
>> return scalar pois=_b[x]
>> test x=1
>> return scalar p_pois=r(p)
>> return scalar r_pois=r(p)<.05
>> g lny=ln(y+1)
>> reg lny x, r
>> return scalar reg=_b[x]
>> test x=1
>> return scalar p_reg=r(p)
>> return scalar r_reg=r(p)<.05
>> tobit lny x, ll r
>> return scalar tobit=_b[x]
>> test x=1
>> return scalar p_tobit=r(p)
>> return scalar r_tobit=r(p)<.05
>> g lny2=max(ln(y),ln(.25))
>> tobit lny2 x, ll r
>> return scalar t2=_b[x]
>> test x=1
>> return scalar p_t2=r(p)
>> return scalar r_t2=r(p)<.05
>> g lny3=max(ln(y),ln(.025))
>> tobit lny3 x, ll r
>> return scalar t3=_b[x]
>> test x=1
>> return scalar p_t3=r(p)
>> return scalar r_t3=r(p)<.05
>> eret clear
>> end
>> simul,rep(100) seed(1): simlny
>> tw kdensity pois||kdensity reg||kdensity tobit
>> tw kdensity pois||kdensity t2||kdensity t3,name(tx)
>> foreach v of varlist pois reg t* {
>> g mse_`v'=(`v'-1)^2
>> }
>> su r_*
>> su mse_*
>>
>> For those who have read down this far: increase the number of reps to
>> at least 10,000 for reasonable estimates, and try out different coefs
>> for x and the constant, and different imputations for the lower limit,
>> to see that the results are robust. Also note that the effect of x on
>> ln(y+1) is not the same as the effect of x on ln(y) so of course the
>> "reg" results are not directly comparable, but many authors treat them
>> as if they are, and Rich mentions no post-estimation adjustment below,
>> so I have ignored that point in keeping with standard practice.
>>
>> On Wed, Jul 15, 2009 at 12:38 PM, Rich Steinberg<rsteinbe@iupui.edu>
>> wrote:
>>
>>> If, instead of just replacing the zeros, you add a small number (1 or 10)
>>> to every left hand variable b4 taking logs, you are not doing anything
>>> except slightly translating the origin of the log log curve. (of course,
>>> adjust the lower limit of the tobit if you do this).
>>> Since you probably have no reason to believe that the log-log curve ought
>>> to be centered at any particular point, this is ok -- you could even try
>>> different values of the constant to see which gives you the best fit.
>>>
>>> Whether you want to do this or the other ideas suggested by Austin
>>> depends on what the residuals look like for each. It is a combination of
>>> a functional form and error distribution question.
>>>
>>> Austin Nichols wrote:
>>>
>>>> Dana Chandler<dchandler@gmail.com> :
>>>> The log of zero is missing for a reason, as the quantity is undefined.
>>>> You should ignore the advice of anyone who suggests replacing the
>>>> zeros with ones before taking logs, which is demonstrably wrong, as is
>>>> the sometimes used strategy of replacing the zeros with a value (call
>>>> it u) smaller than any observed positive value, taking logs, then
>>>> applying -tobit- with a lower limit at ln(u). OTOH, -poisson- or -glm-
>>>> with a log link regressing y on x will give you results comparable to
>>>> regressing ln(y) on x and includes the y=0 cases in a natural way.
>>>> Make sure you use robust SEs and see also the help file for -ivpois-
>>>> on SSC.
>>>>
>>>> On Wed, Jul 15, 2009 at 9:17 AM, Dana Chandler<dchandler@gmail.com>
>>>> wrote:
>>>>
>>>>> I would like to run a regression of a given variable's log on another
>>>>> set of variables. How should I handle the 0 values?
>>>>>
>>>>> I have searched for an answer and saw some people say that "you cannot
>>>>> run a regression with logs on values of zero... those values should be
>>>>> considered 'missing' in the regression." Another suggestion was to
>>>>> replace the 0s with 1s.
>>>>>
>>>>> Any thoughts?
>>>>>
>>>>> Thanks in advance,
>>>>> Dana

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
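
A short worked derivation may help with Rich's point 3 in the thread above, on how a constant "Y+k" elasticity relates to the ordinary elasticity of Y. This is only a sketch: the symbols \beta (the coefficient from regressing log(Y+k) on log X) and \eta_Y (the ordinary elasticity of Y with respect to X) are notation introduced here for illustration and do not appear in the original messages.

```latex
% Assumed notation: \beta is the coefficient on \ln X when the
% dependent variable is \ln(Y+k); \eta_Y is the ordinary elasticity.
\begin{align*}
\beta  &= \frac{d\ln(Y+k)}{d\ln X} = \frac{dY}{dX}\cdot\frac{X}{Y+k}, \\
\eta_Y &= \frac{d\ln Y}{d\ln X} = \frac{dY}{dX}\cdot\frac{X}{Y}
        = \beta\,\frac{Y+k}{Y} = \beta\left(1+\frac{k}{Y}\right).
\end{align*}
```

Read this way, a constant \beta implies an ordinary elasticity that varies with Y, as Rich says, and the gap is governed by k/Y. With k = 0.01 and Y = 10 the factor is 1.001, so the two elasticities nearly coincide; with k = 1 and Y = 0.5 the factor is 3, and it grows without bound as Y approaches 0. This is also consistent with Austin's remark that the effect of x on ln(y+1) is not the same as the effect of x on ln(y).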

**References:**

- **st: Handling 0 values when using logs of a dependent variable** *From:* Dana Chandler <dchandler@gmail.com>
- **Re: st: Handling 0 values when using logs of a dependent variable** *From:* Austin Nichols <austinnichols@gmail.com>
- **Re: st: Handling 0 values when using logs of a dependent variable** *From:* Rich Steinberg <rsteinbe@iupui.edu>
- **Re: st: Handling 0 values when using logs of a dependent variable** *From:* Austin Nichols <austinnichols@gmail.com>
- **Re: st: Handling 0 values when using logs of a dependent variable** *From:* Rich Steinberg <rsteinbe@iupui.edu>

