# Re: st: Handling 0 values when using logs of a dependent variable

| From | To | Subject | Date |
|---|---|---|---|
| Dana Chandler | statalist@hsphsun2.harvard.edu | Re: st: Handling 0 values when using logs of a dependent variable | Thu, 16 Jul 2009 02:04:25 -0500 |

```
Many thanks to everyone who took the time to respond to this. Thanks
especially to Rich for taking the time to explain how to run a simulation and
see how other popularly recommended substitution methods fall short.

Best,
Dana

On Wed, Jul 15, 2009 at 2:00 PM, Rich Steinberg<rsteinbe@iupui.edu> wrote:
> Thanks for the education, which I won't have time to absorb now (I haven't
> learned how to program in Stata) but will look back on.  But here are some
> thoughts:
>
> 1) it surely matters whether those zeros are true zeros or whether the
> missing data was coded as zero.  Tobit treats both kinds of zeros the same
> -- it assumes that there is a latent variable that can take negative values
> but those negative values are censored.  You might prefer to say that the
> zeros are real, not a code for censored values of the latent variable, and
> then Austin is right regardless because tobit's underlying assumption is
> logically inconsistent with the process generating the data.  (In such a
> situation, tobit may or may not work well as an approximation -- that is
> part of what the simulation is about -- but logical consistency is worth
> something).
> 2) Tobit has other problems that may make it a very bad choice in Dana's
> situation -- it is not consistent if the error term is non-normal, and it
> imposes proportionality between the marginal effect of the independent
> variable on the probability the depvar will be nonzero and on the increment
> to depvar when it starts out positive.  So while adding a constant before
> logging is still necessary, CLAD, a user-provided Stata routine, is another
> option: quantile regression with a censored depvar.
>
> 3) I don't think the point Austin raises at the bottom is very important, as
> a practical matter, if you choose a small constant (not just smaller than
> the lowest observed value, but much smaller).  The coefficient of log X on
> log (Y+k) is the elasticity of Y+k with respect to X, which is going to be
> very close in value to the ordinary elasticity around the mean of your
> sample.  Now, if you absolutely knew that the true (ordinary) elasticity was
> constant across the range of your sample, when you estimate a "Y+k"
> elasticity that is constant, the implied "Y" elasticity will be variable.
> But we almost never have a priori reason to believe in a true functional
> form that is constant elasticity, so I see no problem with this.
> 4) Regardless, in most tobit cases, you want to do mfx using ystar and
> report these results rather than the tobit coefficients.  I assume you know
> about that, but if not, the classic reference is McDonald and Moffitt and it
> is documented many places.
>
>
>
> Austin Nichols wrote:
>>
>> Rich Steinberg<rsteinbe@iupui.edu> :
>> Not so--for many purposes, adding a constant to all observations is
>> worse, as demonstrated by the attached simulation.  Making the
>> missings (logged zeros) a number smaller than all others and running
>> -tobit- can work quite well if you pick the right number--but unless
>> you know the right answer already, you cannot be assured of picking
>> the right number, also demonstrated in the simulation below, and you
>> will often wind up worse off!  On the other hand, the -poisson-
>> approach, or equivalent -glm- call, works more generally, in the sense
>> of giving a distribution of estimated coefficients centered near the
>> true value and having a size of test (rejection rate) better than the
>> alternatives (closer to 5%).
>>
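>> clear all
>> * simlny: draw one sample, fit each estimator, return coef on x and test of b=1
>> prog simlny, rclass
>> * x and e are independent standard normals; the true coef on x is 1
>> drawnorm x e, clear n(1000)
>> * rounding to the nearest .5 turns small values of y into zeros
>> g y=round(exp(x+e),.5)
>> count if y==0
>> return scalar n0=r(N)
>> * (1) Poisson with robust SEs; zeros stay in the sample as they are
>> poisson y x, r
>> return scalar pois=_b[x]
>> test x=1
>> return scalar p_pois=r(p)
>> return scalar r_pois=r(p)<.05
>> * (2) OLS on ln(y+1)
>> g lny=ln(y+1)
>> reg lny x, r
>> return scalar reg=_b[x]
>> test x=1
>> return scalar p_reg=r(p)
>> return scalar r_reg=r(p)<.05
>> * (3) tobit on ln(y+1), censored at its minimum, ln(1)=0
>> tobit lny x, ll r
>> return scalar tobit=_b[x]
>> test x=1
>> return scalar p_tobit=r(p)
>> return scalar r_tobit=r(p)<.05
>> * (4) tobit on ln(y) with logged zeros set to ln(.25), censored there
>> g lny2=max(ln(y),ln(.25))
>> tobit lny2 x, ll r
>> return scalar t2=_b[x]
>> test x=1
>> return scalar p_t2=r(p)
>> return scalar r_t2=r(p)<.05
>> * (5) same as (4) with a much smaller limit, ln(.025)
>> g lny3=max(ln(y),ln(.025))
>> tobit lny3 x, ll r
>> return scalar t3=_b[x]
>> test x=1
>> return scalar p_t3=r(p)
>> return scalar r_t3=r(p)<.05
>> * clear e() so that -simulate- collects the r() scalars by default
>> eret clear
>> end
>> * 100 replications; each returned scalar becomes a variable
>> simul,rep(100) seed(1): simlny
>> * compare the sampling distributions of the estimated coefficients
>> tw kdensity pois||kdensity reg||kdensity tobit
>> tw kdensity pois||kdensity t2||kdensity t3,name(tx)
>> * squared error of each estimate around the true value 1
>> foreach v of varlist pois reg t* {
>>  g mse_`v'=(`v'-1)^2
>>  }
>> * rejection rates of the 5% test of b=1, then mean squared errors
>> su r_*
>> su mse_*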
>>
>> For those who have read down this far: increase the number of reps to
>> at least 10,000 for reasonable estimates, and try out different coefs
>> for x and the constant, and different imputations for the lower limit,
>> to see that the results are robust.  Also note that the effect of x on
>> ln(y+1) is not the same as the effect of x on ln(y) so of course the
>> "reg" results are not directly comparable, but many authors treat them
>> as if they are, and Rich mentions no post-estimation adjustment below,
>> so I have ignored that point in keeping with standard practice.
>>
>> On Wed, Jul 15, 2009 at 12:38 PM, Rich Steinberg<rsteinbe@iupui.edu>
>> wrote:
>>
>>>
>>> If, instead of just replacing the zeros, you add a small number (1 or 10)
>>> to every left hand variable before taking logs, you are not doing anything
>>> except slightly translating the origin of the log-log curve (of course,
>>> adjust the lower limit of the tobit if you do this).
>>> Since you probably have no reason to believe that the log-log curve ought
>>> to be centered at any particular point, this is ok -- you could even try
>>> different values of the constant to see which gives you the best fit.
>>>
>>> Whether you want to do this or the other ideas suggested by Austin depends
>>> on what the residuals look like for each.  It is a combination of a
>>> functional form and error distribution question.
>>>
>>> Austin Nichols wrote:
>>>
>>>>
>>>> Dana Chandler<dchandler@gmail.com> :
>>>> The log of zero is missing for a reason, as the quantity is undefined.
>>>> You should ignore the advice of anyone who suggests replacing the
>>>> zeros with ones before taking logs, which is demonstrably wrong, as is
>>>> the sometimes used strategy of replacing the zeros with a value (call
>>>> it u) smaller than any observed positive value, taking logs, then
>>>> applying -tobit- with a lower limit at ln(u). OTOH, -poisson- or -glm-
>>>> with a log link regressing y on x will give you results comparable to
>>>> regressing ln(y) on x and includes the y=0 cases in a natural way.
>>>> Make sure you use robust SEs and see also the help file for -ivpois-
>>>> on SSC.
>>>>
>>>> On Wed, Jul 15, 2009 at 9:17 AM, Dana Chandler<dchandler@gmail.com>
>>>> wrote:
>>>>
>>>>
>>>>>
>>>>> I would like to run a regression of a given variable's log on another
>>>>> set of variables. How should I handle the 0 values?
>>>>>
>>>>> I have searched for an answer and saw some people say that "you cannot
>>>>> run a regression with logs on values of zero... those values should be
>>>>> considered 'missing' in the regression." Another suggestion was to
>>>>> replace the 0s with 1s.
>>>>>
>>>>> Any thoughts?
>>>>>
>>>>> Dana
>>>>>
>>
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/statalist/faq
>> *   http://www.ats.ucla.edu/stat/stata/
>>
```
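Austin's recommendation in the thread (-poisson-, or the equivalent -glm- call with a log link, rather than logging y+1) can be tried on a single simulated sample. The sketch below is only illustrative: it reuses the data-generating process from his program, where the true coefficient on x is 1, but the seed, the variable name `lny1`, and the scalar names `b_pois`, `b_glm`, and `b_ols` are assumptions introduced here, and `rnormal()` assumes Stata 10.1 or later.

```stata
* Sketch only: one simulated sample, same data-generating process as in
* the thread (true coefficient on x is 1). Names below are illustrative.
clear all
set seed 1
set obs 1000
generate x = rnormal()
generate e = rnormal()
* rounding to the nearest .5 produces exact zeros in y
generate y = round(exp(x + e), .5)

* Poisson regression with robust SEs keeps the y==0 rows in the sample
poisson y x, vce(robust)
scalar b_pois = _b[x]

* The equivalent GLM call: Poisson family, log link
glm y x, family(poisson) link(log) vce(robust)
scalar b_glm = _b[x]

* The common shortcut for comparison: OLS on ln(y+1)
generate lny1 = ln(y + 1)
regress lny1 x, vce(robust)
scalar b_ols = _b[x]

display "poisson: " b_pois "   glm: " b_glm "   OLS on ln(y+1): " b_ols
```

Run this as a do-file. A single draw says nothing about bias or test size; for that, the replication loop in Austin's program is the relevant tool.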
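Rich's point 3, on the elasticity of Y+k versus the elasticity of Y, can be written out explicitly. The symbols $\alpha$, $\beta$, and $\eta_Y$ are notation introduced here for illustration and do not appear in the thread; the model is the log-log regression with a constant $k$ added to the dependent variable.

$$
\ln(Y+k) = \alpha + \beta \ln X
\quad\Longrightarrow\quad
\beta = \frac{\partial \ln(Y+k)}{\partial \ln X}
      = \frac{\partial Y}{\partial X}\cdot\frac{X}{Y+k}
$$

$$
\eta_Y \equiv \frac{\partial Y}{\partial X}\cdot\frac{X}{Y}
       = \beta\,\frac{Y+k}{Y}
       = \beta\left(1+\frac{k}{Y}\right)
$$

So a constant $\beta$ implies an elasticity of Y that varies with Y. The two are close wherever $k$ is small relative to $Y$, and the gap grows as $Y$ approaches zero.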