Re: st: Handling 0 values when using logs of a dependent variable

From   Dana Chandler <>
Subject   Re: st: Handling 0 values when using logs of a dependent variable
Date   Thu, 16 Jul 2009 02:04:25 -0500

Many thanks to everyone who took the time to respond to this. Thanks
especially to Rich for taking the time to explain how to run a simulation
and see how other popularly recommended substitution methods fall short.


On Wed, Jul 15, 2009 at 2:00 PM, Rich Steinberg<> wrote:
> Thanks for the education, which I won't have time to absorb now (I haven't
> learned how to program in Stata) but will look back on.  But here is some
> more advice for Dana.
> 1) It surely matters whether those zeros are true zeros or whether
> missing data were coded as zero.  Tobit treats both kinds of zeros the same
> -- it assumes that there is a latent variable that can take negative values
> but that those negative values are censored.  You might prefer to say that
> the zeros are real, not a code for censored values of the latent variable,
> and then Austin is right regardless, because tobit's underlying assumption
> is logically inconsistent with the process generating the data.  (In such a
> situation, tobit may or may not work well as an approximation -- that is
> part of what the simulation is about -- but logical consistency is worth
> something.)
> 2) Tobit has other problems that may make it a very bad choice in Dana's
> situation -- it is not consistent if the error term is non-normal, and it
> imposes proportionality between the marginal effect of an independent
> variable on the probability that the depvar is nonzero and its marginal
> effect on the depvar when the depvar is already positive.  So while adding
> a constant before logging is still necessary, CLAD, a user-written Stata
> routine, is another alternative you should think about.  CLAD is
> semi-parametric -- essentially quantile regression with a censored depvar.
> 3) I don't think the point Austin raises at the bottom is very important, as
> a practical matter, if you choose a small constant (not just smaller than
> the lowest observed value, but much smaller).  The coefficient on log X in a
> regression of log(Y+k) is the elasticity of Y+k with respect to X, which is
> going to be very close in value to the ordinary elasticity around the mean
> of your sample.  Now, if you absolutely knew that the true (ordinary)
> elasticity was constant across the range of your sample, then when you
> estimate a constant "Y+k" elasticity, the implied "Y" elasticity will be
> variable.  But we almost never have an a priori reason to believe in a true
> functional form that is constant elasticity, so I see no problem with this.
> 4) Regardless, in most tobit cases, you want to run mfx using ystar and
> report those results rather than the tobit coefficients.  I assume you know
> about that, but if not, the classic reference is McDonald and Moffitt, and
> it is documented in many places.
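
Point 4 above can be sketched in Stata roughly as follows. This is only an illustration under stated assumptions: the auto dataset, the rescaled variable wgt, and the censoring limit of 17 are stand-ins that do not come from this thread, and -margins- (Stata 11 and later) is used in place of the -mfx- command mentioned above.

```stata
* Sketch only: dataset, variable names, and the limit of 17 are illustrative
sysuse auto, clear
generate wgt = weight/1000

* Tobit with mpg treated as left-censored at 17
tobit mpg wgt, ll(17)

* Average marginal effect of wgt on the censored expected value E(y*|x),
* where y* = max(17, latent y); this is the "ystar" prediction
margins, dydx(wgt) predict(ystar(17,.))
```

With -mfx-, the corresponding call would be -mfx compute, predict(ystar(17,.))-. The effect on the censored mean should be smaller in absolute value than the tobit coefficient, because it scales the coefficient by the probability of being uncensored.
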
> Austin Nichols wrote:
>> Rich Steinberg<> :
>> Not so--for many purposes, adding a constant to all observations is
>> worse, as demonstrated by the attached simulation.  Making the
>> missings (logged zeros) a number smaller than all others and running
>> -tobit- can work quite well if you pick the right number--but unless
>> you know the right answer already, you cannot be assured of picking
>> the right number, also demonstrated in the simulation below, and you
>> will often wind up worse off!  On the other hand, the -poisson-
>> approach, or equivalent -glm- call, works more generally, in the sense
>> of giving a distribution of estimated coefficients centered near the
>> true value and having a size of test (rejection rate) better than the
>> alternatives (closer to 5%).
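>>
>> * One replication: simulate data, fit each estimator, return the results
>> clear all
>> prog simlny, rclass
>> * y = exp(x+e) before rounding, so the true coefficient on x is 1;
>> * rounding to the nearest .5 creates exact zeros
>> drawnorm x e, clear n(1000)
>> g y=round(exp(x+e),.5)
>> count if y==0
>> return scalar n0=r(N)
>> * (a) Poisson regression of y on x with robust SEs
>> poisson y x, r
>> return scalar pois=_b[x]
>> test x=1
>> return scalar p_pois=r(p)
>> return scalar r_pois=r(p)<.05
>> * (b) OLS on ln(y+1)
>> g lny=ln(y+1)
>> reg lny x, r
>> return scalar reg=_b[x]
>> test x=1
>> return scalar p_reg=r(p)
>> return scalar r_reg=r(p)<.05
>> * (c) tobit on ln(y+1); -ll- with no value censors at the observed minimum
>> tobit lny x, ll r
>> return scalar tobit=_b[x]
>> test x=1
>> return scalar p_tobit=r(p)
>> return scalar r_tobit=r(p)<.05
>> * (d) tobit on ln(y) with the zeros set to ln(.25)
>> g lny2=max(ln(y),ln(.25))
>> tobit lny2 x, ll r
>> return scalar t2=_b[x]
>> test x=1
>> return scalar p_t2=r(p)
>> return scalar r_t2=r(p)<.05
>> * (e) tobit on ln(y) with the zeros set to ln(.025)
>> g lny3=max(ln(y),ln(.025))
>> tobit lny3 x, ll r
>> return scalar t3=_b[x]
>> test x=1
>> return scalar p_t3=r(p)
>> return scalar r_t3=r(p)<.05
>> * Clear e() so that -simulate- defaults to collecting the r() scalars
>> eret clear
>> end
>> * Run the replications, then compare the estimators
>> simul,rep(100) seed(1): simlny
>> * Kernel densities of the estimated coefficients
>> tw kdensity pois||kdensity reg||kdensity tobit
>> tw kdensity pois||kdensity t2||kdensity t3,name(tx)
>> * Squared error of each estimate relative to the true value of 1
>> foreach v of varlist pois reg t* {
>>  g mse_`v'=(`v'-1)^2
>>  }
>> * Rejection rates of the true null (nominal size 5%) and mean squared errors
>> su r_*
>> su mse_*
>>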
>> For those who have read down this far: increase the number of reps to
>> at least 10,000 for reasonable estimates, and try out different coefs
>> for x and the constant, and different imputations for the lower limit,
>> to see that the results are robust.  Also note that the effect of x on
>> ln(y+1) is not the same as the effect of x on ln(y) so of course the
>> "reg" results are not directly comparable, but many authors treat them
>> as if they are, and Rich mentions no post-estimation adjustment below,
>> so I have ignored that point in keeping with standard practice.
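
The non-comparability noted above can be made explicit. As a sketch, treat y as a smooth function of x and ignore the error term (both simplifying assumptions); k stands for the added constant, so k = 1 for ln(y+1):

```latex
\frac{\partial \ln(y+k)}{\partial x} = \frac{1}{y+k}\,\frac{\partial y}{\partial x},
\qquad
\frac{\partial \ln y}{\partial x} = \frac{1}{y}\,\frac{\partial y}{\partial x}
\quad\Longrightarrow\quad
\frac{\partial \ln y}{\partial x}
  = \frac{y+k}{y}\,\frac{\partial \ln(y+k)}{\partial x}
  = \left(1+\frac{k}{y}\right)\frac{\partial \ln(y+k)}{\partial x}.
```

Under these assumptions, a slope b from the ln(y+k) regression implies an effect on ln(y) of b(1 + k/y), which is close to b only where y is large relative to k and grows without bound as y approaches zero. The same factor links the "Y+k" elasticity and the ordinary elasticity when x is also in logs.
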
>> On Wed, Jul 15, 2009 at 12:38 PM, Rich Steinberg<> wrote:
>>> If, instead of just replacing the zeros, you add a small number (1 or 10)
>>> to every value of the left-hand-side variable before taking logs, you are
>>> not doing anything except slightly translating the origin of the log-log
>>> curve. (Of course, adjust the lower limit of the tobit if you do this.)
>>> Since you probably have no reason to believe that the log-log curve ought
>>> to be centered at any particular point, this is OK -- you could even try
>>> different values of the constant to see which gives you the best fit.
>>> Whether you want to do this or the other ideas suggested by Austin
>>> depends on what the residuals look like for each.  It is a combination of
>>> a functional form and error distribution question.
>>> Austin Nichols wrote:
>>>> Dana Chandler<> :
>>>> The log of zero is missing for a reason, as the quantity is undefined.
>>>> You should ignore the advice of anyone who suggests replacing the
>>>> zeros with ones before taking logs, which is demonstrably wrong, as is
>>>> the sometimes used strategy of replacing the zeros with a value (call
>>>> it u) smaller than any observed positive value, taking logs, then
>>>> applying -tobit- with a lower limit at ln(u). OTOH, -poisson- or -glm-
>>>> with a log link regressing y on x will give you results comparable to
>>>> regressing ln(y) on x and includes the y=0 cases in a natural way.
>>>> Make sure you use robust SEs and see also the help file for -ivpois-
>>>> on SSC.
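
A minimal sketch of the -poisson- / -glm- suggestion above, using simulated data. The seed, sample size, variable names, and data-generating process are assumptions chosen for illustration (they mirror the later simulation in this thread only loosely).

```stata
* Sketch only: simulated data with exact zeros in the outcome
clear
set seed 12345
set obs 1000
generate x = rnormal()
* Rounding to the nearest .5 sends small positive values to 0
generate y = round(exp(x + rnormal()), .5)
count if y == 0

* Poisson regression of y (in levels) on x with robust standard errors;
* the zeros stay in the estimation sample
poisson y x, vce(robust)
scalar b_pois = _b[x]

* The same model written as a GLM with a log link
glm y x, family(poisson) link(log) vce(robust)
scalar b_glm = _b[x]

display "poisson: " b_pois "   glm: " b_glm
```

The two commands fit the same model, so the coefficients should agree up to numerical tolerance. Stata notes that y is not a count but still estimates the model, which is the intended use here.
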
>>>> On Wed, Jul 15, 2009 at 9:17 AM, Dana Chandler<> wrote:
>>>>> I would like to run a regression of a given variable's log on another
>>>>> set of variables. How should I handle the 0 values?
>>>>> I have searched for an answer and saw some people say that "you cannot
>>>>> run a regression with logs on values of zero... those values should be
>>>>> considered 'missing' in the regression." Another suggestion was to
>>>>> replace the 0s with 1s.
>>>>> Any thoughts?
>>>>> Thanks in advance,
>>>>> Dana