
Re: st: Handling 0 values when using logs of a dependent variable

From	Rich Steinberg <[email protected]>
To	[email protected]
Subject	Re: st: Handling 0 values when using logs of a dependent variable
Date	Wed, 15 Jul 2009 15:00:36 -0400

Thanks for the education, which I won't have time to absorb now (I haven't learned how to program in Stata) but will look back on. But here is some more advice for Dana.

1) It surely matters whether those zeros are true zeros or whether the missing data was coded as zero. Tobit treats both kinds of zeros the same -- it assumes that there is a latent variable that can take negative values but those negative values are censored. You might prefer to say that the zeros are real, not a code for censored values of the latent variable, and then Austin is right regardless because tobit's underlying assumption is logically inconsistent with the process generating the data. (In such a situation, tobit may or may not work well as an approximation -- that is part of what the simulation is about -- but logical consistency is worth something.)

2) Tobit has other problems that may make it a very bad choice in Dana's situation -- it is not consistent if the error term is non-normal, and it imposes proportionality between the marginal effect of the independent variable on the probability the depvar will be nonzero and on the increment to depvar when it starts out positive. So while adding a constant before logging is still necessary, CLAD, a user-provided Stata routine, is another alternative you should think about. CLAD is semi-parametric -- essentially quantile regression with censored depvar.

3) I don't think the point Austin raises at the bottom is very important, as a practical matter, if you choose a small constant (not just smaller than the lowest observed value, but much smaller). The coefficient of log X on log (Y+k) is the elasticity of Y+k with respect to X, which is going to be very close in value to the ordinary elasticity around the mean of your sample. Now, if you absolutely knew that the true (ordinary) elasticity was constant across the range of your sample, when you estimate a "Y+k" elasticity that is constant, the implied "Y" elasticity will be variable. But we almost never have a priori reason to believe in a true functional form that is constant elasticity, so I see no problem with this.

4) Regardless, in most tobit cases, you want to do mfx using ystar and report these results rather than the tobit coefficients. I assume you know about that, but if not, the classic reference is McDonald and Moffitt and it is documented many places.
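
To make point 3 concrete, here is a short worked derivation of how the "Y+k" elasticity relates to the ordinary elasticity. The symbols alpha, beta and k are illustrative notation rather than anything from the thread, and the sketch assumes the log-log relationship holds exactly.

```latex
% Assumed model: the log of (Y + k) is linear in log X.
\ln(Y + k) = \alpha + \beta \ln X
\quad\Longrightarrow\quad
\frac{d\ln(Y+k)}{d\ln X} = \beta .

% Ordinary elasticity of Y implied by that model:
\frac{d\ln Y}{d\ln X}
  = \frac{dY}{dX}\,\frac{X}{Y}
  = \beta\,\frac{Y+k}{Y}
  = \beta\left(1 + \frac{k}{Y}\right).
```

So the implied elasticity of Y is close to beta wherever k is small relative to Y, and it moves away from beta as Y approaches zero, which is the sense in which a constant "Y+k" elasticity implies a variable "Y" elasticity.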




Austin Nichols wrote:

Rich Steinberg<[email protected]> :
Not so--for many purposes, adding a constant to all observations is
worse, as demonstrated by the attached simulation.  Making the
missings (logged zeros) a number smaller than all others and running
-tobit- can work quite well if you pick the right number--but unless
you know the right answer already, you cannot be assured of picking
the right number, also demonstrated in the simulation below, and you
will often wind up worse off!  On the other hand, the -poisson-
approach, or equivalent -glm- call, works more generally, in the sense
of giving a distribution of estimated coefficients centered near the
true value and having a size of test (rejection rate) better than the
alternatives (closer to 5%).

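clear all
* One simulation draw. y = exp(x+e) rounded to the nearest .5, so the
* data-generating coefficient on x is 1 and small values round to zero.
prog simlny, rclass
drawnorm x e, clear n(1000)
g y=round(exp(x+e),.5)
* Number of zeros created by the rounding
count if y==0
return scalar n0=r(N)
* (1) Poisson regression of y on x, robust SEs; test H0: coef on x = 1
poisson y x, r
return scalar pois=_b[x]
test x=1
return scalar p_pois=r(p)
return scalar r_pois=r(p)<.05
* (2) OLS of ln(y+1) on x
g lny=ln(y+1)
reg lny x, r
return scalar reg=_b[x]
test x=1
return scalar p_reg=r(p)
return scalar r_reg=r(p)<.05
* (3) Tobit on ln(y+1); -ll- with no argument censors at the observed minimum
tobit lny x, ll r
return scalar tobit=_b[x]
test x=1
return scalar p_tobit=r(p)
return scalar r_tobit=r(p)<.05
* (4) Tobit on ln(y) with the logged zeros set to ln(.25)
g lny2=max(ln(y),ln(.25))
tobit lny2 x, ll r
return scalar t2=_b[x]
test x=1
return scalar p_t2=r(p)
return scalar r_t2=r(p)<.05
* (5) Same as (4), but with a much smaller imputed value, ln(.025)
g lny3=max(ln(y),ln(.025))
tobit lny3 x, ll r
return scalar t3=_b[x]
test x=1
return scalar p_t3=r(p)
return scalar r_t3=r(p)<.05
eret clear
end
* Run 100 replications and collect every returned scalar
simul,rep(100) seed(1): simlny
* Compare the sampling distributions of the estimated coefficients
tw kdensity pois||kdensity reg||kdensity tobit
tw kdensity pois||kdensity t2||kdensity t3,name(tx)
* Squared error of each estimate around the true value of 1
foreach v of varlist pois reg t* {
 g mse_`v'=(`v'-1)^2
 }
* Rejection rates at the 5% level, then mean squared errors
su r_*
su mse_*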

For those who have read down this far: increase the number of reps to
at least 10,000 for reasonable estimates, and try out different coefs
for x and the constant, and different imputations for the lower limit,
to see that the results are robust.  Also note that the effect of x on
ln(y+1) is not the same as the effect of x on ln(y) so of course the
"reg" results are not directly comparable, but many authors treat them
as if they are, and Rich mentions no post-estimation adjustment below,
so I have ignored that point in keeping with standard practice.

On Wed, Jul 15, 2009 at 12:38 PM, Rich Steinberg<[email protected]> wrote:

If, instead of just replacing the zeros, you add a small number (1 or 10) to
every left-hand-side variable before taking logs, you are not doing anything
except slightly translating the origin of the log-log curve. (Of course, adjust
the lower limit of the tobit if you do this.)
Since you probably have no reason to believe that the log-log curve ought to
be centered at any particular point, this is ok -- you could even try
different values of the constant to see which gives you the best fit.

Whether you want to do this or the other ideas suggested by Austin depends
on what the residuals look like for each.  It is a combination of a
functional form and error distribution question.

Austin Nichols wrote:

Dana Chandler<[email protected]> :
The log of zero is missing for a reason, as the quantity is undefined.
You should ignore the advice of anyone who suggests replacing the
zeros with ones before taking logs, which is demonstrably wrong, as is
the sometimes used strategy of replacing the zeros with a value (call
it u) smaller than any observed positive value, taking logs, then
applying -tobit- with a lower limit at ln(u). OTOH, -poisson- or -glm-
with a log link regressing y on x will give you results comparable to
regressing ln(y) on x and includes the y=0 cases in a natural way.
Make sure you use robust SEs and see also the help file for -ivpois-
on SSC.
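
The comparability claimed here can be sketched in equations. The notation (x_i, beta) is introduced only for illustration, and the sketch assumes the conditional mean of y is correctly specified as exponential in x.

```latex
% Log-link (Poisson / GLM) model for the conditional mean:
E[y \mid x] = \exp(x'\beta)
\quad\Longrightarrow\quad
\frac{\partial \ln E[y \mid x]}{\partial x_j} = \beta_j .

% Log-linear regression, defined only for y > 0:
E[\ln y \mid x] = x'\beta .

% Poisson first-order conditions, well defined when y_i = 0:
\sum_i \bigl( y_i - \exp(x_i'\beta) \bigr)\, x_i = 0 .
```

In both models beta_j is read as a proportional effect of x_j on y, but only the first set of estimating equations can use the observations with y = 0 without any transformation.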

On Wed, Jul 15, 2009 at 9:17 AM, Dana Chandler<[email protected]> wrote:

I would like to run a regression of a given variable's log on another
set of variables. How should I handle the 0 values?

I have searched for an answer and saw some people say that "you cannot
run a regression with logs on values of zero... those values should be
considered 'missing' in the regression." Another suggestion was to
replace the 0s with 1s.

Any thoughts?

Thanks in advance,
Dana


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

