Statalist


RE: st: Handling 0 values when using logs of a dependent variable


From   "Nick Cox" <n.j.cox@durham.ac.uk>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: Handling 0 values when using logs of a dependent variable
Date   Thu, 16 Jul 2009 15:10:58 +0100

Rich seems to me to duck Austin's main point which I take to be that
using log(x + constant) just raises the question of what -constant- is
to be, and that is difficult to get right, to say the least. Here x
could be, and often is, the response variable. Also at a minimum
-constant- must ensure that the logarithm yields determinate real
values. 

A common myth, which both Rich and Allan seem to echo, is that making
-constant- very small while still positive is a very small fudge but
this is the exact opposite of the case! Think of log(1/million) =
-log(million), log(1/billion) = -log(billion), etc. You create
substantial outliers in terms of the transformed responses if you use
very small -constant-. 
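
A minimal sketch of that point, using made-up data (the values of y and
the three constants are illustrative assumptions only):

```stata
* Illustrative data only: a few zeros among modest positive values
clear
input y
0
0
1
3
8
20
end

* Same transformation, three choices of -constant-
generate double ln_1   = ln(y + 1)
generate double ln_em6 = ln(y + 1e-6)
generate double ln_em9 = ln(y + 1e-9)

* The zeros land at 0, about -13.8 and about -20.7 respectively,
* while the positive values all stay between 0 and roughly 3
list, noobs
summarize ln_*
```

The smaller the constant, the further the transformed zeros sit from
the rest of the data.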

Rich's point 3) seems to hinge on the idea that getting an elasticity is
the key aim. Problems and above all disciplines differ, but I've been
calculating regressions for nearly 40 years and have yet to want an
elasticity as an end result. 

The more fundamental issue is why logarithms are wanted in the first
place. To me the best reason is that it matches the fact that the
underlying dynamics are essentially multiplicative rather than additive,
which is reflected in the model specification. That's difficult to
square with both the presence of zeros and the use of log(x + constant).


If the rationale for using logarithms is to make the distribution more
symmetrical or more nearly normal, then the problem is different, but
one easy alternative is to use a power function instead. Also, people often
want to transform the response to get a more congenial distribution and
forget that the distribution of error terms is more often the important
thing -- and often not so crucial either. 

A further standard point touched upon repeatedly is what the zeros
represent. There seems to be a spectrum between two extremes:

1. At least some zeros represent a qualitatively different group, e.g.
those who reply zero to exports, smoking, Twitter activity, SAS use,
etc., because they don't export, smoke, Twitter, use SAS, etc. There are
numerous different ways to model that mixed situation, including leaving
out the qualitatively different observations, zero-inflated models,
two-part models, etc. Don't ask me what the etc. means: I just assume my
list can't be complete. 

2. Zeros just mean not observed, and nothing more. This seems more
standard in ecology where zero means no fishes caught, no birds seen, no
plants in the quadrat, whatever, not that they don't exist somewhere nearby. 
Somewhat like Allan, or so I imagine, I've often seen log(x + 1) in
ecological literature, although usually with justification ranging from
missing to vague. At least it's preferable to cond(x == 0, 0, log(x)). 

The white magic of -glm, link(log)- with observed zeros seems to deserve
even more publicity! I don't have enough experience with -tobit- to
comment purposefully, although I do suspect that it is a little
oversold. 
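
A minimal sketch of the -glm, link(log)- approach on a response with
observed zeros. The data are simulated purely for illustration, the
coefficients 0.5 and 1 are arbitrary, and a Stata recent enough to have
-rnormal()- and -rpoisson()- is assumed:

```stata
* Simulated data, purely for illustration: E(y|x) = exp(0.5 + x)
clear
set seed 2009
set obs 1000
generate x = rnormal()
generate y = rpoisson(exp(0.5 + x))

* How many observed zeros are there?
count if y == 0

* The log link models log E(y|x), not E(log y), so zeros in y need no fudge
glm y x, family(poisson) link(log) vce(robust)

* Same point estimates from -poisson- with robust standard errors
poisson y x, vce(robust)
```

No constant is added and no observation is dropped; the zeros simply
enter as data.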

Nick 
n.j.cox@durham.ac.uk 

Allan Reese
===========

I've seen two examples very recently where someone used ln(x+1) as a
reflex action.  Don't agree that ln(0) is undefined; it's mathematically
well defined but as an asymptote.  Agree that the answer to Dana's
question requires more background information. 

First point to check is whether the zeros are actual data values or
themselves are missing values. ln(.) really is .

Second option is that zero is the code for "too small to measure" or
"below limit of detection".  In that case replacing the zeros with
teeny-weeny values or using tobit may be the best choice.

The usual reason for using ln(x+1) is when x is a count variable, which
is left-truncated but discrete and can go indefinitely large.  Counts
typically show a unimodal highly-skewed distribution (Poisson or
otherwise).  Mapping zero to zero, ln(1), does not offend and ln(x+1) is
often then treated as a normal variate; ie, you can do anova on it.

If, however, x is a ratio (y/z) then 1 represents the mid-point and both
y=0 and z=0 cause problems.  When y is less than z, ln(y/z) is a
negative number, and by taking ln(x+1) in one of my examples half the
data was effectively dropped by mapping into the range 0 to ln(2).  If
the raw data are y and z, you have options in GLMs.  If you have only x,
a tobit with left and right censored values is appropriate.
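
The compression described for ratios can be seen in a small sketch with
made-up values of x = y/z, chosen here to be symmetric around 1 on the
log scale:

```stata
* Made-up ratios x = y/z, purely for illustration
clear
input x
0.1
0.25
0.5
1
2
4
10
end

generate double ln_x   = ln(x)
generate double ln_xp1 = ln(x + 1)

* ln(x) treats 0.1 and 10 as mirror images (about -2.30 and 2.30);
* ln(x + 1) packs every x below 1 into the interval from 0 to ln(2)
list, noobs
display ln(2)
```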

Rich Steinberg
==============

Thanks for the education, which I won't have time to absorb now (I 
haven't learned how to program in stata) but will look back on.  But 
here is some more advice for Dana.

1) it surely matters whether those zeros are true zeros or whether the 
missing data was coded as zero.  Tobit treats  both kinds of zeros the 
same -- it assumes that there is a latent variable that can take 
negative values but those negative values are censored.  You might 
prefer to say that the zeros are real, not a code for censored values of
the latent variable, and then Austin is right regardless because tobit's
underlying assumption is logically inconsistent with the process 
generating the data.  (In such a situation, tobit may or may not work 
well as an approximation -- that is part of what the simulation is about
-- but logical consistency is worth something). 

2) Tobit has other problems that may make it a very bad choice in Dana's
situation -- it is not consistent if the error term is non-normal, and 
it imposes proportionality between the marginal effect of the 
independent variable on the probability the depvar will be nonzero and 
on the increment to depvar when it starts out positive.  So while adding
a constant before logging is still necessary, CLAD, a user-provided 
stata routine, is another alternative you should think about. CLAD is 
semi-parametric -- essentially quantile regression with censored depvar.

3) I don't think the point Austin raises at the bottom is very 
important, as a practical matter, if you choose a small constant (not 
just smaller than the lowest observed value, but much smaller).  The 
coefficient on log X in a regression of log(Y+k) is the elasticity of
Y+k with respect to X, which is going to be very close in value to the
ordinary 
elasticity around the mean of your sample.  Now, if you absolutely knew 
that the true (ordinary) elasticity was constant across the range of 
your sample, when you estimate a "Y+k" elasticity that is constant, the 
implied "Y" elasticity will be variable.  But we almost never have a 
priori reason to believe in a true functional form that is constant 
elasticity, so I see no problem with this. 

4) Regardless, in most tobit cases, you want to do mfx using ystar and 
report these results rather than the tobit coefficients.  I assume you 
know about that, but if not, the classic reference is McDonald and 
Moffitt and it is documented many places.
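
Point 3) can be made concrete with a little arithmetic. If
log(Y+k) = a + b log X, differentiating both sides gives an implied
elasticity of Y itself equal to b(Y+k)/Y. A sketch with hypothetical
values of b and k (not taken from any data in this thread):

```stata
* Hypothetical slope b from regressing ln(Y + k) on ln(X), with k = 1
local b = 0.8
local k = 1

* Implied elasticity of Y is b*(Y + k)/Y: close to b when Y is large
* relative to k, and far from b when Y is near zero
foreach Y of numlist 0.5 1 5 50 500 {
    display "Y = " %6.1f `Y' "   implied elasticity of Y = " %6.3f `b' * (`Y' + `k') / `Y'
}
```

How much this matters depends on how much of the sample lies at values
of Y that are small relative to k.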

Austin Nichols
==============

> Rich Steinberg<rsteinbe@iupui.edu> :
> Not so--for many purposes, adding a constant to all observations is
> worse, as demonstrated by the attached simulation.  Making the
> missings (logged zeros) a number smaller than all others and running
> -tobit- can work quite well if you pick the right number--but unless
> you know the right answer already, you cannot be assured of picking
> the right number, also demonstrated in the simulation below, and you
> will often wind up worse off!  On the other hand, the -poisson-
> approach, or equivalent -glm- call, works more generally, in the sense
> of giving a distribution of estimated coefficients centered near the
> true value and having a size of test (rejection rate) better than the
> alternatives (closer to 5%).
>
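> clear all
> * one replication: simulate data, fit each estimator, and return the
> * slope, the p-value for a test of b=1, and a rejection indicator
> prog simlny, rclass
> * x and e are independent standard normal; true coefficient on x is 1
> drawnorm x e, clear n(1000)
> * rounding to the nearest 0.5 produces exact zeros in y
> g y=round(exp(x+e),.5)
> count if y==0
> return scalar n0=r(N)
> * (a) Poisson regression of y on x with robust SEs
> poisson y x, r
> return scalar pois=_b[x]
> test x=1
> return scalar p_pois=r(p)
> return scalar r_pois=r(p)<.05
> * (b) OLS on ln(y+1)
> g lny=ln(y+1)
> reg lny x, r
> return scalar reg=_b[x]
> test x=1
> return scalar p_reg=r(p)
> return scalar r_reg=r(p)<.05
> * (c) tobit on ln(y+1), lower limit at the observed minimum
> tobit lny x, ll r
> return scalar tobit=_b[x]
> test x=1
> return scalar p_tobit=r(p)
> return scalar r_tobit=r(p)<.05
> * (d) tobit with logged zeros (missing) set to ln(.25); max() ignores missing
> g lny2=max(ln(y),ln(.25))
> tobit lny2 x, ll r
> return scalar t2=_b[x]
> test x=1
> return scalar p_t2=r(p)
> return scalar r_t2=r(p)<.05
> * (e) as (d) but with a much smaller lower limit, ln(.025)
> g lny3=max(ln(y),ln(.025))
> tobit lny3 x, ll r
> return scalar t3=_b[x]
> test x=1
> return scalar p_t3=r(p)
> return scalar r_t3=r(p)<.05
> * clear e() so that -simulate- collects the r() scalars by default
> eret clear
> end
> simul,rep(100) seed(1): simlny
> * compare sampling distributions of the estimated slopes
> tw kdensity pois||kdensity reg||kdensity tobit
> tw kdensity pois||kdensity t2||kdensity t3,name(tx)
> * squared error of each estimate around the true value of 1
> foreach v of varlist pois reg t* {
>  g mse_`v'=(`v'-1)^2
>  }
> * means of r_* are rejection rates; means of mse_* are mean squared errors
> su r_*
> su mse_*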
>
> For those who have read down this far: increase the number of reps to
> at least 10,000 for reasonable estimates, and try out different coefs
> for x and the constant, and different imputations for the lower limit,
> to see that the results are robust.  Also note that the effect of x on
> ln(y+1) is not the same as the effect of x on ln(y) so of course the
> "reg" results are not directly comparable, but many authors treat them
> as if they are, and Rich mentions no post-estimation adjustment below,
> so I have ignored that point in keeping with standard practice.
>
> On Wed, Jul 15, 2009 at 12:38 PM, Rich Steinberg<rsteinbe@iupui.edu> wrote:
>
>> If, instead of just replacing the zeros, you add a small number (1 or 10)
>> to every left hand variable b4 taking logs, you are not doing anything
>> except slightly translating the origin of the log log curve. (of course,
>> adjust the lower limit of the tobit if you do this).
>> Since you probably have no reason to believe that the log-log curve ought
>> to be centered at any particular point, this is ok -- you could even try
>> different values of the constant to see which gives you the best fit.
>>
>> Whether you want to do this or the other ideas suggested by Austin depends
>> on what the residuals look like for each.  It is a combination of a
>> functional form and error distribution question.
>>
>> Austin Nichols wrote:
>>
>>> Dana Chandler<dchandler@gmail.com> :
>>> The log of zero is missing for a reason, as the quantity is undefined.
>>> You should ignore the advice of anyone who suggests replacing the
>>> zeros with ones before taking logs, which is demonstrably wrong, as is
>>> the sometimes used strategy of replacing the zeros with a value (call
>>> it u) smaller than any observed positive value, taking logs, then
>>> applying -tobit- with a lower limit at ln(u). OTOH, -poisson- or -glm-
>>> with a log link regressing y on x will give you results comparable to
>>> regressing ln(y) on x and includes the y=0 cases in a natural way.
>>> Make sure you use robust SEs and see also the help file for -ivpois-
>>> on SSC.
>>>
>>> On Wed, Jul 15, 2009 at 9:17 AM, Dana Chandler<dchandler@gmail.com> wrote:
>>>
>>>> I would like to run a regression of a given variable's log on another
>>>> set of variables. How should I handle the 0 values?
>>>>
>>>> I have searched for an answer and saw some people say that "you cannot
>>>> run a regression with logs on values of zero... those values should be
>>>> considered 'missing' in the regression." Another suggestion was to
>>>> replace the 0s with 1s.
>>>>



