Statalist The Stata Listserver


st: RE: adding to X to make ln(X) nonmissing [was BTSCS and Non-linear MLE programming]

From   "Nick Cox" <[email protected]>
To   <[email protected]>
Subject   st: RE: adding to X to make ln(X) nonmissing [was BTSCS and Non-linear MLE programming]
Date   Fri, 2 Feb 2007 14:07:47 -0000

Let me summarise the situation as I see it. This tries
a slightly more general pitch than the current thread. 

I want to work with the logarithm of a variable, but that 
variable contains zero values. What should I do? 

Be warned: this is treacherous territory in which competent
and experienced people can disagree. What is best to do will
depend on your research problem and on your data, so no 
unequivocal advice can be given. 

Be further warned: if you also have negative values, your
problem is even worse. Often, the short answer is just that
you are thinking of something that is not a good idea. Some
of the points below apply, sometimes with modifications to fit. 

1. The logarithm of a variable is not determinate whenever 
that variable is zero. In Stata (or Mata) log(0) returns 
. (numeric missing). 
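
For readers working outside Stata, point 1 can be mimicked with a rough Python sketch. The helper `safe_log` is hypothetical, not a Stata or Python built-in. Python's `math.log(0)` raises an error rather than returning missing, so the helper maps zero and negative input to NaN, which here stands in (loosely) for Stata's numeric missing.

```python
import math

def safe_log(x):
    """Return log(x) for x > 0, else NaN (standing in for Stata's missing)."""
    if x is None or math.isnan(x) or x <= 0:
        return math.nan
    return math.log(x)

values = [0.0, 0.5, 1.0, 10.0]          # illustrative values only
logged = [safe_log(v) for v in values]

# Count how many observations would be lost after logging.
n_missing = sum(math.isnan(v) for v in logged)
print(logged, n_missing)
```

Counting the NaNs before modelling shows how many observations a plain log transformation would silently drop.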

2. There are many statistical and scientific reasons 
to work with logged variables as either responses or predictors. 
These range from situations in which theory suggests a model
that can be linearised by logarithmic transformation (e.g. 
radioactive decay, exponential growth) to situations in which
logarithmic transformation is defensible empirically as helping 
to improve behaviour, e.g. by making relationships more nearly linear. 

3. Although it is not as widely known as it should be, the fact 
that some values of a response are zero is in itself no barrier 
to fitting a model that has a logarithmic link function (in 
generalized linear model terminology). This is so whenever the
key assumption is how log(mean response) varies, rather than how
mean(log response) varies. This is the case with, in particular, 
Poisson regression and more widely with generalized linear models. 
This parallels a much more widely known point: the fact that 
logit of 0 and logit of 1 are indeterminate is irrelevant to 
the applicability of logit models for binary responses. Naturally, 
the fact that you can apply such models does not establish that 
they work well or are justifiable otherwise. 
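
Point 3 can be illustrated with a minimal sketch in Python, assuming a single predictor and a Poisson likelihood. The function name `fit_poisson_loglink` and the data are invented for illustration; in Stata one would normally reach for -poisson- or -glm- instead. The fit uses Newton-Raphson on log(mean response) = b0 + b1*x, and at no stage is the logarithm of any response value taken, so zeros in the response cause no difficulty.

```python
import math

def fit_poisson_loglink(x, y, max_iter=100, tol=1e-12):
    """Newton-Raphson for log(E[y]) = b0 + b1*x with a Poisson likelihood."""
    b0 = math.log(sum(y) / len(y))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector X'(y - mu): only y itself is used, never log(y).
        s0 = sum(yi - mi for yi, mi in zip(y, mu))
        s1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix X' diag(mu) X, a 2 x 2 symmetric matrix.
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: inverse information times score.
        d0 = (h11 * s0 - h01 * s1) / det
        d1 = (h00 * s1 - h01 * s0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0, 0, 1, 0, 2, 3, 6, 9]   # made-up counts, three of them zero
b0, b1 = fit_poisson_loglink(x, y)
fitted = [math.exp(b0 + b1 * xi) for xi in x]
print(round(b0, 3), round(b1, 3))
```

The fitted means are all positive even where the observed counts are zero, which is the sense in which the model concerns log(mean response) rather than mean(log response).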

4. Perhaps surprisingly, zeros in a predictor that you want 
to take logarithms of pose a more awkward problem. No one seems
to have a solution that does not in turn seem problematic. 

(a) The zeros are in your sample, but do they belong in your
target population? Perhaps you should exclude those observations. 
The issue here is one of relevance: data irrelevant to a problem
do not belong in a modelling exercise. If you were working on 
cats, and some dogs wandered into your dataset, you would usually
exclude them as the result of a simple mistake. But what 
determines relevance? If your predictor is coffee consumption, 
people who drink no coffee would usually be considered relevant:
they give information on boundary or limiting conditions, part
of the specification of a well-posed problem. 

(b) If you were confident that the zeros were really a kind 
of missing value, imputation might appear justifiable. In 
environmental data, zero concentrations of something nasty might 
just mean failure to detect very small concentrations with the 
technology available. With some social science variables, zeros 
might just imply that people lied, or more generally that data 
production had failed to yield valid measurements. There are 
occasional media reports of people claiming to go without eating 
indefinitely, and the only real issue is how they are cheating. 
Nevertheless celibates and abstainers of all kinds do exist. Your 
field may have literature on how best to deal with data that are 
best regarded as censored or truncated. 

(c) Some people fudge or nudge the zeros to small positive values. 
This is more defensible if it is a correction for a measurement 
problem (see (b) above) than because it is thought the only way 
to apply logarithms. Either way, it really needs to be
flagged in reports and well defended too. If data are integers, 
so that the smallest positive values are 1 and 2, then nudging 0
to 0.5 has some appeal as not only a compromise between 0 and 1
but -- more to the point here -- implying that successive 
values 0.5, 1, 2 are equally spaced on a logarithmic scale. 
But the appeal here is at best one of simplicity or symmetry, 
does not apply beyond 2, and does not reflect a statistical argument. More
generally, the idea is to replace the zeros by half the 
smallest non-zero measurement, given some convention about 
resolution of measurement (e.g. to a fixed number of decimal
places using agreed units). Note particularly that very 
small nudges are more problematic, not less, because after 
logging they place the values that were zeros further away 
from the others, and so are very likely to produce outliers 
on that variable. 
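
The arithmetic behind (c) can be checked with a short Python sketch. The nudge values and the helper name `gap_below_one` are chosen purely for illustration, and the smallest positive value is assumed to be 1.

```python
import math

# 0.5, 1, 2 are equally spaced on a log scale: each step is log(2).
steps = [math.log(1) - math.log(0.5), math.log(2) - math.log(1)]

def gap_below_one(nudge):
    """Distance in log units between a nudged zero and the value 1."""
    return math.log(1) - math.log(nudge)

# Smaller nudges push the former zeros further from the rest of the data.
for nudge in (0.5, 0.1, 0.001, 1e-6):
    print(nudge, round(gap_below_one(nudge), 2))
```

A nudge to 0.5 leaves the former zeros about 0.69 log units below 1, whereas a nudge to 0.000001 leaves them almost 14 log units below, which is how very small nudges manufacture outliers.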

(d) A related idea is to adjust all values of the variable by a 
transformation of the form log(x + constant). A constant of 1 appeals to  
some, especially if the variable x is a count, but again this is 
more a matter of simplicity than anything else. log(x + 1)
goes to 0 as x goes to 0 and behaves like log(x) for large x. 
Estimating the constant from the data appeals to others, especially with 
the rhetoric of letting the data themselves indicate the best transformation. 
A trick of forcing the constant to be positive by a parameterisation like
log(x + exp(alpha)) -- so that exp(alpha) will always be positive, 
regardless of whether alpha is -- 
is objectionable on dimensional grounds, unless x is a pure number. 
This is because the result of exponentiation is a dimensionless number 
without units and so can be validly added only to another dimensionless
number. Either way, using log(x + constant) rather than 
log(x) gives a different model, not a different way of using log(x)
as a predictor. There is, however, scope for a sensitivity analysis:
vary the constant, and see how far the model results vary. 
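
One way to run the sensitivity analysis suggested in (d) is sketched below in Python. The data, the set of constants, and the helper `ols_slope` are all invented for illustration; a real analysis would use your own model and a range of constants that makes sense for your units of measurement.

```python
import math

def ols_slope(u, y):
    """Slope from a simple least-squares regression of y on u."""
    n = len(u)
    ubar = sum(u) / n
    ybar = sum(y) / n
    sxy = sum((ui - ubar) * (yi - ybar) for ui, yi in zip(u, y))
    sxx = sum((ui - ubar) ** 2 for ui in u)
    return sxy / sxx

# Invented data: a predictor containing zeros, and a response.
x = [0, 0, 1, 2, 5, 10, 20, 50]
y = [1.0, 1.4, 2.1, 2.6, 3.4, 4.1, 4.9, 5.8]

# Refit with log(x + c) as the predictor for several constants c.
slopes = {}
for c in (0.01, 0.1, 0.5, 1.0):
    logged = [math.log(xi + c) for xi in x]
    slopes[c] = ols_slope(logged, y)
    print(c, round(slopes[c], 3))
```

If the slope (or whatever result matters to you) moves a great deal as the constant changes, the conclusions rest largely on an arbitrary choice, and that deserves reporting.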

[email protected] 

Austin Nichols
> In re: adding alpha to X to make ln(X) nonmissing
> Why does this operation come up so often, when it is so often a bad
> idea?  I have seen several papers this week that add some constant to
> X so that ln(X) can be regressed on some variables, or some variable
> can be regressed on it.  Wouldn't you be just as well off imputing
> 2*atan(X)-2*atan(1) or somesuch?  Is there a well-known good reference
> on this subject?
> Just now, when looking up the ref for an adjacent thread on btscs.ado,
> I ran across Oneal & Russett (2001) which acknowledges that Beck,
> Katz, and Tucker (1998) pointed out an error, and then replies to
> another critique with this (p.480):
> "
> Before taking the logarithm [of trade volume in $millions] we assigned
> a different value to the trade variable for dyads that report no
> trade.  Some value must be imputed because the logarithm of zero is
> undefined.  We use $100,000 [so really it was ln(0.1)]; Green, Kim,
> and Yoon used $1.  It is this that accounts for most of the
> differences between our results and theirs.
> "
> Oneal, John R. and Bruce Russett. 2001. Clear and Clean: The Fixed
> Effects of the Liberal Peace. International Organization, Vol. 55, No.
> 2. (Spring, 2001), pp. 469-485.
> Green, Donald P., Soo Yeon Kim, and David H. Yoon. 2001. "Dirty Pool."
> International Organization, Vol. 55, No. 2. (Spring, 2001), pp.
> 441-468.

Beck, Nathaniel, Jonathan N. Katz and Richard Tucker. 1998. Taking
Time Seriously: Time-Series-Cross-Section Analysis with a Binary
Dependent Variable. American Journal of Political Science, 42:

See also:
ssc install transint
h transint
