Let me summarise the situation as I see it. This tries
a slightly more general pitch than the current thread.
I want to work with the logarithm of a variable, but that
variable contains zero values. What should I do?
---------------------------------------------------------
Be warned: this is treacherous territory in which competent
and experienced people can disagree. What is best to do will
depend on your research problem and on your data, so no
unequivocal advice can be given.
Be further warned: if you also have negative values, your
problem is even worse. Often, the short answer is just that
you are thinking of something that is not a good idea. Some
of the points below apply, sometimes with modifications to fit.
1. The logarithm of a variable is not defined whenever
that variable is zero. In Stata (or Mata) log(0) returns
. (numeric missing).
2. There are many statistical and scientific reasons
to work with logged variables as either responses or predictors.
These range from situations in which theory suggests a model
that can be linearised by logarithmic transformation (e.g.
radioactive decay, exponential growth) to situations in which
logarithmic transformation is defensible empirically as helping
to improve behaviour, e.g. by making relationships more nearly linear.
3. Although it is not as widely known as it should be, the fact
that some values of a response are zero is in itself no barrier
to fitting a model that has a logarithmic link function (in
generalized linear model terminology). This is so whenever the
key assumption is how log(mean response) varies, rather than how
mean(log response) varies. This is the case with, in particular,
Poisson regression and more widely with generalized linear models.
This parallels a much more widely known point: the fact that
logit of 0 and logit of 1 are indeterminate is irrelevant to
the applicability of logit models for binary responses. Naturally,
the fact that you can apply such models does not establish that
they work well or are justifiable otherwise.
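
As a rough illustration of point 3, here is a small sketch in Python
(not Stata), using made-up counts and a hand-rolled Newton-Raphson fit
rather than any packaged routine. The data and the function name
fit_poisson_log_link are assumptions for illustration only. It shows that
log(mean response) is perfectly well defined when some responses are
zero, whereas mean(log response) is not, and that a Poisson fit with log
link recovers the log of the group means.

```python
import math

# Hypothetical count response containing zeros, with one binary predictor.
y = [0, 0, 1, 2, 0, 3, 5, 4]
x = [0, 0, 0, 0, 1, 1, 1, 1]

def mean(values):
    return sum(values) / len(values)

# log(mean response) is defined within each group although some y are zero.
log_mean_x0 = math.log(mean([yi for yi, xi in zip(y, x) if xi == 0]))
log_mean_x1 = math.log(mean([yi for yi, xi in zip(y, x) if xi == 1]))

# mean(log response) is not: Python raises an error on log(0)
# (Stata would return missing instead).
try:
    mean([math.log(yi) for yi in y])
    mean_log_defined = True
except ValueError:
    mean_log_defined = False

def fit_poisson_log_link(y, x, iterations=25):
    """Newton-Raphson for the model log(E[y]) = b0 + b1*x (Poisson likelihood)."""
    b0, b1 = math.log(mean(y)), 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        # Information matrix (2 x 2) and its determinant
        h00 = sum(mu)
        h01 = sum(mi * xi for mi, xi in zip(mu, x))
        h11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        # Newton step: add inverse(information) times score
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

b0, b1 = fit_poisson_log_link(y, x)
print("intercept:", b0, "log(mean | x = 0):", log_mean_x0)
print("slope:", b1, "difference in log means:", log_mean_x1 - log_mean_x0)
print("mean(log y) defined?", mean_log_defined)
```

With a single binary predictor the fitted intercept and slope should
match the log of the group means, zeros and all. None of this shows
that such a model is a good one for your data.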
4. Perhaps surprisingly, zeros in a predictor that you want
to take logarithms of pose a more awkward problem. No one seems
to have a solution that does not in turn seem problematic.
(a) The zeros are in your sample, but do they belong in your
target population? Perhaps you should exclude those observations.
The issue here is one of relevance: data irrelevant to a problem
do not belong in a modelling exercise. If you were working on
cats, and some dogs wandered into your dataset, you would usually
exclude them as the result of a simple mistake. But what
determines relevance? If your predictor is coffee consumption,
people who drink no coffee would usually be considered relevant:
they give information on boundary or limiting conditions, part
of the specification of a well-posed problem.
(b) If you were confident that the zeros were really a kind
of missing value, imputation might appear justifiable. In
environmental data, zero concentrations of something nasty might
just mean failure to detect very small concentrations with the
technology available. With some social science variables, zeros
might just imply that people lied, or more generally that data
production had failed to yield valid measurements. There are
occasional media reports of people who claim to go without eating
indefinitely, and the only real issue is how they are cheating.
Nevertheless, celibates and abstainers of all kinds do exist. Your
field may have literature on how best to deal with data that are
best regarded as censored or truncated.
(c) Some people fudge or nudge the zeros to small positive values.
This is more defensible if it is a correction for a measurement
problem (see (b) above) than because it is thought the only way
to apply logarithms. Either way, it really needs to be
flagged in reports and well defended too. If data are integers,
so that the smallest positive values are 1 and 2, then nudging 0
to 0.5 has some appeal as not only a compromise between 0 and 1
but -- more to the point here -- implying that successive
values 0.5, 1, 2 are equally spaced on a logarithmic scale.
But the appeal here is at best one of simplicity or symmetry;
it does not apply beyond 2, and it does not reflect a statistical argument. More
generally, the idea is to replace the zeros by half the
smallest non-zero measurement, given some convention about
resolution of measurement (e.g. to a fixed number of decimal
places using agreed units). Note particularly that very
small nudges are more problematic, not less, because after
logging they place the values that were zeros further away
from the others, and so are very likely to produce outliers
on that variable.
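
The arithmetic behind (c) can be sketched in a few lines of Python
(not Stata). The values below are made up for illustration. The sketch
checks that 0.5, 1, 2 are equally spaced on a log scale, computes the
"half the smallest non-zero value" replacement, and shows how the gap
between the nudged zeros and the smallest real value grows as the
nudge shrinks.

```python
import math

# Nudging 0 to 0.5 when the smallest positive values are 1 and 2:
# on a base-2 log scale the three values are one unit apart.
log2_nudged = [math.log2(v) for v in [0.5, 1, 2]]
print(log2_nudged)

# Hypothetical data with zeros: replace zeros by half the smallest
# non-zero value.
data = [0, 0, 3, 5, 12]
half_smallest = min(v for v in data if v > 0) / 2
nudged_data = [v if v > 0 else half_smallest for v in data]
print(nudged_data)

# Smaller nudges push the former zeros further from the rest after logging.
# Gap = log(1) - log(nudge), with 1 as the smallest real value.
nudges = [0.5, 0.01, 1e-6]
gaps = [math.log(1) - math.log(n) for n in nudges]
for n, gap in zip(nudges, gaps):
    print(f"nudge {n:g}: gap on log scale = {gap:.2f}")
```

The growing gap is what makes tiny nudges likely to manufacture
outliers on the logged variable.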
(d) A related idea is to adjust all values by a transformation
of the form log(x + constant). A constant of 1 appeals to
some, especially if the variable x is a count, but again this is
more a matter of simplicity than anything else. log(x + 1)
goes to 0 as x goes to 0 and behaves like log(x) for large x.
Estimating the constant from the data appeals to others, especially with
the rhetoric of letting the data themselves indicate the best transformation.
A trick of forcing the constant to be positive by a parameterisation like
log(x + exp(alpha)) -- so that exp(alpha) is always positive, whatever
the sign of alpha -- is objectionable on dimensional grounds, unless x
is a pure number.
This is because the result of exponentiation is a dimensionless number
without units and so can be validly added only to another dimensionless
number. Either way, using log(x + constant) rather than
log(x) gives a different model, not a different way of using log(x)
as a predictor. There is, however, scope for a sensitivity analysis:
vary the constant, and see how far the model results vary.
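
A minimal sketch of such a sensitivity analysis, in Python (not Stata)
and with made-up data: the predictor x, the response y, the list of
constants and the helper ols_slope are all assumptions for
illustration. It regresses y on log(x + constant) by ordinary least
squares for several constants and prints the slope each time. It also
checks the two properties of log(x + 1) mentioned in (d).

```python
import math

def ols_slope(u, v):
    """Slope from the least-squares regression of v on u."""
    u_bar = sum(u) / len(u)
    v_bar = sum(v) / len(v)
    sxy = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    sxx = sum((ui - u_bar) ** 2 for ui in u)
    return sxy / sxx

# Made-up predictor containing zeros, and a made-up response.
x = [0, 0, 1, 2, 5, 10, 50, 100]
y = [1.0, 1.4, 2.1, 2.4, 3.2, 3.9, 5.6, 6.1]

# Vary the constant and refit: each constant defines a different model.
constants = [0.001, 0.1, 0.5, 1]
slopes = {c: ols_slope([math.log(xi + c) for xi in x], y) for c in constants}
for c, b in slopes.items():
    print(f"constant {c:g}: slope on log(x + constant) = {b:.3f}")

# log(x + 1) is 0 at x = 0 and close to log(x) for large x.
at_zero = math.log(0 + 1)
gap_large = math.log(1e6 + 1) - math.log(1e6)
print(at_zero, gap_large)
```

If the slopes move a great deal as the constant changes, the
conclusions depend on an arbitrary choice, which is the situation
described in the quotation from Oneal and Russett below.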
Nick
n.j.cox@durham.ac.uk
Austin Nichols
> In re: adding alpha to X to make ln(X) nonmissing
> Why does this operation come up so often, when it is so often a bad
> idea? I have seen several papers this week that add some constant to
> X so that ln(X) can be regressed on some variables, or some variable
> can be regressed on it. Wouldn't you be just as well off imputing
> 2*atan(X)-2*atan(1) or somesuch? Is there a well-known good reference
> on this subject?
>
> Just now, when looking up the ref for an adjacent thread on btscs.ado,
> I ran across Oneal & Russett (2001) which acknowledges that Beck,
> Katz, and Tucker (1998) pointed out an error, and then replies to
> another critique with this (p.480):
> "
> Before taking the logarithm [of trade volume in $millions] we assigned
> a different value to the trade variable for dyads that report no
> trade. Some value must be imputed because the logarithm of zero is
> undefined. We use $100,000 [so really it was ln(0.1)]; Green, Kim,
> and Yoon used $1. It is this that accounts for most of the
> differences between our results and theirs.
> "
> Oneal, John R. and Bruce Russett. 2001. Clear and Clean: The Fixed
> Effects of the Liberal Peace. International Organization, Vol. 55, No.
> 2. (Spring, 2001), pp. 469-485.
> http://links.jstor.org/sici?sici=0020-8183%28200121%2955%3A2%3C469%3ACACTFE%3E2.0.CO%3B2-A
>
> Green, Donald P., Soo Yeon Kim, and David H. Yoon. 2001. "Dirty Pool."
> International Organization, Vol. 55, No. 2. (Spring, 2001), pp.
> 441-468.
> http://links.jstor.org/sici?sici=0020-8183%28200121%2955%3A2%3C441%3ADP%3E2.0.CO%3B2-N
Beck, Nathaniel, Jonathan N. Katz and Richard Tucker. 1998. Taking
Time Seriously: Time-Series-Cross-Section Analysis with a Binary
Dependent Variable. American Journal of Political Science, 42:
1260-1288.
See also:
ssc install transint
h transint
http://www.stata.com/statalist/archive/2006-11/msg00294.html