Statalist The Stata Listserver



st: RE: RE: adding to X to make ln(X) nonmissing [was BTSCS and Non-linear MLE programming]


From   "Benito, Andrew" <ABenito@imf.org>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: RE: adding to X to make ln(X) nonmissing [was BTSCS and Non-linear MLE programming]
Date   Fri, 2 Feb 2007 09:26:49 -0500

Hi.
I would suggest looking at the following reference:

    Burbidge, J.B., Magee, L. and Robb, L. (1988), `Alternative Transformations to Handle Extreme Values of the Dependent Variable', Journal of the American Statistical Association, 83, 123-7.
    
It suggests using an inverse hyperbolic sine function to replace the log in such cases (related to Nick's point (d)). The functional form, with the dampening factor set to 1, is sinh⁻¹(x) = ln(x + √(1 + x²)). More generally (i.e. without the dampening factor fixed in that way), it involves non-linear estimation.
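
A minimal sketch of the transformation, written in Python purely for illustration (it is not part of the original message). The names ihs, ihs_explicit and theta are hypothetical labels, and the scaled form asinh(theta*x)/theta is an assumption about how the dampening factor enters, so check it against the paper before relying on it.

```python
import math

def ihs(x, theta=1.0):
    """Inverse hyperbolic sine with a scale parameter theta.

    theta is an illustrative name for the dampening factor; theta = 1
    gives the plain asinh(x) quoted in the message.
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    return math.asinh(theta * x) / theta

def ihs_explicit(x):
    """The theta = 1 case written out as ln(x + sqrt(1 + x^2))."""
    return math.log(x + math.sqrt(1.0 + x * x))

# Unlike log(x), the function is defined at zero and for negative values.
for x in (-10.0, 0.0, 0.5, 10.0, 1000.0):
    print(x, ihs(x), ihs_explicit(x))
```

With theta fixed at 1 this is an ordinary variable transformation; it is only when theta is estimated from the data that the non-linear estimation mentioned above comes in. For large positive x the result is close to ln(2x), which is why it can stand in for the log.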

Andrew 



-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Nick Cox
Sent: Friday, February 02, 2007 9:08 AM
To: statalist@hsphsun2.harvard.edu
Subject: st: RE: adding to X to make ln(X) nonmissing [was BTSCS and Non-linear MLE programming]

Let me summarise the situation as I see it. This tries a slightly more general pitch than the current thread. 

I want to work with the logarithm of a variable, but that variable contains zero values. What should I do? 
---------------------------------------------------------

Be warned: this is treacherous territory in which competent and experienced people can disagree. What is best to do will depend on your research problem and on your data, so no unequivocal advice can be given. 

Be further warned: if you also have negative values, your problem is even worse. Often, the short answer is just that you are thinking of something that is not a good idea. Some of the points below apply, sometimes with modifications to fit. 

1. The logarithm of a variable is not determinate whenever that variable is zero. In Stata (or Mata) log(0) returns . (numeric missing). 

2. There are many statistical and scientific reasons to work with logged variables as either responses or predictors. These range from situations in which theory suggests a model that can be linearised by logarithmic transformation (e.g. radioactive decay, exponential growth) to situations in which logarithmic transformation is defensible empirically as helping to improve behaviour, e.g. by making relationships more nearly linear. 

3. Although it is not as widely known as it should be, the fact that some values of a response are zero is in itself no barrier to fitting a model that has logarithmic link function (in generalized linear model terminology). This is so whenever the key assumption is how log(mean response) varies, rather than how mean(log response) varies. This is the case with, in particular, Poisson regression and more widely with generalized linear models. 
This parallels a much more widely known point: the fact that logit of 0 and logit of 1 are indeterminate is irrelevant to the applicability of logit models for binary responses. Naturally, the fact that you can apply such models does not establish that they work well or are justifiable otherwise. 
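
The distinction in point 3 between log(mean response) and mean(log response) can be sketched with a few made-up numbers. This is Python for illustration only; the data and the function names log_of_mean and mean_of_logs are hypothetical, and the sketch shows just the arithmetic, not a fitted Poisson or generalized linear model.

```python
import math

y = [0, 0, 1, 2, 5]  # made-up response values, including zeros

def log_of_mean(values):
    """log(mean response): defined whenever the mean is positive."""
    return math.log(sum(values) / len(values))

def mean_of_logs(values):
    """mean(log response): returns None if any value is not positive."""
    if any(v <= 0 for v in values):
        return None
    return sum(math.log(v) for v in values) / len(values)

print(log_of_mean(y))   # finite, because the mean of y is positive
print(mean_of_logs(y))  # None, because log(0) is not defined
```

A model with a log link works with the first quantity, so zeros in the response cause no difficulty as long as the mean is positive.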

4. Perhaps surprisingly, zeros in a predictor that you want to take logarithms of pose a more awkward problem. No one seems to have a solution that does not in turn seem problematic. 

(a) The zeros are in your sample, but do they belong in your target population? Perhaps you should exclude those observations. 
The issue here is one of relevance: data irrelevant to a problem do not belong in a modelling exercise. If you were working on cats, and some dogs wandered into your dataset, you would usually exclude them as the result of a simple mistake. But what determines relevance? If your predictor is coffee consumption, people who drink no coffee would usually be considered relevant: they give information on boundary or limiting conditions, part of the specification of a well-posed problem. 

(b) If you were confident that the zeros were really a kind of missing value, imputation might appear justifiable. In environmental data, zero concentrations of something nasty might just mean failure to detect very small concentrations with the technology available. With some social science variables, zeros might just imply that people lied, or more generally that data production had failed to yield valid measurements. There are occasional media reports of people claiming not to eat indefinitely, and the only real issue is how they are cheating. Nevertheless celibates and abstainers of all kinds do exist. Your field may have literature on how best to deal with data that are best regarded as censored or truncated. 

(c) Some people fudge or nudge the zeros to small positive values. 
This is more defensible if it is a correction for a measurement problem (see (b) above) than because it is thought the only way to apply logarithms. Either way, it really needs to be flagged in reports and well defended too. If data are integers, so that the smallest valid values are 1 and 2, then nudging 0 to 0.5 has some appeal as not only a compromise between 0 and 1 but -- more to the point here -- implying that successive values 0.5, 1, 2 are equally spaced on a logarithmic scale. 
But the appeal here is at best one of simplicity or symmetry, does not apply beyond 2, and does not reflect a statistical argument. More generally, the idea is to replace the zeros by half the smallest non-zero measurement, given some convention about resolution of measurement (e.g. to a fixed number of decimal places using agreed units). Note particularly that very small nudges are more problematic, not less, because after logging they place the values that were zeros further away from the others, and so are very likely to produce outliers on that variable. 
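
The arithmetic behind point (c) can be checked directly. The following is a small Python sketch for illustration; log_gaps is a hypothetical helper name, and the nudge values are arbitrary choices, not recommendations.

```python
import math

def log_gaps(values):
    """Differences between successive values after taking logs."""
    logs = [math.log(v) for v in values]
    return [b - a for a, b in zip(logs, logs[1:])]

# 0.5, 1, 2 are equally spaced on a log scale: each gap is log(2).
print(log_gaps([0.5, 1, 2]))

# The smaller the nudge, the further the former zeros sit from 1 after logging.
for nudge in (0.5, 0.1, 0.001, 1e-6):
    print(nudge, math.log(nudge), log_gaps([nudge, 1])[0])
```

The same arithmetic applies to the trade example quoted later in this thread: with trade measured in $millions, replacing zeros by $100,000 gives ln(0.1), whereas replacing them by $1 gives ln(0.000001), a far more extreme value.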

(d) A related idea is to adjust all values by a transformation of the form log(x + constant). A constant of 1 appeals to some, especially if the variable x is a count, but again this is more a matter of simplicity than anything else. log(x + 1) goes to 0 as x goes to 0 and behaves like log(x) for large x. 
Estimating the constant from the data appeals to others, especially with the rhetoric of letting the data themselves indicate the best transformation. 
A trick of forcing the constant to be positive by a parameterisation like log(x + exp(alpha)) -- so that exp(alpha) will always be positive, regardless of whether alpha is -- is objectionable on dimensional grounds, unless x is a pure number. 
This is because the result of exponentiation is a dimensionless number without units and so can be validly added only to another dimensionless number. Either way, using log(x + constant) rather than log(x) gives a different model, not a different way of using log(x) as a predictor. There is, however, scope for a sensitivity analysis: vary the constant, and see how far the model results vary. 
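
One way to sketch the sensitivity analysis suggested in (d): refit the same simple regression of y on log(x + constant) for several constants and compare the coefficients. This is Python for illustration only. The data are made up, the helper names ols_slope_intercept and sensitivity are hypothetical, and a hand-rolled one-predictor least-squares fit stands in for whatever model is actually of interest.

```python
import math

def ols_slope_intercept(x, y):
    """Least-squares slope and intercept for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def sensitivity(x, y, constants):
    """Fit y on log(x + c) for each constant c; return {c: (slope, intercept)}."""
    results = {}
    for c in constants:
        logged = [math.log(v + c) for v in x]
        results[c] = ols_slope_intercept(logged, y)
    return results

# Made-up data: a predictor with some zeros and a response.
x = [0, 0, 1, 2, 4, 8, 16, 32]
y = [1.0, 1.3, 2.1, 2.8, 3.9, 4.7, 6.1, 6.8]

for c, (slope, intercept) in sensitivity(x, y, [0.01, 0.1, 0.5, 1.0]).items():
    print(f"constant={c:<5} slope={slope:.3f} intercept={intercept:.3f}")
```

If the estimates move a lot as the constant changes, the conclusions depend on an arbitrary choice, and that needs to be reported.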

Nick
n.j.cox@durham.ac.uk 

Austin Nichols
 
> In re: adding alpha to X to make ln(X) nonmissing. Why does this 
> operation come up so often, when it is so often a bad idea?  I have 
> seen several papers this week that add some constant to X so that 
> ln(X) can be regressed on some variables, or some variable can be 
> regressed on it.  Wouldn't you be just as well off imputing
> 2*atan(X)-2*atan(1) or somesuch?  Is there a well-known good reference 
> on this subject?
> 
> Just now, when looking up the ref for an adjacent thread on btscs.ado, 
> I ran across Oneal & Russett (2001) which acknowledges that Beck, 
> Katz, and Tucker (1998) pointed out an error, and then replies to 
> another critique with this (p.480):
> "
> Before taking the logarithm [of trade volume in $millions] we assigned 
> a different value to the trade variable for dyads that report no 
> trade.  Some value must be imputed because the logarithm of zero is 
> undefined.  We use $100,000 [so really it was ln(0.1)]; Green, Kim, 
> and Yoon used $1.  It is this that accounts for most of the 
> differences between our results and theirs.
> "
> Oneal, John R. and Bruce Russett. 2001. Clear and Clean: The Fixed 
> Effects of the Liberal Peace. International Organization, Vol. 55, No.
> 2. (Spring, 2001), pp. 469-485.
> http://links.jstor.org/sici?sici=0020-8183%28200121%2955%3A2%3C469%3ACACTFE%3E2.0.CO%3B2-A
> 
> Green, Donald P., Soo Yeon Kim, and David H. Yoon. 2001. "Dirty Pool."
> International Organization, Vol. 55, No. 2. (Spring, 2001), pp.
> 441-468.
> http://links.jstor.org/sici?sici=0020-8183%28200121%2955%3A2%3C441%3ADP%3E2.0.CO%3B2-N

Beck, Nathaniel, Jonathan N. Katz and Richard Tucker. 1998. Taking Time Seriously: Time-Series-Cross-Section Analysis with a Binary Dependent Variable. American Journal of Political Science, 42: 1260-1288.

See also:
ssc install transint
h transint
http://www.stata.com/statalist/archive/2006-11/msg00294.html

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



