
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at


Re: st: About taking log on zero values

From   Maarten Buis
Subject   Re: st: About taking log on zero values
Date   Thu, 20 Feb 2014 10:57:14 +0100

One option you could also consider is to treat the value 0 as a
special value that needs its own effect. Whether that is appropriate
depends on whether 0 means "literally nothing" or "so small that it
could not be detected". In the former case you would often want to
treat the value 0 as qualitatively different, while in the latter case
adding a small but not too small number to the 0 values could be
justified.
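
For the "could not be detected" case, a minimal sketch could look like
the following. It assumes that half the smallest observed positive
value of sales is a reasonable stand-in for the undetected amounts;
that choice is only an assumption, and the variable name logsales_dl
is made up for this illustration.

```stata
* smallest observed positive, non-missing value of sales
sum sales if sales > 0 & sales < ., meanonly
* assumption: zeros are replaced by half that value before taking logs;
* missing sales stay missing
gen logsales_dl = ln(cond(sales == 0, r(min)/2, sales))
```

Whatever number is added, it is worth checking how much the results
change when it is varied.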

If you want to treat the value 0 as qualitatively different, then I
would do something like this:

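* indicator for zero sales (missing if sales is missing)
gen byte nosales = (sales == 0) if sales < .
* ln(0) is missing, so logsales is missing where sales == 0
gen logsales = ln(sales)
* meanonly still stores r(min), the smallest observed log sales
sum logsales, meanonly
replace logsales = r(min) if nosales == 1
reg y x1 x2 logsales nosales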

In that case the coefficient for logsales can be interpreted as
before, but refers only to sales > 0. The coefficient for nosales
represents the difference in expected value of y between those units
with no sales at all and those units with the smallest non-zero sales.
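
The value that is filled in for the zero-sales units only determines
the reference point for nosales. As a sketch, assuming the variables
from the commands above are in memory (logsales0 is a made-up name),
filling in 0 instead of the minimum gives the same coefficients for
x1, x2 and log sales, because the two models are reparameterizations
of one another:

```stata
* assumption: same data as above; use 0 = ln(1) for the zero-sales units
gen logsales0 = cond(nosales == 1, 0, ln(sales)) if sales < .
reg y x1 x2 logsales0 nosales
```

Here the coefficient for nosales compares units with no sales at all
with units whose sales equal 1, which may lie far outside the observed
range. That is why I would fill in the observed minimum.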

Hope this helps,

On Wed, Feb 19, 2014 at 9:11 PM, Nick Cox wrote:
> Stata would ignore numeric missings in anything like a regression calculation.
> That applies also to missings that result from calculating log(0).
> Changing values of 0 to values of 1 so that you can take logarithms is
> not something I would call "usual practice". It is, I suspect,
> regarded differently by different people on a spectrum from unethical
> and incorrect to an acceptable fudge, depending partly on the rest of
> the data and what you are doing with them.
> An incomplete list of things to think about:
> 0. If values of 1 occur otherwise, you have created an inconsistency.
> If values between 0 and 1 occur otherwise, you have created a bigger
> one. Applying log(x + 1) consistently solves this problem only by
> creating another. Applying log(x + 1) and pretending that it is really
> applying log(x) is not widely accepted.
> 1. If 0 really means what it says, changing it to 1 is a
> falsification. Whether you can put a spin on it as an acceptable or
> necessary falsification is an open question.
> 2. If 0 really means "small but not detected", changing it to e.g.
> half the smallest observable value is sometimes an accepted or acceptable
> modification.
> 3. Replacing log(0) with log(1) is not, necessarily, even a small and
> conservative modification. If apart from the values of 0 values range
> from e3 to e6 then after logging you have 0 and otherwise a range of 3
> to 6. You may have _created_ a bundle of outliers that will dominate
> analyses.
> 4. Doing something about 0s is only necessary with logarithmic
> transformation. If you have 0s in the response, you can leave them and
> use a logarithmic link. That won't necessarily be a good model, but
> using a logarithmic link doesn't require positive values in the
> response, only that the mean function be always positive. (This
> doesn't apply in your case as the variable in question is a
> predictor.)
> 5. There are usually alternatives, such as transformations other than
> logarithms.
> 6. I wouldn't do anything without considering some kind of sensitivity
> analysis, i.e. a consideration of how much difference an arbitrary
> treatment of zeros makes.
> 7. There is often an argument that implies that the observations with
> zeros don't belong anyway.
> (I have generalised your question, but suspect that zero values for
> sales usually mean exactly what they say.)
> Nick
> On 19 February 2014 19:44, Sebastian Say wrote [edited]
>> My question is about how Stata treats a log-transformed variable
>> that draws upon an original variable that contains zero.
>> In my dataset, I have firm sales data but some of them have values of zero. I
>> created a logsales variable and noticed that those with zeros are
>> indicated as a "."
>> I plan to run a regression, e.g.
>> reg y x1 x2 logsales
>> My question is, how would Stata treat these "." if I do not remove them?
>> Technically the "." should be undefined.
>> I've read some papers and they usually put a 1 for those sales data
>> with zeros in them. Is this a usual practice?

Maarten L. Buis
Reichpietschufer 50
10785 Berlin
