
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



Re: st: About taking log on zero values


From   Nick Cox <[email protected]>
To   "[email protected]" <[email protected]>
Subject   Re: st: About taking log on zero values
Date   Thu, 20 Feb 2014 10:15:20 +0000

This is a good technique. Here is another way to code it.

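* smallest strictly positive value of sales; with zeros present the
* overall minimum would be 0 and log(r(min)) would be missing
su sales if sales > 0, meanonly
gen logsales = cond(sales == 0, log(r(min)), log(sales))
gen nosales = sales == 0 if sales < .
regress y x1 x2 logsales nosales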
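
A quick check of this kind of coding on a toy dataset may be useful. The values below are made up purely for illustration. The point is only that every zero should end up carrying the log of the smallest positive value of sales, and that no observation is lost to missing values.

```stata
* Made-up data, purely for illustration
clear
input sales
0
0
20
150
1000
end

* smallest strictly positive value of sales
summarize sales if sales > 0, meanonly
generate double logsales = cond(sales == 0, log(r(min)), log(sales))
generate byte nosales = sales == 0 if sales < .

list sales logsales nosales

* zeros now carry log(20), so nothing is dropped for being missing
assert !missing(logsales)
assert logsales == log(20) if nosales
```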

Another way to think about this is to fit the model without sales first,

regress y x1 x2

and look at the residuals vs sales.

predict res, res
scatter res sales

That might give guidance on how to treat sales.
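
One wrinkle with that plot: if sales is shown on a logarithmic axis, the zeros drop out of the graph. A minimal sketch on simulated data follows. The data-generating lines and the seed are assumptions for illustration, not the original poster's data. It plots the positive values on a log scale and summarizes the residuals for the zeros separately.

```stata
* Simulated data, for illustration only
clear
set seed 2014
set obs 200
generate x1 = rnormal()
generate x2 = rnormal()
* about 20% of firms have zero sales; the rest are lognormal
generate sales = cond(runiform() < 0.2, 0, exp(rnormal(4, 1)))
generate byte nosales = sales == 0
generate y = 1 + x1 - x2 + cond(nosales, -1, 0.5 * log(sales)) + rnormal()

* fit without sales and keep the residuals
regress y x1 x2
predict double res, residuals

* positive sales on a log axis (zeros cannot be shown there)
scatter res sales if sales > 0, xscale(log)

* residuals for zero versus positive sales, side by side
graph box res, over(nosales)
tabstat res, by(nosales) statistics(mean sd n)
```

If the zero-sales group sits clearly apart from the trend among positive sales, that points toward giving zeros their own indicator rather than adding a constant before logging.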
Nick
[email protected]


On 20 February 2014 09:57, Maarten Buis <[email protected]> wrote:
> One option you could also consider is to treat the value 0 as
> special, needing its own effect. This depends on whether 0 means
> "literally nothing" or "so small that it could not be detected". In the
> former case you would often want to treat the value 0 as qualitatively
> different, while in the latter case adding a small but not too small
> number to the 0 values could be justified.
>
> If you want to treat the value 0 as qualitatively
> different, then I would do something like this:
>
> gen byte nosales = (sales == 0) if sales < .
> gen logsales = ln(sales)
> sum logsales, meanonly
> replace logsales = r(min) if nosales == 1
> reg y x1 x2 logsales nosales
>
> In that case the coefficient for logsales can be interpreted as
> before, but refers only to sales > 0. The coefficient for nosales
> represents the difference in expected value of y between those units
> with no sales at all and those units with the smallest non-zero sales.
>
> Hope this helps,
> Maarten
>
>
> On Wed, Feb 19, 2014 at 9:11 PM, Nick Cox <[email protected]> wrote:
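>> Stata would ignore observations with numeric missings in anything like a
>> regression calculation: any observation with a missing value on a
>> variable in the model is left out of the estimation sample.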
>>
>> That applies also to missings that result from calculating log(0).
>>
>> Changing values of 0 to values of 1 so that you can take logarithms is
>> not something I would call "usual practice". It is, I suspect,
>> regarded differently by different people on a spectrum from unethical
>> and incorrect to an acceptable fudge, depending partly on the rest of
>> the data and what you are doing with them.
>>
>> An incomplete list of things to think about:
>>
>> 0. If values of 1 occur otherwise, you have created an inconsistency.
>> If values between 0 and 1 occur otherwise, you have created a bigger
>> one. Applying log(x + 1) consistently solves this problem only by
>> creating another. Applying log(x + 1) and pretending that it is really
>> applying log(x) is not widely accepted.
>>
>> 1. If 0 really means what it says, changing it to 1 is a
>> falsification. Whether you can put a spin on it as an acceptable or
>> necessary falsification is an open question.
>>
>> 2. If 0 really means "small but not detected", changing it to e.g.
>> half the smallest observable value is sometimes an accepted or acceptable
>> modification.
>>
>> 3. Replacing log(0) with log(1) is not necessarily even a small and
>> conservative modification. If, apart from the values of 0, values range
>> from exp(3) to exp(6), then after logging you have 0 and otherwise a
>> range of 3 to 6. You may have _created_ a bundle of outliers that will
>> dominate analyses.
>>
>> 4. Doing something about 0s is only necessary with logarithmic
>> transformation. If you have 0s in the response, you can leave them and
>> use a logarithmic link. That won't necessarily be a good model, but
>> using a logarithmic link doesn't require positive values in the
>> response, only that the mean function be always positive. (This
>> doesn't apply in your case as the variable in question is a
>> predictor.)
>>
>> 5. There are usually alternatives, such as transformations other than
>> logarithms.
>>
>> 6. I wouldn't do anything without considering some kind of sensitivity
>> analysis, i.e. a consideration of how much difference an arbitrary
>> treatment of zeros makes.
>>
>> 7. There is often an argument that implies that the observations with
>> zeros don't belong anyway.
>>
>> (I have generalised your question, but suspect that zero values for
>> sales usually mean exactly what they say.)
>>
>> Nick
>> [email protected]
>>
>> On 19 February 2014 19:44, Sebastian Say
>> <[email protected]> wrote [edited]
>>
>>> My question is about how Stata treats a log-transformed variable
>>> that draws upon an original variable that contains zero.
>>>
>>> In my dataset, I have firm sales data but some of them have values of zero. I
>>> created a logsales variable and noticed that those with zeros are
>>> indicated as a "."
>>>
>>> I plan to run a regression, e.g.
>>>
>>> reg y x1 x2 logsales
>>>
>>> My question is, how would Stata treat these "." if I do not remove them?
>>>
>>> Technically the "." should be undefined.
>>>
>>> I've read some papers and they usually put a 1 for those sales data
>>> with zeros in them. Is this a usual practice?
>> *
>> *   For searches and help try:
>> *   http://www.stata.com/help.cgi?search
>> *   http://www.stata.com/support/faqs/resources/statalist-faq/
>> *   http://www.ats.ucla.edu/stat/stata/
>
>
>
> --
> ---------------------------------
> Maarten L. Buis
> WZB
> Reichpietschufer 50
> 10785 Berlin
> Germany
>
> http://www.maartenbuis.nl
> ---------------------------------

