Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From: Nick Cox <njcoxstata@gmail.com>
To: "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject: Re: st: About taking log on zero values
Date: Thu, 20 Feb 2014 10:15:20 +0000

This is a good technique. Here is another way to code it.

* restrict to positive sales: otherwise r(min) is 0 and log(r(min)) is missing
su sales if sales > 0, meanonly
gen logsales = cond(sales == 0, log(r(min)), log(sales))
gen nosales = sales == 0
regress y x1 x2 logsales nosales

Another way to think about this.

regress y x1 x2

and look at the residuals vs sales.

predict res, res
scatter res sales

That might give guidance on how to treat sales.

Nick
njcoxstata@gmail.com

On 20 February 2014 09:57, Maarten Buis <maartenlbuis@gmail.com> wrote:
> One option you could also consider is that you treat the value 0 as
> special which needs its own effect. This depends on whether 0 means
> "literally nothing" or "so small that it could not be detected". In the
> former case you would often want to treat the value 0 as qualitatively
> different, while in the latter case adding a small but not too small
> number to the 0 values could be justified.
>
> In case you would want to treat the value 0 as qualitatively
> different, then I would do something like this:
>
> gen byte nosales = (sales == 0) if sales < .
> gen logsales = ln(sales)
> sum logsales, meanonly
> replace logsales = r(min) if nosales == 1
> reg y x1 x2 logsales nosales
>
> In that case the coefficient for logsales can be interpreted as
> before, but refers only to sales > 0. The coefficient for nosales
> represents the difference in expected value of y between those units
> with no sales at all and those units with the smallest non-zero sales.
>
> Hope this helps,
> Maarten
>
>
> On Wed, Feb 19, 2014 at 9:11 PM, Nick Cox <njcoxstata@gmail.com> wrote:
>> Stata would ignore numeric missings in anything like a regression calculation.
>>
>> That applies also to missings that result from calculating log(0).
>>
>> Changing values of 0 to values of 1 so that you can take logarithms is
>> not something I would call "usual practice". It is, I suspect,
>> regarded differently by different people on a spectrum from unethical
>> and incorrect to an acceptable fudge, depending partly on the rest of
>> the data and what you are doing with them.
>>
>> An incomplete list of things to think about:
>>
>> 0. If values of 1 occur otherwise, you have created an inconsistency.
>> If values between 0 and 1 occur otherwise, you have created a bigger
>> one. Applying log(x + 1) consistently solves this problem only by
>> creating another. Applying log(x + 1) and pretending that it is really
>> applying log(x) is not widely accepted.
>>
>> 1. If 0 really means what it says, changing it to 1 is a
>> falsification. Whether you can put a spin on it as an acceptable or
>> necessary falsification is an open question.
>>
>> 2. If 0 really means "small but not detected", changing it to e.g.
>> half the smallest observable value is sometimes an accepted or
>> acceptable modification.
>>
>> 3. Replacing log(0) with log(1) is not, necessarily, even a small and
>> conservative modification. If apart from the values of 0 values range
>> from e^3 to e^6 then after logging you have 0 and otherwise a range of 3
>> to 6. You may have _created_ a bundle of outliers that will dominate
>> analyses.
>>
>> 4. Doing something about 0s is only necessary with logarithmic
>> transformation. If you have 0s in the response, you can leave them and
>> use a logarithmic link. That won't necessarily be a good model, but
>> using a logarithmic link doesn't require positive values in the
>> response, only that the mean function be always positive. (This
>> doesn't apply in your case as the variable in question is a
>> predictor.)
>>
>> 5. There are usually alternatives, such as transformations other than
>> logarithms.
>>
>> 6. I wouldn't do anything without considering some kind of sensitivity
>> analysis, i.e. a consideration of how much difference an arbitrary
>> treatment of zeros makes.
>>
>> 7. There is often an argument that implies that the observations with
>> zeros don't belong anyway.
>>
>> (I have generalised your question, but suspect that zero values for
>> sales usually mean exactly what they say.)
>>
>> Nick
>> njcoxstata@gmail.com
>>
>> On 19 February 2014 19:44, Sebastian Say
>> <sebastian.statalist@gmail.com> wrote [edited]
>>
>>> My question is about how Stata treats a log-transformed variable
>>> that draws upon an original variable that contains zero.
>>>
>>> In my dataset, I have firm sales data but some of them have values of zero. I
>>> created a logsales variable and noticed that those with zeros are
>>> indicated as a "."
>>>
>>> I plan to run a regression, e.g.
>>>
>>> reg y x1 x2 logsales
>>>
>>> My question is, how would Stata treat these "." if I do not remove them?
>>>
>>> Technically the "." should be undefined.
>>>
>>> I've read some papers and they usually put a 1 for those sales data
>>> with zeros in them. Is this a usual practice?
>> *
>> * For searches and help try:
>> * http://www.stata.com/help.cgi?search
>> * http://www.stata.com/support/faqs/resources/statalist-faq/
>> * http://www.ats.ucla.edu/stat/stata/
>
>
>
> --
> ---------------------------------
> Maarten L. Buis
> WZB
> Reichpietschufer 50
> 10785 Berlin
> Germany
>
> http://www.maartenbuis.nl
> ---------------------------------
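
A note on the sensitivity analysis suggested in point 6 of the quoted list: one possible way to look at how much an arbitrary treatment of zeros matters is to refit the model under several replacement values and compare the coefficient on the logged predictor. The following is only a sketch, not code from any poster in the thread. It assumes the variable names used above (y, x1, x2, sales), that sales is non-negative, and that it is run from a do-file (it uses local macros). The multipliers 1, 0.5 and 0.1 are arbitrary illustrative choices.

```stata
* Sketch only: variable names y, x1, x2, sales are taken from the thread;
* the multipliers in the loop are arbitrary and purely illustrative.

* smallest strictly positive value of sales
quietly summarize sales if sales > 0, meanonly
local minpos = r(min)

foreach c in 1 0.5 0.1 {
    tempvar lsales
    * zeros are replaced by c times the smallest positive value before logging;
    * missing sales stay missing because log(.) is missing
    generate double `lsales' = log(cond(sales == 0, `c' * `minpos', sales))
    quietly regress y x1 x2 `lsales'
    display "zeros set to " `c' " x min positive sales: b = " ///
        %9.4f _b[`lsales'] "  se = " %9.4f _se[`lsales']
}
```

If the coefficient moves a lot across the multipliers, the results depend heavily on the treatment of zeros, and the indicator-variable approach or the residual plot discussed earlier in the thread may be the safer guide.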