
From: "Nick Cox" <n.j.cox@durham.ac.uk>
To: <statalist@hsphsun2.harvard.edu>
Subject: Re: Re: st: Log transformation and related issues
Date: Wed, 12 Aug 2009 21:07:07 +0100

Just to add to Maarten's sage advice: how, or indeed whether, to take logarithms of zero in some roundabout way is a frequent question on this list. See for example a thread last month:

<http://www.hsph.harvard.edu/cgi-bin/lwgate/STATALIST/archives/statalist.0907/Author/article-678.html>

Nick
n.j.cox@durham.ac.uk

Maarten buis <maartenbuis@yahoo.co.uk>

--- On Tue, 11/8/09, Fardad Zand wrote:
> In my econometric specification, I'm using the total number
> of employees (EMP) and the share of highly educated
> employees (EDU) as two important explanatory variables of
> my analysis. These are coming directly from two separate
> survey questions asking about the number of employees and
> the share of employees with a university degree. I'm now
> encountering the following three problems:
>
> 1- Adding EMP and EDU in the model may lead to some sort of
> systematic negative correlation between the two variables
> as EDU in essence equals: # of highly educated employees
> /EMP.

The fact that explanatory variables are correlated is not a problem, except when the correlation becomes perfect, in which case we can't distinguish between the variables and then obviously can't compute separate effects for each of them. In fact, this correlation between explanatory variables is the very reason why we do a regression with multiple explanatory variables: it is this correlation that makes a variable a confounding variable which needs to be controlled for.

> However, EDU in my survey is not calculated but directly
> asked. Thus, that will reduce the problem compared to a
> situation of artificial correlation by construction. Yet,
> do you still find it problematic to add EMP and EDU at
> the same time?

The reduction in the correlation is due to extra measurement error, which is not a solution but a problem. However, the trick to getting research done is to worry about one problem at a time, so I recommend you forget this problem (for now).
> 2- As a solution, I can rely on logarithmic transformation
> and add ln_EMP and ln_EDU into regression; this way, the
> inherent correlation manifests itself in the corresponding
> estimated coefficients of these two variables. The log
> transformation is indeed a good solution for another
> reason as well. These two variables are highly skewed and
> log can reduce the effect of outliers (and I can see that
> by obtaining totally different results when I use log).
> However, the main problem is the very high number of zeros
> for EDU variable; this way, taking log will drop these
> observations out of the analysis, which will bias the
> sample and results (about 25% of the sample is discarded
> this way). A solution is to impute zero values of EDU with
> a very small number, say 1e-06. Is this a scientifically
> valid approach? What are the alternatives?

If you are worried about outliers, then adding such a small number is an absolutely horrible approach, as you are now adding an outlier to the left side of your distribution.

The reason for transforming your explanatory variable should be that the effect of that variable is non-linear, so your first port of call should be a scatter plot with EMP on the x-axis and your dependent variable on the y-axis, and another scatter plot with EDU on the x-axis and your dependent variable on the y-axis. Then just look at what kind of functional form of the relationship would make sense for these variables. Usually a log transform of size makes sense, but I am not so sure the same is true for a proportion variable. However, that is an empirical question, so just take a look.

One thing you could look at is whether the zero proportions are qualitatively different from the rest, which you could represent with a linear effect of EDU combined with a dummy for EDU==0.
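As a rough illustration of the last few points, here is a minimal do-file sketch on simulated data. It is only a sketch under stated assumptions: the outcome y, the indicator edu_zero, the graph names, and every number in the data-generating lines are hypothetical and not from the original question, and EDU is assumed to be recorded as a proportion between 0 and 1.

```stata
* Sketch only: simulated data; y, edu_zero and all generating values are hypothetical
clear
set seed 12345
set obs 400
generate EMP = ceil(exp(rnormal(4, 1)))                 // right-skewed firm size
generate EDU = cond(runiform() < 0.25, 0, runiform())   // roughly 25% zeros
generate y   = 1 + 0.5*ln(EMP) + 2*EDU - 0.8*(EDU == 0) + rnormal()

* Why 1e-06 is risky: on the log scale it lies far to the left of any
* plausible positive share
display ln(1e-06)   // about -13.8
display ln(0.01)    // about -4.6, a share of one percent

* Look at the functional form first
twoway (scatter y EMP) (lowess y EMP), name(emp, replace)
twoway (scatter y EDU) (lowess y EDU), name(edu, replace)

* Linear effect of EDU combined with a dummy for EDU==0
generate ln_EMP = ln(EMP)
generate byte edu_zero = (EDU == 0) if !missing(EDU)
regress y ln_EMP EDU edu_zero
```

In this setup the coefficient on edu_zero measures how far firms reporting no highly educated employees sit from the line fitted to the positive shares, and no observation is dropped or assigned an arbitrary small value.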
You can always represent the relationship in a flexible non-linear way using for example restricted cubic splines (see: -help mkspline- and http://ideas.repec.org/p/boc/dsug09/04.html) or fractional polynomials (see: -help fracpoly-).

> 3- Finally, we might come to the conclusion that EDU is
> better to be transformed manually to the # of highly
> educated employees (by multiplying EDU by EMP) and then
> apply log on EMP only (to avoid the problem of many
> zeros for EDU) and then add ln_EMP and # of highly
> educated employees into the regression. Scientifically
> speaking, is it wise to add EMP in log form but #
> high-educated employees in absolute level?

There is no reason why you cannot do that. The choice between functional forms should depend on how the explanatory variable influences the explained variable, as discussed above.
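The spline suggestion and the mixed specification from point 3 could be sketched along the following lines, again on simulated data. The outcome y, the count HIGHED, the spline stub edu_s, the choice of four knots, and the data-generating values are all assumptions made for illustration, not part of the original exchange.

```stata
* Sketch only: simulated data; y, HIGHED, edu_s* and all generating values are hypothetical
clear
set seed 12345
set obs 400
generate EMP = ceil(exp(rnormal(4, 1)))                 // right-skewed firm size
generate EDU = cond(runiform() < 0.25, 0, runiform())   // roughly 25% zeros
generate y   = 1 + 0.5*ln(EMP) + 2*sqrt(EDU) + rnormal()
generate ln_EMP = ln(EMP)

* Restricted cubic spline in EDU; 4 knots yield edu_s1 (linear) to edu_s3
mkspline edu_s = EDU, cubic nknots(4)
regress y ln_EMP edu_s*
testparm edu_s2 edu_s3      // joint test of the non-linear spline terms

* Point 3: log of size together with the count of highly educated employees
generate HIGHED = EDU*EMP
regress y ln_EMP HIGHED
```

Neither model requires taking the log of a zero, so the full sample is retained; which specification is preferable remains the empirical question described above.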
