
Re: Re: st: Log transformation and related issues

From   "Nick Cox" <>
To   <>
Subject   Re: Re: st: Log transformation and related issues
Date   Wed, 12 Aug 2009 21:07:07 +0100

Just to add to Maarten's sage advice that how, or indeed whether, to
take logarithms of zero in some roundabout way is a frequent question on
this list. See for example a thread last month: 



Maarten Buis <>

--- On Tue, 11/8/09, Fardad Zand wrote:
> In my econometric specification, I'm using the total number
> of employees (EMP) and the share of highly educated 
> employees (EDU) as two important explanatory variables of
> my analysis. These are coming directly from two separate
> survey questions asking about the number of employees and
> the share of employees with a university degree. I'm now
> encountering the following three problems:
> 1- Adding EMP and EDU in the model may lead to some sort of
> systematic negative correlation between the two variables
> as EDU in essence equals: # of highly educated employees
> /EMP.

The fact that explanatory variables are correlated is not a 
problem, except when the correlation becomes perfect, in 
which case we can't distinguish between the variables and we 
then obviously can't compute separate effects for each of 
these variables. In fact this correlation between explanatory 
variables is the very reason why we do a regression with 
multiple explanatory variables: it is this correlation that 
makes a variable a confounding variable which needs to be 
controlled for.
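
A small simulation may make this point concrete. The sketch below is in Python rather than Stata, purely for illustration. The variables x1 and x2 are made-up stand-ins for two negatively correlated explanatory variables (they are not the EMP and EDU data), and the true effects of 1 and 2 and the correlation of about -0.6 are assumptions chosen for the example. With both variables in the model the two effects are recovered separately; leaving x2 out distorts the estimate for x1, which is what makes x2 a confounder.

```python
import random

def slopes_two(y, x1, x2):
    """OLS slopes of y on x1 and x2 (intercept included), via centered normal equations."""
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    # det is zero only when x1 and x2 are perfectly correlated,
    # which is the one case where separate effects cannot be computed.
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2

def slope_one(y, x):
    """OLS slope of y on a single x (intercept included)."""
    n = len(y)
    my, mx = sum(y) / n, sum(x) / n
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

random.seed(12345)
n = 5000
x1 = [random.gauss(0, 1) for _ in range(n)]
# x2 is built to correlate with x1 at roughly -0.6 (an assumed value)
x2 = [-0.6 * a + 0.8 * random.gauss(0, 1) for a in x1]
# assumed true effects: 1 for x1 and 2 for x2
y = [1.0 * a + 2.0 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

b1, b2 = slopes_two(y, x1, x2)   # should land near 1 and 2
b1_alone = slope_one(y, x1)      # should land near 1 + 2 * (-0.6) = -0.2
print(round(b1, 2), round(b2, 2), round(b1_alone, 2))
```

In this setup the correlation does no harm as long as both variables are in the model; the harm comes from omitting one of them.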

>  However, EDU in my survey is not calculated but directly
> asked. Thus, that will reduce the problem compared to a
> situation of artificial correlation by construction. Yet,
> do you still find it problematic to add EMP and EDU at
> the same time? 

The reduction in the correlation is due to extra measurement 
error, which is not a solution but a problem. However, the 
trick to get research done is to worry about one problem at 
a time, so I recommend you forget this problem (for now).
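
As a rough illustration of why the correlation shrinks, here is a minimal Python sketch, again with simulated data and not the survey itself. The names size, share_true and share_reported are hypothetical, and the noise level is an assumption. The reported share is the true share plus independent reporting error, and that error alone pulls the correlation with size towards zero.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / math.sqrt(saa * sbb)

random.seed(2009)
n = 5000
size = [random.gauss(0, 1) for _ in range(n)]
# true share: correlated with size at roughly -0.6 (an assumed value)
share_true = [-0.6 * s + 0.8 * random.gauss(0, 1) for s in size]
# reported share: true share plus independent reporting error (assumed sd of 1)
share_reported = [t + random.gauss(0, 1) for t in share_true]

r_true = pearson(size, share_true)
r_reported = pearson(size, share_reported)
print(round(r_true, 2), round(r_reported, 2))
```

The weaker correlation in the second case reflects noise in the measured share, not a weaker underlying relationship.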

> 2- As a solution, I can rely on logarithmic transformation
> and add ln_EMP and ln_EDU into regression; this way, the
> inherit correlation manifest itself in the corresponding 
> estimated coefficients of these two variables. The log
> transformation is indeed a good solution from another
> reason as well. These two variables are highly skewed and
> log can reduce the effect of outliers (and I can see that
> by obtaining totally different results when I use log).
> However, the main problem is the very high number of zeros
> for EDU variable; this way, taking log will drop these
> observations out of the analysis, which will bias the
> sample and results (about 25% of the sample is discarded
> this way). A solution is to impute zero values of EDU with
> a very small number, say 1 e-06. Is this a scientifically
> valid approach? What are the alternatives?

If you are worried about outliers, then adding such a small 
number is an absolutely horrible approach as now you are 
adding an outlier to the left side of your distribution. The 
reason for transforming your explanatory variable should be 
because the effect of that variable is non-linear, so your 
first port of call should be a scatter plot of EMP on the 
x-axis and your dependent variable on the y-axis and another 
scatter plot of EDU on the x-axis and your dependent variable 
on the y-axis. Then just look at what kind of functional form 
of the relationship would make sense for these variables. 
Usually a log transform of size makes sense, but I am not so 
sure if the same thing is true for a proportion variable. 
However, that is an empirical question, so just take a look. 
One thing you could look at is whether the zero proportions 
are qualitatively different from the rest, which you could 
represent with a linear effect of EDU combined with a dummy 
for EDU==0. You can always represent the relationship in a 
flexible non-linear way using for example restricted cubic 
splines (see: -help mkspline-) and/or fractional 
polynomials (see: -help fracpoly-).
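
To see how far out the imputed zeros end up, consider the following Python sketch. The EDU shares in it are invented for illustration, and 1e-06 is the constant proposed in the question. It compares the position of the imputed zeros on the log scale with the spread of the positive values, and then shows the alternative coding of a linear EDU term plus a dummy for EDU==0.

```python
import math

# invented shares of highly educated employees; two firms report zero
edu = [0.0, 0.0, 0.05, 0.10, 0.25, 0.40, 0.80]
eps = 1e-6  # the small constant proposed in the question

# log after replacing zeros by eps: log(1e-6) is about -13.8
ln_edu_imputed = [math.log(v if v > 0 else eps) for v in edu]

# spread of the genuinely positive values on the log scale
positive_logs = [math.log(v) for v in edu if v > 0]
span_positive = max(positive_logs) - min(positive_logs)

# distance from the imputed zeros to the smallest positive value
gap = min(positive_logs) - math.log(eps)
print(round(span_positive, 2), round(gap, 2))

# alternative coding: EDU kept on its original scale, plus an
# indicator that is 1 when EDU == 0 and 0 otherwise
design = [(v, 1 if v == 0 else 0) for v in edu]
print(design)
```

With these made-up numbers the imputed zeros sit several times further from the smallest positive value than the whole range of positive values spans, so they act as outliers on the left. The dummy coding keeps all observations without making that choice of constant.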

> 3- Finally, we might come to the conclusion that EDU is
> better to be transferred manually to the # of highly
> educated employees (by multiplying EDU to EMP) and then
> apply log on EMP only (to avoid the problem of many
> zeros for EDU) and then add ln_EMP and # of highly
> educated employees into the regression. Scientifically
> speaking, is it wise to add EMP in log form but #
> high-educated employees in absolute level?

There is no reason why you cannot do that. The choice 
between functional forms should depend on how the 
explanatory variable influences the explained variable, as 
was discussed above.

