
st: Artificial correlation, Log transformation and Zeros

From   Fardad Zand
Subject   st: Artificial correlation, Log transformation and Zeros
Date   Tue, 11 Aug 2009 17:21:54 +0200

Dear Listers,

In my econometric specification, I'm using the total number of
employees (EMP) and the share of highly educated employees (EDU) as
two important explanatory variables of my analysis. These are coming
directly from two separate survey questions asking about the number of
employees and the share of employees with a university degree. I'm now
encountering the following three problems:

1- Including both EMP and EDU in the model may induce a systematic
negative correlation between the two variables, since EDU in essence
equals (# of highly educated employees) / EMP. However, EDU in my
survey is asked directly rather than calculated, which should make
the problem less severe than a correlation that arises purely by
construction. Do you still find it problematic to include EMP and EDU
at the same time? If yes, what would you recommend?
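
One way to gauge how strong the correlation between EMP and EDU really
is would be to compute it directly. The following is a minimal Python
sketch under stated assumptions: the values in emp and edu are made up
for illustration (they are not from the survey), and pearson is a
hypothetical helper name.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values: number of employees and share with a university degree.
emp = [5, 12, 40, 150, 900]
edu = [0.40, 0.25, 0.10, 0.05, 0.00]

r = pearson(emp, edu)
print(f"corr(EMP, EDU) = {r:.3f}")
```

A correlation far from -1 would suggest that both variables carry
separate information, although this check alone does not settle the
specification question.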

2- As a solution, I can rely on a logarithmic transformation and add
ln_EMP and ln_EDU to the regression; this way, the inherent
correlation manifests itself in the estimated coefficients of these
two variables. The log transformation is also attractive for another
reason: both variables are highly skewed, and taking logs reduces
the influence of outliers (I obtain totally different results when I
use logs). However, the main problem is the very large number of zeros
in EDU: taking logs drops these observations from the analysis
(about 25% of the sample), which will bias the sample and the results.
One option is to replace the zero values of EDU with a very small
number, say 1e-06. Is this a scientifically valid approach? What are
the alternatives?
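
To make the concern about the small constant concrete: ln(1e-06) is
about -13.8, so the replaced zeros land far from the logs of the
positive shares, and a different constant moves them elsewhere. The
Python sketch below illustrates this, together with one commonly used
alternative (a zero indicator plus the log set to 0 for zero shares).
The edu values and the name log_with_zero_flag are hypothetical and
chosen only for illustration.

```python
import math

def log_with_zero_flag(share):
    """Return (ln_share, is_zero).

    For a positive share: (ln(share), 0).
    For a zero share: (0.0, 1), so the flag absorbs the zeros in a regression.
    """
    if share < 0:
        raise ValueError("share must be non-negative")
    if share == 0:
        return 0.0, 1
    return math.log(share), 0

# Hypothetical EDU values (shares in percent); not taken from the survey.
edu = [0.0, 5.0, 20.0, 60.0]

# The transformed value of a replaced zero depends entirely on the constant.
for eps in (1e-6, 1e-3, 1.0):
    print(f"eps = {eps:g}: ln(eps) = {math.log(eps):.2f}")

# Alternative: keep every observation, with a flag marking EDU == 0.
pairs = [log_with_zero_flag(x) for x in edu]
for x, (ln_x, is_zero) in zip(edu, pairs):
    print(f"EDU = {x}: ln_EDU = {ln_x:.3f}, EDU_zero = {is_zero}")
```

With the flag, no observation is dropped and the result does not hinge
on an arbitrary constant; whether this is appropriate still depends on
how the zeros are to be interpreted.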

3- Finally, we might conclude that it is better to convert EDU
manually into the # of highly educated employees (by multiplying EDU
by EMP), take the log of EMP only (to avoid the problem of the many
zeros in EDU), and then add ln_EMP and the # of highly educated
employees to the regression. Scientifically speaking, is it wise to
add EMP in log form but the # of highly educated employees in
absolute terms?
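
For reference, the log specification and a specification in the number
of highly educated employees are linked by a simple identity when EDU
is positive. The notation below is an assumption introduced for
illustration: HIGH denotes the # of highly educated employees, EDU is
treated as a fraction between 0 and 1, and beta_1, beta_2 are generic
coefficients.

```latex
\begin{align*}
\mathrm{HIGH} &= \mathrm{EDU} \times \mathrm{EMP}
  \quad\Longrightarrow\quad
  \ln \mathrm{HIGH} = \ln \mathrm{EDU} + \ln \mathrm{EMP}
  \qquad (\mathrm{EDU} > 0), \\
\beta_1 \ln \mathrm{EMP} + \beta_2 \ln \mathrm{EDU}
  &= (\beta_1 - \beta_2) \ln \mathrm{EMP} + \beta_2 \ln \mathrm{HIGH}.
\end{align*}
```

So, for positive EDU, using ln_EMP with ln_EDU or ln_EMP with
ln(HIGH) gives the same fit with a reparameterized coefficient on
ln_EMP. This equivalence does not carry over to a mixed specification
with ln_EMP and HIGH in levels.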

Overall, what would be your suggested specification, concerning all
the above issues?

I would very much appreciate your support and guidance.

Best wishes,