Hi Statalist and hi Bill!
I had the same problem as Ahmed (in 2003...) and your answer was extremely
helpful. Just one question remains: in my paper I want to refer to the fact
that "In the -xtreg, fe- calculation, we are washing out the explanatory
effects of the intercepts", but I would prefer to have a reference for that.
Do you, or does anybody else, know whom I can cite?
Greetings, Nina
Ahmed Diesel <ahmed.diesel@gmx.de> asked,
> Why is R-squared in panel data models always very low (so that everybody is
> happy about an R-squared of 10%)? I don't find any explanation about that
> in the literature.
Ahmed has asked a very deep question and one deserving of an answer.
The first answer, of course, is that what is acceptable depends on your
science. Some people, hearing that answer, may think it means standards are
lower in some sciences than in others, so that some scientists get away with
things that other scientists couldn't dream of doing.
That would be a misinterpretation of the short answer. What is acceptable
can vary across sciences because of the nature of the problem itself. It
all depends on where the "noise" in the data resides, as I will explain. In
one science, a result with a "poor" R-squared may in fact contain much more
information than, in another science, a result with an R-squared near one.
I suspect Ahmed is an economist so I am going to answer using an economic
example. Well, that's only half the reason. I was trained as an economist.
Anyway, it is easy enough to recast my answer to other sciences, and it is
rather fun to do that, because what turns out to be important and
unimportant can change.
The argument below has two parts:
   1.  (Substantive) It should not surprise you that if you compare
       Ahmed's earnings with his own earnings at different times, you can
       explain much of the variation with a few variables. If you compare
       Ahmed's earnings to, say, Nick Cox's earnings, those same few
       variables will explain less.
   2.  (Calculation) The R-squared reported with panel-data models is
       cross-sectional-like rather than time-series-like; it can be
       compared with R-squareds from cross-sectional regressions but
       cannot, without adjustment, be compared to R-squareds from
       time-series models.
Cross-sectional economic data
-----------------------------
A classic problem is annual earnings as a function of educational attainment,
age, and labor-market experience:
    ln(earnings_i) = a + b*ed_i
                       + c1*age_i + c2*(age_i)^2
                       + d1*exp_i + d2*(exp_i)^2
                       + u_i
Now clearly, there are thousands of things other than educational
attainment, age, and labor-market experience that affect the level of
earnings, and those things will vary from person to person in the data.
All those things are wrapped together in the residual, along with pure luck:
    u_i = e*Z_i + pureluck_i
You should not be surprised when this simple model does a poor job,
absolutely speaking, at explaining the level of earnings. Consider the
subset of the data, persons with ed=16 (college graduate), age 35, and all
having worked 14 years; you know you will see considerable variation in
their earnings.
That, however, is not a criticism of the model. It may turn out that we
accurately estimate a large effect for educational attainment. That would
be useful information: we may not be able to explain the overall level of
earnings across people very accurately, but we might be able to measure the
effect of education very accurately.
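A small simulation makes the same point. The sketch below is only an illustration, not anything estimated from real data: the sample size, the 0.08 return to a year of schooling, and the residual standard deviation of 1 are all made-up values, and simple_ols is a hypothetical helper written for this example.

```python
import random
import statistics


def simple_ols(x, y):
    """Return (slope, slope std. error, R-squared) for y = a + b*x + u."""
    n = len(x)
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - ybar) ** 2 for yi in y)
    se_b = (rss / (n - 2) / sxx) ** 0.5
    return b, se_b, 1 - rss / tss


rng = random.Random(12345)
n = 20000  # assumed number of persons in the cross section

# Assumed data: years of education, and log earnings with a true
# education effect of 0.08 plus a large person-specific residual.
ed = [rng.randint(8, 20) for _ in range(n)]
ln_earn = [9.0 + 0.08 * e + rng.gauss(0, 1.0) for e in ed]

b, se_b, r2 = simple_ols(ed, ln_earn)
print(f"slope = {b:.4f}, std. error = {se_b:.4f}, R-squared = {r2:.3f}")
```

With these assumed values the R-squared comes out low, yet the standard error on the education coefficient is small relative to the coefficient: the model explains little of the level of earnings and still pins down the effect of education.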
Time-series data
----------------
Now let's consider the same problem, but this time use time-series data to
estimate it. What we are going to do is take one person from our
cross-sectional dataset (say the first person), collect data over time, and
then estimate:
    ln(earnings_1t) = a + b*ed_1t
                        + c1*age_1t + c2*(age_1t)^2
                        + d1*exp_1t + d2*(exp_1t)^2
                        + u_1t
I assert that, if you do this, you will find that you can explain the
variation in earnings very well: R-squared will be high. The reason is
that, this time, it will be the coefficient "a" rather than the residual u
that includes the thousands of variables that we did not measure.
As a technical note, let me say that in the formulas we use to calculate
R-squared, we do not really assign any explanatory power to the intercept,
but that is misleading, because neither do we, in the data, ever observe any
variation across persons -- there is only one person. Thus, the net result
is as if we did assign explanatory power to the intercept in the sense of
cross-person variation.
I'll give you the math, but before that, just think about it. You take
Ahmed Diesel and collect his earnings over time. Now you set about
"explaining" his earnings. Ahmed's average earnings, by itself, will
provide lots of explanatory power. Indeed, over a short enough period,
Ahmed's average earnings might be constant, in which case we would have an
R-squared of 1.
The math
--------
Let us now do the math. I will tell you that ln(earnings_it), for any
person i in the world, at any time t, is given by
    ln(earnings_it) = a + b*S_i     (things about the person)
                        + c*S_t     (things about the time)
                        + d*S_it    (things about the person and time)
                        + e_it      (a little noise)
Let me tell you that this model is very complete: I have talked not only
to economists, but psychologists, epidemiologists, and even physicists. In
fact, it was not until I talked to the physicists and they told me about
quantum effects that I had any randomness in the model at all. This model
has everything.
The problem with this model is that I have no hope of measuring most of the
variables contained in S_i, S_t, and S_it.
Still, I set about estimating this model. First, I will use cross-sectional
data. I will use data for t=2002. The first thing that happens to this
model is that I lose all variation in time, so let me recollect terms:
    ln(earnings_i,2002) = a + c*S_2002            (intercept)
                            + b*S_i + d*S_i,2002  (things about person)
                            + e_i,2002            (noise)
Understand what just happened here: for t=2002, S_t = S_2002 is just a set
of values that do not vary, so c*S_2002 becomes a single constant value.
Let me write the above as
    ln(earnings_i) = a + c*S_2002          <- intercept
                       + c' * T_i          where T_i = (S_i, S_i,2002)
                       + noise_i
What I next do is divide T_i into that which I can measure and that which
I cannot:
    T_i = (M_i, U_i)
and that leads to
    ln(earnings_i) = a + c*S_2002          <- intercept
                       + c1 * M_i
                       + noise_i + c2*U_i  <- resulting residual
and there is the model I can estimate. The residual contains every
person-specific thing I cannot measure, and the intercept contains every
time-specific thing (measurable or not). (Economists: I have swept under
the rug issues of the correlation of variables in the model; this is not
important for explaining R-squared.)
Now let's do the time-series model. I start with the same model,
    ln(earnings_it) = a + b*S_i     (things about the person)
                        + c*S_t     (things about the time)
                        + d*S_it    (things about the person and time)
                        + e_it      (a little noise)
and this time I set i=1 and end up with
    ln(earnings_t) = a + b*S_1             <- intercept
                       + c1 * M_t
                       + noise_t + c2*U_t  <- resulting residual
This time, the intercept contains every person-specific thing (measurable
or not), and the residual contains every time-specific thing I cannot
measure.
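To see the two cases side by side, here is a rough simulation sketch. Everything in it is an assumption made for illustration: a single measured variable (experience), a 0.05 coefficient, a large unmeasured person effect, a small noise term, and the helper name r_squared. It fits the same one-variable model once across persons in a single year and once over time for a single person.

```python
import random
import statistics


def r_squared(x, y):
    """R-squared from the one-regressor OLS fit y = a + b*x + u."""
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)


rng = random.Random(2002)
n_persons, n_years = 500, 20  # assumed panel dimensions

# Assumed data-generating process (all numbers are made up):
#   ln(earnings_it) = person_i + 0.05 * exp_it + noise_it
person = [rng.gauss(0, 0.8) for _ in range(n_persons)]  # unmeasured
exp0 = [rng.randint(0, 20) for _ in range(n_persons)]   # experience at t=0


def ln_earn(i, t):
    return person[i] + 0.05 * (exp0[i] + t) + rng.gauss(0, 0.05)


# Cross section: every person, one year. The unmeasured person
# effect lands in the residual.
x_cs = [exp0[i] for i in range(n_persons)]
y_cs = [ln_earn(i, 0) for i in range(n_persons)]
r2_cs = r_squared(x_cs, y_cs)

# Time series: one person, every year. The same person effect is
# now a constant and lands in the intercept.
x_ts = [exp0[0] + t for t in range(n_years)]
y_ts = [ln_earn(0, t) for t in range(n_years)]
r2_ts = r_squared(x_ts, y_ts)

print(f"cross-section R-squared = {r2_cs:.3f}")
print(f"time-series   R-squared = {r2_ts:.3f}")
```

Under these assumptions the cross-sectional R-squared is low and the time-series R-squared is high, even though the coefficient on experience is the same in both.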
Panel datasets
--------------
You can do the math for a panel dataset yourself. It's rather fun, but
you end up with lots of terms as you divide each separate piece into
observable and nonobservable.
Regardless, panel datasets are just a combination of cross-sectional and
time-series datasets, and so you should expect the reported explanatory
power of panel datasets to lie in between.
In fact, however, there is one more thing you need to know: when we
calculate the explanatory power, we assign no explanatory power to the
individual intercepts. This is no different from usual. What is different
from the time-series case is that we do have variation across persons, so we
are in effect reporting a "cross-sectional-like" R-squared. The reported
R-squared can be compared with R-squareds from cross-sectional models, but
not with those from time-series models.
To better understand this detail, try the following experiment: run a
fixed-effects model that has just a few fixed effects. First run it
using -xtreg, fe-. Write down the R-squareds. Now run it using linear
regression, creating the dummy variables for each of the persons yourself.
All results will be the same, except that the reported R-squared will be
much higher. In the -xtreg, fe- calculation, we are washing out the
explanatory effects of the intercepts. If you just run it using linear
regression, those explanatory effects are not removed.
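For readers without Stata at hand, the experiment can be imitated in a few lines of plain Python. This is a sketch under assumptions, not a reproduction of what -xtreg, fe- does internally: the data are simulated (5 persons, 30 periods, intercepts 0, 2, 4, 6, 8, slope 0.5), the helpers solve, ols, centered_tss, and demean_by_person are hypothetical names written for this example, and the "within" R-squared is computed from the person-demeaned regression, which I take to be the quantity the fixed-effects output emphasizes.

```python
import random


def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def ols(X, y):
    """Least squares via the normal equations; returns (coefficients, RSS)."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    rss = sum((yi - sum(bj * xj for bj, xj in zip(beta, r))) ** 2
              for r, yi in zip(X, y))
    return beta, rss


def centered_tss(v):
    m = sum(v) / len(v)
    return sum((vi - m) ** 2 for vi in v)


rng = random.Random(2003)
n_persons, n_periods = 5, 30  # "just a few fixed effects"
person, x, y = [], [], []
for i in range(n_persons):
    alpha_i = 2.0 * i  # assumed person-specific intercept
    for t in range(n_periods):
        x_it = rng.gauss(0, 1.0)
        person.append(i)
        x.append(x_it)
        y.append(alpha_i + 0.5 * x_it + rng.gauss(0, 1.0))


def demean_by_person(v):
    means = [sum(vi for vi, p in zip(v, person) if p == i) / n_periods
             for i in range(n_persons)]
    return [vi - means[p] for vi, p in zip(v, person)]


# (a) Within (fixed-effects) regression: person means removed first.
x_w, y_w = demean_by_person(x), demean_by_person(y)
beta_w, rss_w = ols([[xi] for xi in x_w], y_w)
r2_within = 1 - rss_w / centered_tss(y_w)

# (b) Linear regression with one dummy variable per person.
X_d = [[1.0 if p == i else 0.0 for i in range(n_persons)] + [xi]
       for p, xi in zip(person, x)]
beta_d, rss_d = ols(X_d, y)
r2_lsdv = 1 - rss_d / centered_tss(y)

print(f"slope:     within = {beta_w[0]:.6f}, dummies = {beta_d[-1]:.6f}")
print(f"R-squared: within = {r2_within:.3f}, dummies = {r2_lsdv:.3f}")
```

The slope and the residual sum of squares agree between the two runs. Only the denominator of R-squared differs: the within version measures variation around each person's own mean, while the dummy-variable version measures variation around the grand mean and so credits the intercepts with explaining the differences across persons.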
-- Bill
wgould@stata.com
--
Nina Karstens
Department of Food Economics & Consumption Studies
University of Kiel