Statalist The Stata Listserver


Re: st: R-squared in panel data models

From   "Nina Karstens" <>
Subject   Re: st: R-squared in panel data models
Date   Tue, 7 Mar 2006 10:40:06 +0100 (MET)

Hi statalist and hi Bill!

I had the same problem as Ahmed (in 2003...) and your answer was extremely
helpful. Just one question remains: I want to refer to the fact that "In
the -xtreg, fe- calculation, we are washing out the explanatory effects of
the intercepts" in my paper, but actually I would prefer to have a reference
for that. Do you or does somebody know whom I can cite? 

Greetings Nina

Ahmed Diesel <> asked, 

> Why is R-squared in panel data models always very low (so that everybody
> happy about an R-squared of 10%)?  I don't find any explanation about that
> in the literature.

Ahmed has asked a very deep question and one deserving of an answer.

The first answer, of course, is that what is acceptable depends on your science,
but some people, hearing that answer, may think that means standards are lower
in some sciences than others, so some scientists get away with things that
other scientists couldn't dream of doing.

That would be a misinterpretation of the short answer.  For different sciences,
what is acceptable can vary because of the nature of the problem itself.  It
all depends on where the "noise" in the data resides, as I will explain.  In
one science, a result with a "poor" R-squared may in fact contain much more
information than, in another science, a result with an R-squared near one.

I suspect Ahmed is an economist so I am going to answer using an economic 
example.  Well, that's only half the reason.  I was trained as an economist.
Anyway, it is easy enough to recast my answer to other sciences and it is 
rather fun to do that, because what turns out to be important and unimportant
can change.

The argument below has two parts:

    1.  (Substantive) It should not surprise you that if you compare 
        Ahmed's earnings with his own earnings at different times, you can
        explain much of the variation with a few variables.  If you compare
        Ahmed's earnings to, say, Nick Cox's earnings, those same few
        variables will explain less.

    2.  (Calculation) The R-squared reported with panel-dataset models is 
        cross-sectional-like rather than time-series-like; it can be compared
        with R-squareds from cross-sectional regressions but cannot, without
        adjustment, be compared to R-squareds from time-series models.

Cross-sectional economic data

A classic problem is annual earnings as a function of educational attainment,
age, and labor-market experience:

   ln(earnings_i) = a  +  b*ed_i 
                       +  c1*age_i + c2*(age_i)^2 
                       +  d1*exp_i + d2*(exp_i)^2 
                       +  u_i

Now clearly, there are thousands of other things that affect the level of
earnings other than educational attainment, age, and labor-market experience,
and those things will vary from person to person in the data.  All those
things are wrapped together in the residual, along with pure luck:

              u_i = e*Z_i + pureluck_i

You should not be surprised when this simple model does a poor job, relatively
speaking, at explaining the level of earnings.  Consider the subset of the
data, persons with ed=16 (college graduate), age 35, and all having worked
the same number of years; you know you will see considerable variation in
their earnings.

That, however, is not a criticism of the model.  It may turn out that we
accurately estimate a large effect for educational attainment.  That would be
useful information:  we may not be able to explain across people the overall
level of earnings very accurately, but we might very accurately be able to
measure the effect of education.
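
To make that point concrete, here is a minimal simulation sketch in Python.
Everything in it is an assumption chosen for illustration, not an estimate
from real data: the true return to education (0.10), the intercept (9.0), the
noise standard deviation (1.0), and the sample size.  Age and experience are
left out to keep the regression to a single variable.

```python
import math
import random

random.seed(42)

N = 5000            # hypothetical number of persons
TRUE_RETURN = 0.10  # hypothetical effect of one year of education on ln(earnings)

# Education in years, and ln(earnings) with a large person-level residual
# standing in for everything we did not measure plus pure luck.
ed = [random.randint(8, 20) for _ in range(N)]
ln_earn = [9.0 + TRUE_RETURN * e + random.gauss(0, 1.0) for e in ed]

# One-variable OLS by hand.
mean_ed = sum(ed) / N
mean_y = sum(ln_earn) / N
sxx = sum((e - mean_ed) ** 2 for e in ed)
sxy = sum((e - mean_ed) * (y - mean_y) for e, y in zip(ed, ln_earn))
b_hat = sxy / sxx
a_hat = mean_y - b_hat * mean_ed

ssr = sum((y - a_hat - b_hat * e) ** 2 for e, y in zip(ed, ln_earn))
sst = sum((y - mean_y) ** 2 for y in ln_earn)
r2 = 1 - ssr / sst
se_b = math.sqrt(ssr / (N - 2) / sxx)  # conventional OLS standard error

print(f"R-squared      = {r2:.3f}")
print(f"b_hat (ed)     = {b_hat:.4f}  (true value {TRUE_RETURN})")
print(f"std. error (b) = {se_b:.4f}")
```

Under these assumptions the R-squared comes out low while the education
coefficient is estimated tightly: the two quantities answer different
questions.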

Time-series data

Now let's consider the same problem but this time, use time-series data to
estimate it.  What we are going to do is take one person from our cross-
sectional dataset (say the first person), collect data over time, and
estimate

   ln(earnings_1t) = a  +  b*ed_1t
                        +  c1*age_1t + c2*(age_1t)^2 
                        +  d1*exp_1t + d2*(exp_1t)^2 
                        +  u_1t

I assert that, if you do this, you will find that you can explain the
variation in earnings very well:  R-squared will be high.  The reason for this
is that, this time, it will be the coefficient "a" rather than the
residual u which will include the 1,000s of variables that we did not measure.

As a technical note, let me say that in the formulas we use to calculate
R-squared, we do not really assign any explanatory power to the intercept, but
that is misleading, because neither do we, in the data, ever observe any
variation across person -- there is only one person.  Thus, the net result is
as if we did assign explanatory power to the intercept in the sense of
cross-person variation.

I'll give you the math, but before that, just think about it.  You take 
Ahmed Diesel and collect his earnings over time.   Now you set about 
"explaining" his earnings.  Ahmed's average earnings, by itself, will have
lots of explanatory power.  Indeed, over a short enough period, Ahmed's 
average earnings might be constant, in which case we would have an R-squared
of 1.

The math

Let us now do the math.  I will tell you that ln(earnings_it), for any person
i in the world, at any time t, is given by 

   ln(earnings_it) = a  +  b*S_i           (things about the person)
                        +  c*S_t           (things about the time)
                        +  d*S_it          (things about the person and time)
                        +  e_it            (a little noise)

Let me tell you that this model is very complete:  I have talked not only 
to economists, but psychologists, epidemiologists, and even physicists.  In 
fact, it was not until I talked to the physicists and they told me about 
quantum effects that I had any randomness in the model at all.  This model 
has everything.

The problem with this model is that I have no hope of measuring most of the
variables contained in S_i, S_t, and S_it.  

Still, I set about estimating this model.  First, I will use cross-sectional
data.  I will use data for t=2002.  The first thing that happens to this model
is that I lose all variation in time, so let me recollect terms:

   ln(earnings_i,2002) =    a + c*S_2002          (intercept)
                          + b*S_i + d*S_i,2002    (things about person)
                          + e_i,2002              (noise)

Understand what just happened here:  for t=2002, S_t = S_2002 is just a set of
values that do not vary, so c*S_2002 becomes a single constant value.  Let me
write the above as

   ln(earnings_i)      =    a + c*S_2002          intercept
                         +  c' * T_i              T_i = (S_i, S_i,2002)
                         +  noise_i

What I next do is divide T_i into that which I can measure and that which 
I cannot:

               T_i     =  (M_i, U_i)

and that leads to

   ln(earnings_i)      =    a + c*S_2002          <- intercept
                         +  c1 * M_i              
                         +  noise_i + c2*U_i      <- resulting residual

and there is the model I can estimate.  The residual contains every
person-specific thing I cannot measure, and the intercept contains every 
time-specific thing (measurable or not).  (Economists:  I have swept under the
rug issues of the correlation of variables in the model; this is not important
for explaining R-squared.)

Now let's do the time-series model.  I start with the same model, 

   ln(earnings_it) = a  +  b*S_i           (things about the person)
                        +  c*S_t           (things about the time)
                        +  d*S_it          (things about the person and time)
                        +  e_it            (a little noise)

and this time I set i=1 and end up with 

   ln(earnings_t)      =    a + b*S_1             <- intercept
                         +  c1 * M_t              
                         +  noise_t + c2*U_t      <- resulting residual

This time, the intercept contains every person-specific thing (measurable
or not), and the residual contains every time-specific thing I cannot measure.
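
The contrast between the two derivations can be sketched with a small
simulation.  This is only an illustration under assumed values:  a single
unmeasured person effect with standard deviation 1.0 stands in for c2*U_i,
experience is the only measured variable, and the return to experience (0.05)
and the noise level (0.05) are made up.

```python
import random

def simple_ols_r2(x, y):
    """Regress y on a constant and x; return (slope, R-squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ssr = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    sst = sum((b - my) ** 2 for b in y)
    return slope, 1 - ssr / sst

random.seed(2003)
RETURN_TO_EXP = 0.05   # hypothetical effect of a year of experience
N_PERSONS = 2000

# Unmeasured person-specific things, collapsed into one number per person.
person_effect = [random.gauss(0, 1.0) for _ in range(N_PERSONS)]

# Cross-section: many persons, one year.  The person effect varies across
# observations, so it lands in the residual.
exp_cs = [random.randint(0, 30) for _ in range(N_PERSONS)]
y_cs = [person_effect[i] + RETURN_TO_EXP * exp_cs[i] + random.gauss(0, 0.05)
        for i in range(N_PERSONS)]
slope_cs, r2_cs = simple_ols_r2(exp_cs, y_cs)

# Time series: the first person followed for 20 years.  The person effect is
# constant across observations, so it is absorbed by the intercept.
exp_ts = list(range(20))
y_ts = [person_effect[0] + RETURN_TO_EXP * t + random.gauss(0, 0.05)
        for t in exp_ts]
slope_ts, r2_ts = simple_ols_r2(exp_ts, y_ts)

print(f"cross-section: slope = {slope_cs:.3f}, R-squared = {r2_cs:.3f}")
print(f"time series:   slope = {slope_ts:.3f}, R-squared = {r2_ts:.3f}")
```

Both regressions recover roughly the same slope, but the time-series
R-squared is far higher, because the person-specific variation never shows up
in the data being explained.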

Panel datasets

You can do the math for a panel dataset yourself.  It's rather fun, but 
you end up with lots of terms as you divide each separate piece into 
observable and nonobservable.  

Regardless, panel datasets are just a combination of cross-sectional and 
time-series datasets, and so you should expect the reported explanatory power
of panel datasets to lie in between.

In fact, however, there is one more thing you need to know:  when we calculate
the explanatory power, we assign no explanatory power to the individual 
intercepts.  This is no different from usual.  What is different from the 
time-series case is that we do have variation across person, so we are in 
effect reporting a "cross-sectional like" R-squared.  The reported R-squared
can be compared with R-squareds from cross-sectional models, but not with 
those from time-series models.

To better understand this detail, try the following experiment:  run a 
fixed-effects model that has just a few fixed effects.  First run it 
using -xtreg, fe-.  Write down the R-squareds.  Now run it using linear 
regression, creating the dummy variables for each of the persons yourself.
All results will be the same, except that the reported R-squared will be
higher.  In the -xtreg, fe- calculation, we are washing out the explanatory
effects of the intercepts.  If you just run it using linear regression, those
explanatory effects are not removed.
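
For readers without Stata at hand, the same experiment can be imitated in
plain Python.  This is a sketch on simulated data, not a reproduction of what
-xtreg, fe- does internally:  it assumes that the relevant R-squared is the
"within" one, computed from person-demeaned data, and the panel dimensions,
slope, and intercept spread are invented for the illustration.

```python
import random

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(X, y):
    """Return (coefficients, residual sum of squares) via normal equations."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    ssr = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(X, y))
    return beta, ssr

def tss(y):
    """Total sum of squares around the mean."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)

def demean_by_person(values, person):
    """Subtract each person's own mean from that person's observations."""
    sums, counts = {}, {}
    for v, p in zip(values, person):
        sums[p] = sums.get(p, 0.0) + v
        counts[p] = counts.get(p, 0) + 1
    return [v - sums[p] / counts[p] for v, p in zip(values, person)]

random.seed(1)
N_PERSONS, N_PERIODS, SLOPE = 5, 20, 0.5   # hypothetical panel
person, x, y = [], [], []
for i in range(N_PERSONS):
    alpha_i = 10.0 * i                     # widely spread individual intercepts
    for _ in range(N_PERIODS):
        xi = random.gauss(0, 1)
        person.append(i)
        x.append(xi)
        y.append(alpha_i + SLOPE * xi + random.gauss(0, 1))

# (a) Linear regression with one dummy per person (no separate constant).
X_lsdv = [[xi] + [1.0 if p == j else 0.0 for j in range(N_PERSONS)]
          for xi, p in zip(x, person)]
beta_lsdv, ssr_lsdv = ols(X_lsdv, y)
r2_lsdv = 1 - ssr_lsdv / tss(y)

# (b) Within regression: demean by person, then regress without dummies.
x_dm, y_dm = demean_by_person(x, person), demean_by_person(y, person)
beta_within, ssr_within = ols([[v] for v in x_dm], y_dm)
r2_within = 1 - ssr_within / tss(y_dm)

print(f"slope:     dummies = {beta_lsdv[0]:.4f}, within = {beta_within[0]:.4f}")
print(f"R-squared: dummies = {r2_lsdv:.3f}, within = {r2_within:.3f}")
```

The slope and the residual sum of squares agree between the two runs.  Only
the denominator of R-squared differs:  the dummy-variable regression divides
by the total variation in y, intercept differences included, while the within
calculation divides by what is left after those differences are removed.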

-- Bill

Nina Karstens
Department of Food Economics & Consumption Studies
University of Kiel