
Re: st: R-squared in panel data models


From   [email protected] (William Gould, Stata)
To   [email protected]
Subject   Re: st: R-squared in panel data models
Date   Thu, 15 May 2003 09:33:25 -0500

Ahmed Diesel <[email protected]> asked, 

> Why is R-squared in panel data models always very low (so that everybody is
> happy about an R-squared of 10%)?  I don't find any explanation about that
> in the literature.

Ahmed has asked a very deep question and one deserving of an answer.

The first answer, of course, is that what is acceptable depends on your
science.  Some people, hearing that answer, may think it means that standards
are lower in some sciences than in others, so that some scientists get away
with things other scientists couldn't dream of doing.

That would be a misinterpretation of the short answer.  For different sciences,
what is acceptable can vary because of the nature of the problem itself.  It
all depends on where the "noise" in the data resides, as I will explain.  In
one science, a result with a "poor" R-squared may in fact contain much more
information than, in another science, a result with an R-squared near one.

I suspect Ahmed is an economist, so I am going to answer using an economic
example.  Well, that's only half the reason:  I was trained as an economist.
Anyway, it is easy enough to recast my answer for other sciences, and it is
rather fun to do that, because what turns out to be important and unimportant
can change.

The argument below has two parts:

    1.  (Substantive) It should not surprise you that if you compare 
        Ahmed's earnings with his own earnings at different times, you can
        explain much of the variation with a few variables.  If you compare
        Ahmed's earnings to, say, Nick Cox's earnings, those same few
        variables will explain less.

    2.  (Calculation) The R-squared reported for panel-data models is
        cross-sectional-like rather than time-series-like; it can be compared
        with R-squareds from cross-sectional regressions but cannot, without
        adjustment, be compared with R-squareds from time-series models.



Cross-sectional economic data
-----------------------------

A classic problem is annual earnings as a function of educational attainment, 
age, and labor-market experience:

   ln(earnings_i) = a  +  b*ed_i 
                       +  c1*age_i + c2*(age_i)^2 
                       +  d1*exp_i + d2*(exp_i)^2 
                       +  u_i

Now clearly, there are thousands of things other than educational attainment,
age, and labor-market experience that affect the level of earnings, and those
things will vary from person to person in the data.  All those things are
wrapped together in the residual, along with pure luck:

              u_i = e*Z_i + pureluck_i

You should not be surprised when this simple model does a poor job, absolutely
speaking, of explaining the level of earnings.  Consider a subset of the
data:  persons with ed=16 (college graduates), age 35, all having worked 14
years; you know you will see considerable variation in their earnings.

That, however, is not a criticism of the model.  It may turn out that we
accurately estimate a large effect for educational attainment.  That would be
useful information:  we may not be able to explain the overall level of
earnings across people very accurately, but we might be able to measure the
effect of education very accurately.
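
For concreteness, here is how one might fit that cross-sectional model in
Stata.  This is only a sketch -- the variable names (earnings, ed, age,
exper, id) are hypothetical, not from any particular dataset:

    . generate lnearn = ln(earnings)
    . generate age2   = age^2
    . generate exp2   = exper^2
    . regress lnearn ed age age2 exper exp2

With real survey data on individuals, expect the R-squared reported by
-regress- here to be modest, for exactly the reasons given above.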


Time-series data
----------------

Now let's consider the same problem but, this time, use time-series data to
estimate it.  What we are going to do is take one person from our
cross-sectional dataset (say the first person), collect data on him over
time, and then estimate:

   ln(earnings_1t) = a  +  b*ed_1t
                        +  c1*age_1t + c2*(age_1t)^2 
                        +  d1*exp_1t + d2*(exp_1t)^2 
                        +  u_1t

I assert that, if you do this, you will find that you can explain the
variation in earnings very well:  R-squared will be high.  The reason is
that, this time, it is the coefficient "a" rather than the residual u that
will include the 1,000s of variables we did not measure.

As a technical note, the formulas we use to calculate R-squared assign no
explanatory power to the intercept.  Here, that is misleading, because in the
data we never observe any variation across persons -- there is only one
person.  Thus, the net result is as if we did assign explanatory power to the
intercept, in the sense of cross-person variation.

I'll give you the math, but before that, just think about it.  You take
Ahmed Diesel and collect his earnings over time.  Now you set about
"explaining" his earnings.  Ahmed's average earnings, by itself, will provide
lots of explanatory power.  Indeed, over a short enough period, Ahmed's
earnings might be constant at that average, in which case we would have an
R-squared of 1.
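
In Stata terms, and again with hypothetical variable names, the regression
for a single person might look like the following.  Note that, for one
person, ed is probably constant and age and exper move in lockstep, so
-regress- will omit some terms as collinear -- itself a reminder of how
little within-person variation these regressors carry:

    . regress lnearn ed age age2 exper exp2 if id==1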


The math
--------

Let us now do the math.  I will tell you that ln(earnings_it), for any person
i in the world, at any time t, is given by 

   ln(earnings_it) = a  +  b*S_i           (things about the person)
                        +  c*S_t           (things about the time)
                        +  d*S_it          (things about the person and time)
                        +  e_it            (a little noise)

Let me tell you that this model is very complete:  I have talked not only 
to economists, but psychologists, epidemiologists, and even physicists.  In 
fact, it was not until I talked to the physicists and they told me about 
quantum effects that I had any randomness in the model at all.  This model 
has everything.

The problem with this model is that I have no hope of measuring most of the
variables contained in S_i, S_t, and S_it.  

Still, I set about estimating this model.  First, I will use cross-sectional
data.  I will use data for t=2002.  The first thing that happens to this model
is that I lose all variation in time, so let me recollect terms:

   ln(earnings_i,2002) =    a + c*S_2002          (intercept)
                          + b*S_i + d*S_i,2002    (things about the person)
                          + e_i,2002              (noise)

Understand what just happened here:  for t=2002, S_t = S_2002 is just a set of
values that do not vary, so c*S_2002 becomes a single constant value.  Let me
rewrite the above:

   ln(earnings_i)      =    a + c*S_2002          <- intercept
                         +  c' * T_i                 T_i = (S_i, S_i,2002)
                         +  noise_i

What I do next is divide T_i into the part I can measure and the part I
cannot:

               T_i     =  (M_i, U_i)

and that leads to

   Ln(earnings_i)      =    a + c*S_2002          <- intercept
                         +  c1 * M_i              
                         +  noise_i + c2*U_i      <- resulting residual

and there is the model I can estimate.  The residual contains every
person-specific thing I cannot measure, and the intercept contains every
time-specific thing (measurable or not).  (Economists:  I have swept under the
rug issues of the correlation of variables in the model; this is not important 
for explaining R-squared.)

Now let's do the time-series model.  I start with the same model, 

   ln(earnings_it) = a  +  b*S_i           (things about the person)
                        +  c*S_t           (things about the time)
                        +  d*S_it          (things about the person and time)
                        +  e_it            (a little noise)

and this time I set i=1 and end up with 

   ln(earnings_t)      =    a + b*S_1             <- intercept
                         +  c1 * M_t              
                         +  noise_t + c2*U_t      <- resulting residual

This time, the intercept contains every person-specific thing (measurable or
not), and the residual contains every time-specific thing I cannot measure.
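
You can see both results in one small simulation.  The sketch below is mine,
not part of the derivation:  Mi stands for a measurable person
characteristic, Ui for the person-specific things we cannot measure, x for a
measurable variable that varies over persons and time, and all coefficients
and variances are arbitrary:

    . clear
    . set seed 12345
    . set obs 200
    . generate id = _n
    . generate Mi = rnormal()         // measurable person characteristic
    . generate Ui = 2*rnormal()       // unmeasured person characteristics
    . expand 20                       // 20 years of data per person
    . bysort id: generate t = _n
    . generate x = rnormal()          // measurable, varies by person and time
    . generate e = 0.1*rnormal()      // a little noise
    . generate y = 1 + Mi + Ui + x + 0.05*t + e
    . regress y Mi x if t==1          // cross section: Ui is in the residual
    . regress y x t if id==1          // time series: Mi, Ui in the intercept

The cross-sectional regression reports a modest R-squared because Ui sits in
the residual; the time-series regression reports a high one because
everything person-specific has been absorbed into the intercept.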


Panel datasets
--------------

You can do the math for a panel dataset yourself.  It's rather fun, but 
you end up with lots of terms as you divide each separate piece into 
observable and unobservable parts.

Regardless, panel datasets are just a combination of cross-sectional and 
time-series datasets, and so you should expect the reported explanatory power
of panel-data models to lie in between.

In fact, however, there is one more thing you need to know:  when we calculate
the explanatory power, we assign no explanatory power to the individual 
intercepts.  That is no different from usual.  What is different from the 
time-series case is that we do have variation across persons, so we are in 
effect reporting a "cross-sectional-like" R-squared.  The reported R-squared 
can be compared with R-squareds from cross-sectional models, but not with 
those from time-series models.

To better understand this detail, try the following experiment:  run a 
fixed-effects model that has just a few fixed effects.  First run it 
using -xtreg, fe-.  Write down the R-squareds.  Now run it using linear 
regression, creating the dummy variables for each of the persons yourself.
All results will be the same, except that the reported R-squared will be much
higher.  In the -xtreg, fe- calculation, we are washing out the explanatory
effects of the intercepts.  If you just run it using linear regression, those
explanatory effects are not removed.
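
Continuing with the simulated dataset from the sketch above, the experiment
might look like this (-tabulate, generate()- creates one dummy variable per
person, and -regress- automatically drops one of them as collinear with the
constant):

    . xtset id t
    . xtreg y x t, fe                 // write down the within R-squared
    . tabulate id, generate(d)        // creates dummies d1, d2, ..., d200
    . regress y x t d*                // same slopes, much higher R-squared

The coefficients on x and t match across the two commands; only the reported
R-squared differs, because the person dummies are allowed to take credit for
the cross-person variation that -xtreg, fe- washes out.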

-- Bill
[email protected]


