How can I pool data (and perform Chow tests) in linear regression without constraining the residual variances to be equal?

Title:   Pooling data and performing Chow tests in linear regression
Author:  William Gould, StataCorp
Date:    December 1999; updated August 2005
1. Pooling data and constraining residual variance
2. Illustration
3. Pooling data without constraining residual variance
4. Illustration
5. The (lack of) importance of not constraining the variance
6. Another way to fit the variance-unconstrained model
7. Appendix: do-file and log providing results reported above
   7.1 do-file
   7.2 log
1. Pooling data and constraining residual variance
Consider the linear regression model,
     y = β0 + β1x1 + β2x2 + u,     u ~ N(0, σ²)
and let us pretend that we have two groups of data, group=1 and group=2. We
could have more groups; everything said below generalizes to more than two
groups.
We could estimate the models separately by typing
. regress y x1 x2 if group==1
and
. regress y x1 x2 if group==2
or we could pool the data and estimate a single model, one way being
. gen g2 = (group==2)
. gen g2x1 = g2*x1
. gen g2x2 = g2*x2
. regress y x1 x2 g2 g2x1 g2x2
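As an aside, current versions of Stata can fit the same interacted model with factor-variable notation, so the dummy and interaction variables need not be generated by hand; the fit is identical, although the coefficient names then differ from the g2, g2x1, and g2x2 used below:
. regress y i.group##(c.x1 c.x2)
Everything that follows uses the hand-built variables.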
The difference between these two approaches is that we are constraining the
variance of the residual to be the same in the two groups when we pool the
data. When we estimated separately, we estimated
     group 1:   y = β01 + β11x1 + β21x2 + u1,     u1 ~ N(0, σ1²)

and

     group 2:   y = β02 + β12x1 + β22x2 + u2,     u2 ~ N(0, σ2²)
When we pooled the data, we estimated
     y = β01 + β11x1 + β21x2 + (β02-β01)g2 + (β12-β11)g2x1 + (β22-β21)g2x2 + u,
         u ~ N(0, σ²)
If we evaluate this equation for the groups separately, we obtain
     y = β01 + β11x1 + β21x2 + u,     u ~ N(0, σ²)     for group=1

and

     y = β02 + β12x1 + β22x2 + u,     u ~ N(0, σ²)     for group=2
The difference is that we have now constrained the variance of u for
group=1 to be the same as the variance of u for group=2.
If you perform this experiment with real data, you will observe the following:
- You will obtain the same values for the coefficients either way.
- You will obtain different standard errors and therefore
different test statistics and confidence intervals.
If u is known to have the same variance in the two groups, the
standard errors obtained from the pooled regression are better—they
are more efficient. If the variances really are different, however, then
the standard errors obtained from the pooled regression are wrong.
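If you want a rough check on whether the residual variances really do differ across groups, one possibility is to predict the residuals from the pooled interacted regression and compare their variances by group; a small sketch, assuming g2, g2x1, and g2x2 have been generated as above and using rchk as a scratch variable name:
. quietly regress y x1 x2 g2 g2x1 g2x2
. predict rchk, resid              // residuals from the interacted regression
. robvar rchk, by(group)           // Levene-type tests that the group variances are equal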
2. Illustration
(See the do-file and the log with the results in section 7)
I have created a dataset (containing made-up data) on y, x1,
and x2. The dataset has 74 observations for group=1 and another 71
observations for group=2. Using these data, I can run the regressions
separately by typing
[1] . regress y x1 x2 if group==1
[2] . regress y x1 x2 if group==2
or I can run the pooled model by typing
. gen g2 = (group==2)
. gen g2x1 = g2*x1
. gen g2x2 = g2*x2
[3] . regress y x1 x2 g2 g2x1 g2x2
I did that in Stata; let me summarize the results. When I typed command
[1], I obtained the following results (standard errors in parentheses):
y = -8.650993 + 1.21329*x1 + -.8809939*x2 + u, sd(u) = 15.8912
(22.73703) (.54459) (.405401)
and when I ran command [2], I obtained
y = 4.646994 + .9307004*x1 + .8812369*x2 + u, sd(u) = 7.56852
(11.1593) (.236696) (.1997562)
When I ran command [3], I obtained
y = -8.650993 + 1.21329*x1 + -.8809939*x2 +
    (17.92853)  (.42942)      (.3196656)
    13.29779*g2 + -.2825893*g2x1 + 1.762231*g2x2 + u, sd(u) = 12.5312
    (25.74446)    (.6123452)       (.459958)
The intercept and coefficients on x1 and x2 in [3] are the
same as in [1], but the standard errors are different. Also, if I sum the
appropriate coefficients in [3], I obtain the same results as [2]:
Intercept: 13.29779 + -8.650993 = 4.646797 ([2] has 4.646994)
x1: -.2825893 + 1.21329 = .9307004 ([2] has .9307004)
x2: 1.762231 + -.8809939 = .8812371 ([2] has .8812369)
The coefficients are the same, estimated either way. (The fact that the sums
computed from [3] are a little off from the coefficients reported by [2] is
just because I did not write down enough digits.)
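For example, after fitting [3], the sums can be computed to full precision from the stored coefficients:
. display _b[_cons] + _b[g2]        // group-2 intercept
. display _b[x1] + _b[g2x1]         // group-2 coefficient on x1
. display _b[x2] + _b[g2x2]         // group-2 coefficient on x2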
The standard errors for the coefficients are different.
I also wrote down the estimated standard deviation of u, which is what is
reported as Root MSE in Stata's regression output. u has s.d. 15.891 in
group=1 and 7.5685 in group=2, and if we constrain these two very different
numbers to be the same, the pooled s.d. is 12.531.
3. Pooling data without constraining residual variance
We can pool the data and estimate an equation without constraining the
residual variances of the groups to be the same. Previously we typed
. gen g2 = (group==2)
. gen g2x1 = g2*x1
. gen g2x2 = g2*x2
. regress y x1 x2 g2 g2x1 g2x2
and we start exactly the same way. To that, we add
. predict r, resid
. sum r if group==1
. gen w = r(Var)*(r(N)-1)/(r(N)-3) if group==1
. sum r if group==2
. replace w = r(Var)*(r(N)-1)/(r(N)-3) if group==2
[4] . regress y x1 x2 g2 g2x1 g2x2 [aw=1/w]
In the above, the constant 3 appears twice because three coefficients are
being estimated in each group (an intercept, a coefficient for x1, and a
coefficient for x2). If a different number of coefficients were being
estimated, that number would change.
In any case, this will reproduce exactly the standard errors reported by
estimating the two models separately. The advantage is that we can now test
equality of coefficients between the two equations. For instance, we can
now read right off the pooled regression results whether the effect of
x1 is the same in groups 1 and 2 (the answer is given by whether
_b[g2x1]==0, because _b[x1] is the effect in group 1 and _b[x1]+_b[g2x1] is
the effect in group 2, so the difference is _b[g2x1]). And, using test, we
can test other constraints as well.
For instance, if you wanted to prove to yourself that the results of [4] are
the same as typing regress y x1 x2 if group==2, you could type
. test x1 + g2x1 == 0 (reproduces test of x1 for group==2)
and
. test x2 + g2x2 == 0 (reproduces test of x2 for group==2)
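Similarly, lincom reports each group-2 coefficient directly, together with a standard error and confidence interval computed from [4]:
. lincom _cons + g2                 // group-2 intercept
. lincom x1 + g2x1                  // group-2 coefficient on x1
. lincom x2 + g2x2                  // group-2 coefficient on x2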
4. Illustration
Using the made-up data, I did exactly that. To recap, first I estimated
separate regressions:
[1] . regress y x1 x2 if group==1
[2] . regress y x1 x2 if group==2
and then I ran the variance-constrained regression,
. gen g2 = (group==2)
. gen g2x1 = g2*x1
. gen g2x2 = g2*x2
[3] . regress y x1 x2 g2 g2x1 g2x2
and then I ran the variance-unconstrained regression,
. predict r, resid
. sum r if group==1
. gen w = r(Var)*(r(N)-1)/(r(N)-3) if group==1
. sum r if group==2
. replace w = r(Var)*(r(N)-1)/(r(N)-3) if group==2
[4] . regress y x1 x2 g2 g2x1 g2x2 [aw=1/w]
Just to remind you, here is what commands [1] and [2] reported:
y = -8.650993 + 1.21329*x1 + -.8809939*x2 + u, sd(u) = 15.8912
(22.73703) (.54459) (.405401)
y = 4.646994 + .9307004*x1 + .8812369*x2 + u, sd(u) = 7.56852
(11.1593) (.236696) (.1997562)
Here is what command [4] reported:
y = -8.650993 + 1.21329*x1 + -.8809939*x2 +
(22.73703) (.54459) (.405401)
13.29779*g2 + -.2825893*g2x1 + 1.762231*g2x2 + u
(25.3269) (.6050657) (.451943)
Those results are the same as [1] and [2]. (Pay no attention to the RMSE
reported by regress at this last step; the reported RMSE is the
standard deviation of neither of the two groups but is instead a weighted
average; see the FAQ on this if you care. If you want to know the standard
deviations of the respective residuals, look back at the output from the
summarize statements typed when producing the weighting variable.)
Technical Note:
In creating the weights, we typed
. sum r if group==1
. gen w = r(Var)*(r(N)-1)/(r(N)-3) if group==1
and similarly for group 2. The 3 in the finite-sample normalization factor
(r(N)-1)/(r(N)-3) appears because three coefficients per group are being
estimated. If our model had fewer or more
coefficients, that number would change. In fact, the finite-sample
normalization factor changes results very little. In real work, I would
have ignored it and typed
. sum r if group==1
. gen w = r(Var) if group==1
unless the number of observations in one of the groups was very small. The
normalization factor was included here so that [4] would produce the same
results as [1] and [2].
5. The (lack of) importance of not constraining the
variance
Does it matter whether we constrain the variance? Here, it does not matter
much. For instance, if after
[4] . regress y x1 x2 g2 g2x1 g2x2 [aw=1/w]
we test whether group 2 is the same as group 1, we obtain
. test g2x1 g2x2 g2
( 1) g2x1 = 0.0
( 2) g2x2 = 0.0
( 3) g2 = 0.0
F( 3, 139) = 307.50
Prob > F = 0.0000
If instead we had constrained the variances to be the same, estimating the
model using
[3] . regress y x1 x2 g2 g2x1 g2x2
and then repeated the test, the reported F-statistic would be 300.81.
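Spelled out, that comparison amounts to
. regress y x1 x2 g2 g2x1 g2x2      // variance-constrained model [3]
. test g2x1 g2x2 g2                 // the same joint test, now assuming equal variances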
If there were more groups, and the variance differences were great among the
groups, this could become more important.
6. Another way to fit the variance-unconstrained model
Stata’s xtgls, panels(het) command (see help
xtgls) fits exactly the
model we have been describing, the only difference being that it does not
make all the finite-sample adjustments, so its standard errors are just a
little different from those produced by the method just described. (To be
clear, xtgls, panels(het) does not make the adjustment described in
the technical note above, and it does not make the
finite-sample adjustments regress itself makes, so
variances are invariably normalized by N, the number of observations,
rather than N-k, observations minus number of estimated
coefficients.)
Anyway, to fit the model with xtgls, panels(het), you pool the data just as
always,
. gen g2 = (group==2)
. gen g2x1 = g2*x1
. gen g2x2 = g2*x2
and then type
[5] . xtgls y x1 x2 g2 g2x1 g2x2, panels(het) i(group)
to estimate the model. The result of doing that with my fictional data is
y = -8.650993 + 1.21329*x1 + -.8809939*x2 +
(22.27137) (.53344) (.397099)
13.29779*g2 + -.2825893*g2x1 + 1.762231*g2x2 + u
(24.80488) (.5925734) (.442610)
These are the same coefficients we have always seen.
The standard errors produced by xtgls, panels(het) here are about 2%
smaller than those produced by [4] and in general will be a little smaller
because xtgls, panels(het) is an asymptotically based estimator. The
two estimators are asymptotically equivalent, however, and in fact quickly
become identical. The only caution I would advise is not to use xtgls,
panels(het) if the number of degrees of freedom (observations minus
number of coefficients) is below 25 in any of the groups. Then, the
weighted OLS approach [4] is better (and you should make the finite-sample
adjustment described in the above technical note).
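A note on syntax: in current versions of Stata, the panel identifier is typically declared with xtset rather than the i() option, so the equivalent of [5] would look something like
. xtset group
. xtgls y x1 x2 g2 g2x1 g2x2, panels(heteroskedastic)
where panels(het) is accepted as an abbreviation of panels(heteroskedastic).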
7. Appendix: do-file and log providing results reported
above
7.1 do-file
The following do-file, named uncv.do, was used. Up until the line reading
“BEGINNING OF DEMONSTRATION”, the do-file is concerned with
constructing the artificial dataset for the demonstration:
uncv.do
7.2 log
The do-file shown in 7.1 produced the following output:
uncv.log