Title | Pooling data and performing Chow tests in linear regression | |

Author | William Gould, StataCorp |

- Pooling data and constraining residual variance
- Illustration
- Pooling data without constraining residual variance
- Illustration
- The (lack of) importance of not constraining the variance
- Another way to fit the variance-unconstrained model
- Appendix: do-file and log providing results reported above

Consider the linear regression model,

and let us pretend that we have two groups of data, group=1 and group=2. We could have more groups; everything said below generalizes to more than two groups.

We could estimate the models separately by typing

. regress y x1 x2 if group==1

and

. regress y x1 x2 if group==2

or we could pool the data and estimate a single model, one way being

. gen g2 = (group==2) . gen g2x1 = g2*x1 . gen g2x2 = g2*x2 . regress y x1 x2 g2 g2x1 g2x2

The difference between these two approaches is that we are constraining the variance of the residual to be the same in the two groups when we pool the data. When we estimated separately, we estimated

group 1:
*y* = *β*_{01} +
*β*_{11}*x*_{1} +
*β*_{21}*x*_{2} +
*u*_{1},
*u*_{1} ~ N(0, *σ*_{1}^{2})

and

group 2:
*y* = *β*_{02} +
*β*_{12}*x*_{1} +
*β*_{22}*x*_{2} +
*u*_{2},
*u*_{2} ~ N(0, *σ*_{2}^{2})

When we pooled the data, we estimated

*y* = *β*_{01} + *β*_{11}*x*_{1} +
*β*_{21}*x*_{2} +
(*β*_{02}-*β*_{01})*g*_{2} +
(*β*_{12}-*β*_{11})*g*_{2}*x*_{1} +
(*β*_{22}-*β*_{21})*g*_{2}*x*_{2} +
*u*, *u* ~ N(0, σ^{2})

If we evaluate this equation for the groups separately, we obtain

*y* = *β*_{01} + *β*_{11}*x*_{1} + *β*_{21}*x*_{2} +
*u*, *u* ~ N(0,σ^{2}) for group=1

and

*y* = *β*_{02} + *β*_{12}*x*_{1} + *β*_{22}*x*_{2} +
*u*, *u* ~ N(0,σ^{2}) for group=2

The difference is that we have now constrained the variance of *u* for
group=1 to be the same as the variance of *u* for group=2.

If you perform this experiment with real data, you will observe the following:

- You will obtain the same values for the coefficients either way.
- You will obtain different standard errors and therefore different test statistics and confidence intervals.

If *u* is known to have the same variance in the two groups, the
standard errors obtained from the pooled regression are better—they
are more efficient. If the variances really are different, however, then
the standard errors obtained from the pooled regression are wrong.

I have created a dataset (containing made-up data) on *y*, *x1*,
and *x2*. The dataset has 74 observations for group=1 and another 71
observations for group=2. Using these data, I can run the regressions
separately by typing

[1]. regress y x1 x2 if group==1[2]. regress y x1 x2 if group==2

or I can run the pooled model by typing

. gen g2 = (group==2) . gen g2x1 = g2*x1 . gen g2x2 = g2*x2[3]. regress y x1 x2 g2 g2x1 g2x2

I did that in Stata, and it let me summarize the results. When I typed command [1], I obtained the following results (standard errors in parentheses):

y = -25.98694 + 1.55219*x1 + -.4668375*x2 + u, Var(u) = 15.528^{2}(22.21679) (.5321335) (.3961253)

and when I ran command [2], I obtained

y = 14.95656 + .610196*x1 + .8038961*x2 + u, Var(u) = 6.8793^{2}(10.14317) (.2396607) (.181567)

When I ran command [3], I obtained

y = -25.98694 + 1.55219*x1 + -.4668375*x2 + (17.3065) (.4145229) + (.3085748) 40.9435*g2 + -.9419941*g2x1 + 1.270734*g2x2 + u, Var(u) = 12.096^{2}(24.85126) (.5910998) (.444)

The intercept and coefficients on *x1* and *x2* in [3] are the
same as in [1], but the standard errors are different. Also, if I sum the
appropriate coefficients in [3], I obtain the same results as [2]:

Intercept: 40.9435 + -25.98694 = 14.95656 ([2] has 14.95656) x1: -.9419941 + 1.55219 = .6101959 ([2] has .6101959) x2: 1.270734 + -.4668375 = .8038965 ([2] has .8038965)

The coefficients are the same, estimated either way. (The fact that the coefficients in [3] are a little off from those in [2] is just because I did not write down enough digits.)

The standard errors for the coefficients are different.

I also wrote down the estimated Var(*u*), what is reported as RMSE in
Stata’s regression output. In standard deviation terms, *u* has
s.d. 15.528 in group=1, 6.8793 in group=2, and if we constrain these two
very different numbers to be the same, the pooled s.d. is 12.096.

We can pool the data and estimate an equation without constraining the residual variances of the groups to be the same. Previously we typed

. gen g2 = (group==2) . gen g2x1 = g2*x1 . gen g2x2 = g2*x2 . regress y x1 x2 g2 g2x1 g2x2

and we start exactly the same way. To that, we add

. predict r, resid . sum r if group==1 . gen w = r(Var)*(r(N)-1)/(r(N)-[4]3) if group==1 . sum r if group==2 . replace w = r(Var)*(r(N)-1)/(r(N)-3) if group==2. regress y x1 x2 g2 g2x1 g2x2 [aw=1/w]

In the above, the constant **3** that appears twice is 3 because there
were three coefficients being estimated in each group (an intercept, a
coefficient for *x1*, and a coefficient for *x2*). If there were
a different number of coefficients being estimated, that number would
change.

In any case, this will reproduce exactly the standard errors reported by
estimating the two models separately. The advantage is that we can now test
equality of coefficients between the two equations. For instance, we can
now read right off the pooled regression results whether the effect of
*x1* is the same in groups 1 and 2 (answer: is _b[g2x1]==0?, because
_b[x1] is the effect in group 1 and _b[x1]+_b[g2x1] is the effect in group
2, so the difference is _b[g2x1]). And, using
**test**, we can test
other constraints as well.

For instance, if you wanted to prove to yourself that the results of [4] are
the same as typing **regress y x1 x2 if group==2**, you could type

. test x1 + g2x1 == 0(reproduces test of x1 for group==2) and. test x2 + g2x2 == 0(reproduces test of x2 for group==2)

Using the made-up data, I did exactly that. To recap, first I estimated separate regressions:

[1]. regress y x1 x2 if group==1[2]. regress y x1 x2 if group==2

and then I ran the variance-constrained regression,

. gen g2 = (group==2) . gen g2x1 = g2*x1 . gen g2x2 = g2*x2[3]. regress y x1 x2 g2 g2x1 g2x2

and then I ran the variance-unconstrained regression,

. predict r, resid . sum r if group==1 . gen w = r(Var)*(r(N)-1)/(r(N)-3) if group==1 . sum r if group==2 . replace w = r(Var)*(r(N)-1)/(r(N)-3) if group==2[4]. regress y x1 x2 g2 g2x1 g2x2 [aw=1/w]

Just to remind you, here is what commands [1] and [2] reported:

y = -25.98694 + 1.55219*x1 + -.4668375*x2 + u, Var(u) = 15.528^{2}(22.21679) (.5321335) (.3961253) y = 14.95656 + .610196*x1 + .8038961*x2 + u, Var(u) = 6.8793^{2}(10.14317) (.2396607) (.181567)

Here is what command [4] reported:

y = -25.98694 + 1.55219*x1 + -.4668375*x2 + (22.21679) (.5321335) (.3961253) 40.9435*g2 + -.9419941*g2x1 + 1.270734*g2x2 + u (24.42273) (.5836123) (.4357543)

Those results are the same as [1] and [2]. (Pay no attention to the RMSE
reported by **regress** at this last step; the reported RMSE is the
standard deviation of neither of the two groups but is instead a weighted
average; see the FAQ on this if you care. If you
want to know the standard errors of the respective residuals, look back at
the output from the **summarize** statements typed when producing the
weighting variable.)

and similarly for group 2. The
unless the number of observations in one of the groups was very small. The normalization factor was included here so that [4] would produce the same results as [1] and [2]. |

Does it matter whether we constrain the variance? Here, it does not matter much. For instance, if after

[4]. regress y x1 x2 g2 g2x1 g2x2 [aw=1/w]

we test whether group 2 is the same as group 1, we obtain

. test g2x1 g2x2 g2 ( 1) g2x1 = 0.0 ( 2) g2x2 = 0.0 ( 3) g2 = 0.0 F( 3, 139) = 316.69 Prob > F = 0.0000

If instead we had constrained the variances to be the same, estimating the model using

[3]. regress y x1 x2 g2 g2x1 g2x2

and then repeated the **test**, the reported F-statistic would be 309.08.

If there were more groups, and the variance differences were great among the groups, this could become more important.

Stata’s **xtgls, panels(het)** command (see **
xtgls**) fits exactly the
model we have been describing, the only difference being that it does not
make all the finite-sample adjustments, so its standard errors are just a
little different from those produced by the method just described. (To be
clear, **xtgls, panels(het)** does not make the adjustment described in
the technical note above, and it does not make the
finite-sample adjustments **regress** itself makes, so
variances are invariable normalized by *N*, the number of observations,
rather than *N*-*k*, observations minus number of estimated
coefficients.)

Anyway, to estimate **xtgls, panels(het)**, you pool the data just as
always,

. gen g2 = (group==2) . gen g2x1 = g2*x1 . gen g2x2 = g2*x2

and then type

[5]. xtgls y x1 x2 g2 g2x1 g2x2, panels(het) i(group)

to estimate the model. The result of doing that with my fictional data is

y = -25.98694 + 1.55219*x1 + -.4668375*x2 + (21.76179) (.5212354) (.3880126) 40.9435*g2 + -.9419941*g2x1 + 1.270734*g2x2 + u (23.91887) (.5715739) (.4267639)These are the same coefficients we have always seen.

The standard errors produced by **xtgls, panels(het)** here are about 2%
smaller than those produced by [4] and in general will be a little smaller
because **xtgls, panels(het)** is an asymptotically based estimator. The
two estimators are asymptotically equivalent, however, and in fact quickly
become identical. The only caution I would advise is not to use **xtgls,
panels(het)** if the number of degrees of freedom (observations minus
number of coefficients) is below 25 in any of the groups. Then, the
weighted OLS approach [4] is better (and you should make the finite-sample
adjustment described in the above technical note).

The following do-file, named uncv.do, was used. Up until the line reading “BEGINNING OF DEMONSTRATION’, the do-file is concerned with constructing the artificial dataset for the demonstration: uncv.do

The do-file shown in 7.1 produced the following output: uncv.log