»  Home »  Resources & support »  FAQs »  Pooling data in linear regression

How can I pool data (and perform Chow tests) in linear regression without constraining the residual variances to be equal?

 Title Pooling data and performing Chow tests in linear regression Author William Gould, StataCorp

1. Pooling data and constraining residual variance

Consider the linear regression model,

y = β0 + β1x1 + β2x2 + u, u ~ N(0, σ2 )

and let us pretend that we have two groups of data, group=1 and group=2. We could have more groups; everything said below generalizes to more than two groups.

We could estimate the models separately by typing

	. regress y x1 x2 if group==1


and

        . regress y x1 x2 if group==2


or we could pool the data and estimate a single model, one way being

        . generate g2 = (group==2)
. generate g2x1 = g2*x1
. generate g2x2 = g2*x2
. regress y x1 x2 g2 g2x1 g2x2


The difference between these two approaches is that we are constraining the variance of the residual to be the same in the two groups when we pool the data. When we estimated separately, we estimated

group 1: y = β01 + β11x1 + β21x2 + u1, u1 ~ N(0, σ12)

and

group 2: y = β02 + β12x1 + β22x2 + u2, u2 ~ N(0, σ22)

When we pooled the data, we estimated

y = β01 + β11x1 + β21x2 + (β02-β01)g2 + (β12-β11)g2x1 + (β22-β21)g2x2 + u, u ~ N(0, σ2)

If we evaluate this equation for the groups separately, we obtain

y = β01 + β11x1 + β21x2 + u, u ~ N(0,σ2) for group=1

and

y = β02 + β12x1 + β22x2 + u, u ~ N(0,σ2) for group=2

The difference is that we have now constrained the variance of u for group=1 to be the same as the variance of u for group=2.

If you perform this experiment with real data, you will observe the following:

• You will obtain the same values for the coefficients either way.
• You will obtain different standard errors and therefore different test statistics and confidence intervals.

If u is known to have the same variance in the two groups, the standard errors obtained from the pooled regression are better—they are more efficient. If the variances really are different, however, then the standard errors obtained from the pooled regression are wrong.

2. Illustration (See the do-file and the log with the results in section 7)

I have created a dataset (containing made-up data) on y, x1, and x2. The dataset has 74 observations for group=1 and another 71 observations for group=2. Using these data, I can run the regressions separately by typing

   [1]  . regress y x1 x2 if group==1
[2]  . regress y x1 x2 if group==2


or I can run the pooled model by typing

        . generate g2 = (group==2)
. generate g2x1 = g2*x1
. generate g2x2 = g2*x2
[3]  . regress y x1 x2 g2 g2x1 g2x2


I did that in Stata, and it let me summarize the results. When I typed command [1], I obtained the following results (standard errors in parentheses):

        y = -25.98694 + 1.55219*x1 + -.4668375*x2 + u,      Var(u) = 15.5282
(22.21679)  (.5321335)   (.3961253)


and when I ran command [2], I obtained

        y = 14.95656 +   .610196*x1 + .8038961*x2 + u,      Var(u) = 6.87932
(10.14317)   (.2396607)   (.181567)


When I ran command [3], I obtained

        y = -25.98694 + 1.55219*x1 + -.4668375*x2 +
(17.3065)   (.4145229)   + (.3085748)

40.9435*g2 + -.9419941*g2x1 + 1.270734*g2x2 + u, Var(u) = 12.0962
(24.85126)    (.5910998)       (.444)


The intercept and coefficients on x1 and x2 in [3] are the same as in [1], but the standard errors are different. Also, if I sum the appropriate coefficients in [3], I obtain the same results as [2]:

        Intercept: 40.9435    + -25.98694 = 14.95656   ([2] has 14.95656)
x1:         -.9419941 + 1.55219   =   .6101959 ([2] has   .6101959)
x2:         1.270734  + -.4668375 =   .8038965 ([2] has   .8038965)


The coefficients are the same, estimated either way. (The fact that the coefficients in [3] are a little off from those in [2] is just because I did not write down enough digits.)

The standard errors for the coefficients are different.

I also wrote down the estimated Var(u), what is reported as RMSE in Stata’s regression output. In standard deviation terms, u has s.d. 15.528 in group=1, 6.8793 in group=2, and if we constrain these two very different numbers to be the same, the pooled s.d. is 12.096.

3. Pooling data without constraining residual variance

We can pool the data and estimate an equation without constraining the residual variances of the groups to be the same. Previously we typed

        . generate g2 = (group==2)
. generate g2x1 = g2*x1
. generate g2x2 = g2*x2
. regress y x1 x2 g2 g2x1 g2x2


and we start exactly the same way. To that, we add

        . predict r, resid
. summarize r if group==1
. generate w = r(Var)*(r(N)-1)/(r(N)-3) if group==1
. summarize r if group==2
. replace w = r(Var)*(r(N)-1)/(r(N)-3) if group==2
[4]  . regress y x1 x2 g2 g2x1 g2x2 [aw=1/w]


In the above, the constant 3 that appears twice is 3 because there were three coefficients being estimated in each group (an intercept, a coefficient for x1, and a coefficient for x2). If there were a different number of coefficients being estimated, that number would change.

In any case, this will reproduce exactly the standard errors reported by estimating the two models separately. The advantage is that we can now test equality of coefficients between the two equations. For instance, we can now read right off the pooled regression results whether the effect of x1 is the same in groups 1 and 2 (answer: is _b[g2x1]==0?, because _b[x1] is the effect in group 1 and _b[x1]+_b[g2x1] is the effect in group 2, so the difference is _b[g2x1]). And, using test, we can test other constraints as well.

For instance, if you wanted to prove to yourself that the results of [4] are the same as typing regress y x1 x2 if group==2, you could type

        . test x1 + g2x1 == 0           (reproduces test of x1 for group==2)
and
. test x2 + g2x2 == 0           (reproduces test of x2 for group==2)


4. Illustration

Using the made-up data, I did exactly that. To recap, first I estimated separate regressions:

   [1]  . regress y x1 x2 if group==1
[2]  . regress y x1 x2 if group==2


and then I ran the variance-constrained regression,

        . generate g2 = (group==2)
. generate g2x1 = g2*x1
. generate g2x2 = g2*x2
[3]  . regress y x1 x2 g2 g2x1 g2x2


and then I ran the variance-unconstrained regression,

        . predict r, resid
. summarize r if group==1
. generate w = r(Var)*(r(N)-1)/(r(N)-3) if group==1
. summarize r if group==2
. replace w = r(Var)*(r(N)-1)/(r(N)-3) if group==2
[4]  . regress y x1 x2 g2 g2x1 g2x2 [aw=1/w]


Just to remind you, here is what commands [1] and [2] reported:

        y = -25.98694 + 1.55219*x1 + -.4668375*x2 + u,      Var(u) = 15.5282
(22.21679)  (.5321335)   (.3961253)

y = 14.95656 +   .610196*x1 + .8038961*x2 + u,      Var(u) = 6.87932
(10.14317)   (.2396607)   (.181567)


Here is what command [4] reported:

        y = -25.98694  + 1.55219*x1 + -.4668375*x2 +
(22.21679)   (.5321335)   (.3961253)

40.9435*g2 + -.9419941*g2x1 + 1.270734*g2x2 + u
(24.42273)    (.5836123)       (.4357543)


Those results are the same as [1] and [2]. (Pay no attention to the RMSE reported by regress at this last step; the reported RMSE is the standard deviation of neither of the two groups but is instead a weighted average; see the FAQ on this if you care. If you want to know the standard errors of the respective residuals, look back at the output from the summarize statements typed when producing the weighting variable.)

 Technical Note:   In creating the weights, we typed  . summarize r if group==1 . generate w = r(Var)*(r(N)-1)/(r(N)-3) if group==1  and similarly for group 2. The 3 that appears in the finite-sample normalization factor (r(N)-1)/(r(N)-3) appears because there are three coefficients per group being estimated. If our model had fewer or more coefficients, that number would change. In fact, the finite-sample normalization factor changes results very little. In real work, I would have ignored it and typed  . summarize r if group==1 . generate w = r(Var) if group==1  unless the number of observations in one of the groups was very small. The normalization factor was included here so that [4] would produce the same results as [1] and [2].

5. The (lack of) importance of not constraining the variance

Does it matter whether we constrain the variance? Here, it does not matter much. For instance, if after

   [4]  . regress y x1 x2 g2 g2x1 g2x2 [aw=1/w]


we test whether group 2 is the same as group 1, we obtain

        . test g2x1 g2x2 g2

( 1)  g2x1 = 0.0
( 2)  g2x2 = 0.0
( 3)  g2 = 0.0

F(  3,   139) =  316.69
Prob > F =    0.0000


If instead we had constrained the variances to be the same, estimating the model using

   [3]  . regress y x1 x2 g2 g2x1 g2x2


and then repeated the test, the reported F-statistic would be 309.08.

If there were more groups, and the variance differences were great among the groups, this could become more important.

6. Another way to fit the variance-unconstrained model

Stata’s xtgls, panels(het) command (see xtgls) fits exactly the model we have been describing, the only difference being that it does not make all the finite-sample adjustments, so its standard errors are just a little different from those produced by the method just described. (To be clear, xtgls, panels(het) does not make the adjustment described in the technical note above, and it does not make the finite-sample adjustments regress itself makes, so variances are invariable normalized by N, the number of observations, rather than N-k, observations minus number of estimated coefficients.)

Anyway, to estimate xtgls, panels(het), you pool the data just as always,

        . generate g2 = (group==2)
. generate g2x1 = g2*x1
. generate g2x2 = g2*x2


and then type

   [5]  . xtgls y x1 x2 g2 g2x1 g2x2, panels(het) i(group)


to estimate the model. The result of doing that with my fictional data is

        y = -25.98694 + 1.55219*x1 + -.4668375*x2 +
(21.76179)  (.5212354)   (.3880126)

40.9435*g2 + -.9419941*g2x1 + 1.270734*g2x2 + u
(23.91887)    (.5715739)       (.4267639)

These are the same coefficients we have always seen.

The standard errors produced by xtgls, panels(het) here are about 2% smaller than those produced by [4] and in general will be a little smaller because xtgls, panels(het) is an asymptotically based estimator. The two estimators are asymptotically equivalent, however, and in fact quickly become identical. The only caution I would advise is not to use xtgls, panels(het) if the number of degrees of freedom (observations minus number of coefficients) is below 25 in any of the groups. Then, the weighted OLS approach [4] is better (and you should make the finite-sample adjustment described in the above technical note).

7.1 do-file

The following do-file, named uncv.do, was used. Up until the line reading “BEGINNING OF DEMONSTRATION’, the do-file is concerned with constructing the artificial dataset for the demonstration: uncv.do

7.2 log

The do-file shown in 7.1 produced the following output: uncv.log