
## How does the anova command handle collinearity?

Title:  The anova command and collinearity
Author: William Sribney, StataCorp

Here is an example that illustrates what happens.

. input woman twin

woman       twin
1.       1     1
2.       2     1
3.       3     2
4.       4     2
5.       5     3
6.       6     3
7. end

. tab woman, gen(w)

           woman |      Freq.     Percent        Cum.
     ------------+-----------------------------------
               1 |          1       16.67       16.67
               2 |          1       16.67       33.33
               3 |          1       16.67       50.00
               4 |          1       16.67       66.67
               5 |          1       16.67       83.33
               6 |          1       16.67      100.00
     ------------+-----------------------------------
           Total |          6      100.00
. tab twin, gen(t)
            twin |      Freq.     Percent        Cum.
     ------------+-----------------------------------
               1 |          2       33.33       33.33
               2 |          2       33.33       66.67
               3 |          2       33.33      100.00
     ------------+-----------------------------------
           Total |          6      100.00
. gen t1w1 = t1*w1 - t1*w2

. gen t2w3 = t2*w3 - t2*w4

. gen t3w5 = t3*w5 - t3*w6

. list w* t*, nodisplay sep(0)
        woman  w1  w2  w3  w4  w5  w6  twin  t1  t2  t3  t1w1  t2w3  t3w5

     1.     1   1   0   0   0   0   0     1   1   0   0     1     0     0
     2.     2   0   1   0   0   0   0     1   1   0   0    -1     0     0
     3.     3   0   0   1   0   0   0     2   0   1   0     0     1     0
     4.     4   0   0   0   1   0   0     2   0   1   0     0    -1     0
     5.     5   0   0   0   0   1   0     3   0   0   1     0     0     1
     6.     6   0   0   0   0   0   1     3   0   0   1     0     0    -1
. set seed 123

. gen x = 12 - int(2*runiform())

. expand x
(63 observations created)

. gen y = runiform()

. anova y woman twin

                     Number of obs =      69    R-squared     =  0.0251
                     Root MSE      = .304633    Adj R-squared = -0.0523
              Source |  Partial SS    df      MS        F     Prob>F
          -----------+----------------------------------------------
               Model |   .15054776    5   .03010955    0.32   0.8964
                     |
               woman |   .15054776    5   .03010955    0.32   0.8964
                twin |           0    0
                     |
            Residual |   5.8464905   63   .09280144
          -----------+----------------------------------------------
               Total |   5.9970382   68   .08819174
. regress y w1-w5 t1-t3
note: w2 omitted because of collinearity
note: w3 omitted because of collinearity
note: t1 omitted because of collinearity
      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(5, 63)        =      0.32
       Model |  .150547762         5  .030109552   Prob > F        =    0.8964
    Residual |  5.84649045        63  .092801436   R-squared       =    0.0251
-------------+----------------------------------   Adj R-squared   =   -0.0523
       Total |  5.99703821        68  .088191738   Root MSE        =    .30463
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          w1 |   .0516343   .1271611     0.41   0.686    -.2024769    .3057455
          w2 |          0  (omitted)
          w3 |          0  (omitted)
          w4 |   .0359635   .1298961     0.28   0.783    -.2236131      .29554
          w5 |   .0800831    .124366     0.64   0.522    -.1684426    .3286087
          t1 |          0  (omitted)
          t2 |  -.0798703   .1298961    -0.61   0.541    -.3394469    .1797063
          t3 |  -.0642359   .1271611    -0.51   0.615    -.3183471    .1898752
       _cons |   .5206881   .0918504     5.67   0.000     .3371397    .7042364

The regress model is obviously collinear, but so was the anova model. When terms are collinear, anova keeps them from left to right: woman was listed first, so it absorbed all the degrees of freedom, and the twin effect (i.e., all the twin dummies) was "omitted".
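Why twin has nothing left to explain follows from simple linear algebra: each twin dummy is the sum of its pair's two woman dummies, so the twin dummies lie entirely in the column space of the woman dummies. This check is not part of the FAQ, but a small NumPy sketch of the same 6-women, 3-pairs layout illustrates it:

```python
import numpy as np

# Six women nested in three twin pairs, matching the data above.
woman = np.arange(6)
twin = np.array([0, 0, 1, 1, 2, 2])

W = (woman[:, None] == np.arange(6)).astype(float)  # woman dummies w1-w6
T = (twin[:, None] == np.arange(3)).astype(float)   # twin dummies t1-t3

# Each twin dummy is the sum of its pair's two woman dummies, so adding
# T to W does not enlarge the column space: the rank stays at 6.
print(np.linalg.matrix_rank(W))                  # 6
print(np.linalg.matrix_rank(np.hstack([W, T])))  # 6
```

Since the stacked matrix has rank 6, the three twin columns contribute zero extra degrees of freedom once woman is in the model, which is exactly what the anova table shows.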

. anova y twin woman

Number of obs =          69   R-squared     =  0.0251
Root MSE      =     .304633   Adj R-squared = -0.0523

              Source |  Partial SS    df      MS        F     Prob>F
          -----------+----------------------------------------------
               Model |   .15054776    5   .03010955    0.32   0.8964
                     |
                twin |   .09122261    2   .04561131    0.49   0.6140
               woman |   .06089443    3   .02029814    0.22   0.8831
                     |
            Residual |   5.8464905   63   .09280144
          -----------+----------------------------------------------
               Total |   5.9970382   68   .08819174

Again, anova keeps terms from left to right; this time twin came first, so only three of the six woman dummies were kept.

. anova y twin twin#woman

Number of obs =         69   R-squared     =  0.0251
Root MSE      =    .304633   Adj R-squared = -0.0523

              Source |  Partial SS    df      MS        F     Prob>F
          -----------+----------------------------------------------
               Model |   .15054776    5   .03010955    0.32   0.8964
                     |
                twin |    .0872024    2    .0436012    0.47   0.6273
          twin#woman |   .06089443    3   .02029814    0.22   0.8831
                     |
            Residual |   5.8464905   63   .09280144
          -----------+----------------------------------------------
               Total |   5.9970382   68   .08819174

Below, we do the equivalent regression.

. regress y t1 t2 t1w1 t2w3 t3w5

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(5, 63)        =      0.32
       Model |  .150547762         5  .030109552   Prob > F        =    0.8964
    Residual |  5.84649045        63  .092801436   R-squared       =    0.0251
-------------+----------------------------------   Adj R-squared   =   -0.0523
       Total |  5.99703821        68  .088191738   Root MSE        =    .30463

           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          t1 |   .0500115   .0889338     0.56   0.576    -.1277084    .2277315
          t2 |  -.0376941   .0899165    -0.42   0.676    -.2173779    .1419896
        t1w1 |   .0258171   .0635806     0.41   0.686    -.1012385    .1528727
        t2w3 |  -.0179817    .064948    -0.28   0.783      -.14777    .1118066
        t3w5 |   .0400415    .062183     0.64   0.522    -.0842213    .1643044
       _cons |   .4964937    .062183     7.98   0.000     .3722309    .6207565
. test t1 t2

 ( 1)  t1 = 0
 ( 2)  t2 = 0

       F(  2,    63) =    0.47
            Prob > F =    0.6273

I made the interactions orthogonal, which is essentially what anova does.
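Orthogonality here means each interaction contrast (+1 for one woman in a pair, -1 for her twin, 0 elsewhere) has zero inner product with every twin dummy and with the constant. This quick NumPy check is not from the FAQ, but it verifies the claim on the same layout:

```python
import numpy as np

twin = np.array([0, 0, 1, 1, 2, 2])
T = (twin[:, None] == np.arange(3)).astype(float)  # twin dummies t1-t3

# Interaction contrasts t1w1, t2w3, t3w5 from the listing above:
# +1 for the first woman in each pair, -1 for the second, 0 elsewhere.
C = np.array([[ 1,  0,  0],
              [-1,  0,  0],
              [ 0,  1,  0],
              [ 0, -1,  0],
              [ 0,  0,  1],
              [ 0,  0, -1]], dtype=float)

# Every contrast is orthogonal to every twin dummy and to the constant,
# so the cross-product matrix T'C is identically zero.
print(T.T @ C)
```

Because of this orthogonality, the coefficients (and tests) for the contrasts do not change when the twin dummies enter or leave the model.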

. test t1w1 t2w3 t3w5

( 1)  t1w1 = 0
( 2)  t2w3 = 0
( 3)  t3w5 = 0

F(  3,    63) =    0.22
Prob > F =    0.8831

I hope you follow the above Wald tests. If not, just know that the anova partial SS and their tests are equivalent to them. I call them "added-last" tests.
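An "added-last" test compares the residual SS of the full model with that of the model omitting only the tested terms, everything else staying in. The following NumPy sketch (hypothetical data, not the FAQ's; `rss` is a helper defined here, not a library routine) computes such an F statistic by hand:

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

rng = np.random.default_rng(123)
n = 69
Z = rng.normal(size=(n, 3))   # terms kept in both models (e.g., the contrasts)
A = rng.normal(size=(n, 2))   # terms being tested (e.g., t1, t2)
y = rng.normal(size=n)
const = np.ones((n, 1))

X_full = np.hstack([const, Z, A])
X_red = np.hstack([const, Z])        # "added-last": drop only the tested terms

q = A.shape[1]                        # number of restrictions
df_resid = n - X_full.shape[1]        # 69 - 6 = 63, as in the output above
F = ((rss(X_red, y) - rss(X_full, y)) / q) / (rss(X_full, y) / df_resid)
```

With orthogonal columns, as in the regression above, this added-last SS coincides with the sequential SS, which is why the Wald test of t1 = t2 = 0 reproduces the anova twin test.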

The test of t1 = t2 = 0 is a test of

        y = t1w1 t2w3 t3w5 t1 t2

vs.

        y = t1w1 t2w3 t3w5

The following explains sequential SS:

. anova y twin twin#woman, seq

Number of obs =          69    R-squared     =  0.0251
Root MSE      =     .304633    Adj R-squared = -0.0523

              Source |    Seq. SS    df      MS        F     Prob>F
          -----------+---------------------------------------------
               Model |  .15054776    5   .03010955    0.32   0.8964
                     |
                twin |  .08965333    2   .04482667    0.48   0.6192
          twin#woman |  .06089443    3   .02029814    0.22   0.8831
                     |
            Residual |  5.8464905   63   .09280144
          -----------+---------------------------------------------
               Total |  5.9970382   68   .08819174
. anova y twin

                     Number of obs =      69    R-squared     =  0.0149
                     Root MSE      = .299175    Adj R-squared = -0.0149

              Source |  Partial SS    df      MS        F     Prob>F
          -----------+----------------------------------------------
               Model |   .08965333    2   .04482667    0.50   0.6083
                     |
                twin |   .08965333    2   .04482667    0.50   0.6083
                     |
            Residual |   5.9073849   66   .08950583
          -----------+----------------------------------------------
               Total |   5.9970382   68   .08819174

The twin SS is the same in the two preceding anovas. The difference in the tests lies in the denominator of the F statistic: the residual SS and df are obviously different. I (and my profs) prefer the second for testing "main effects".
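The point can be made concrete outside Stata. In a sequential (Type I) decomposition, each term's SS is the drop in residual SS when that term joins the terms listed before it; the two anovas above share the same numerator for twin but divide by different residuals. A NumPy sketch with hypothetical data (names and sizes are my own, not from the FAQ):

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

rng = np.random.default_rng(1)
n = 30
A = rng.normal(size=(n, 2))   # first term (e.g., twin, 2 df)
B = rng.normal(size=(n, 3))   # second term (e.g., twin#woman, 3 df)
y = rng.normal(size=n)
const = np.ones((n, 1))

rss0 = rss(const, y)
rssA = rss(np.hstack([const, A]), y)
rssAB = rss(np.hstack([const, A, B]), y)

seq_ss_A = rss0 - rssA    # SS for A, entered first
seq_ss_B = rssA - rssAB   # SS for B, entered after A

# Same numerator, two possible denominators:
F_full = (seq_ss_A / 2) / (rssAB / (n - 6))  # residual from the full model
F_sub = (seq_ss_A / 2) / (rssA / (n - 3))    # residual from the A-only model
```

`F_full` corresponds to the seq anova above; `F_sub` corresponds to fitting the smaller model directly, as in `anova y twin`.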

Clearly, I take a model-building approach to anova and think in terms of the equivalent regression.

You can type regress after running anova to view an equivalent regression.