
How does the anova command handle collinearity?

Title   The anova command and collinearity
Author  William Sribney, StataCorp

Here is an example that illustrates what happens.

. input woman twin

         woman       twin
  1.       1     1  
  2.       2     1  
  3.       3     2  
  4.       4     2  
  5.       5     3  
  6.       6     3  
  7. end

. tab woman, gen(w)

      woman |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          1       16.67       16.67
          2 |          1       16.67       33.33
          3 |          1       16.67       50.00
          4 |          1       16.67       66.67
          5 |          1       16.67       83.33
          6 |          1       16.67      100.00
------------+-----------------------------------
      Total |          6      100.00
. tab twin, gen(t)

       twin |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2       33.33       33.33
          2 |          2       33.33       66.67
          3 |          2       33.33      100.00
------------+-----------------------------------
      Total |          6      100.00
. gen t1w1 = t1*w1 - t1*w2

. gen t2w3 = t2*w3 - t2*w4

. gen t3w5 = t3*w5 - t3*w6

. list w* t*, nodisplay sep(0)
       woman   w1   w2   w3   w4   w5   w6   twin   t1   t2   t3   t1w1   t2w3   t3w5
  1.       1    1    0    0    0    0    0      1    1    0    0      1      0      0
  2.       2    0    1    0    0    0    0      1    1    0    0     -1      0      0
  3.       3    0    0    1    0    0    0      2    0    1    0      0      1      0
  4.       4    0    0    0    1    0    0      2    0    1    0      0     -1      0
  5.       5    0    0    0    0    1    0      3    0    0    1      0      0      1
  6.       6    0    0    0    0    0    1      3    0    0    1      0      0     -1
. set seed 123

. gen x = 12 - int(2*runiform())

. expand x
(63 observations created)

. gen y = runiform()

. anova y woman twin

                           Number of obs =          69   R-squared     =  0.0251
                           Root MSE      =     .304633   Adj R-squared = -0.0523

                  Source |  Partial SS         df         MS        F     Prob>F
              -----------+-----------------------------------------------------
                   Model |   .15054776          5   .03010955      0.32    0.8964
                         |
                   woman |   .15054776          5   .03010955      0.32    0.8964
                    twin |           0          0
                         |
                Residual |   5.8464905         63   .09280144
              -----------+-----------------------------------------------------
                   Total |   5.9970382         68   .08819174
. regress y w1-w5 t1-t3
note: w2 omitted because of collinearity
note: w3 omitted because of collinearity
note: t1 omitted because of collinearity
      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(5, 63)        =      0.32
       Model |  .150547762         5  .030109552   Prob > F        =    0.8964
    Residual |  5.84649045        63  .092801436   R-squared       =    0.0251
-------------+----------------------------------   Adj R-squared   =   -0.0523
       Total |  5.99703821        68  .088191738   Root MSE        =    .30463

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          w1 |   .0516343   .1271611     0.41   0.686    -.2024769    .3057455
          w2 |          0  (omitted)
          w3 |          0  (omitted)
          w4 |   .0359635   .1298961     0.28   0.783    -.2236131      .29554
          w5 |   .0800831    .124366     0.64   0.522    -.1684426    .3286087
          t1 |          0  (omitted)
          t2 |  -.0798703   .1298961    -0.61   0.541    -.3394469    .1797063
          t3 |  -.0642359   .1271611    -0.51   0.615    -.3183471    .1898752
       _cons |   .5206881   .0918504     5.67   0.000     .3371397    .7042364
------------------------------------------------------------------------------

The regress model is obviously collinear, but so was the anova model. The anova command keeps terms from left to right, dropping whatever later terms are collinear with the terms it has already kept. Hence, it “omitted” the twin effect (i.e., all the twin dummies) because twin was specified after woman.
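
If you want to see exactly which columns get dropped, one option (not part of the original example) is to fit the same model with regress and factor-variable notation; because the twin indicators come second, they should be the ones flagged as collinear:

. regress y i.woman i.twin    // expect notes that the twin indicators are omitted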

. anova y twin woman

                           Number of obs =          69   R-squared     =  0.0251
                           Root MSE      =     .304633   Adj R-squared = -0.0523

                  Source |  Partial SS         df         MS        F     Prob>F
              -----------+-----------------------------------------------------
                   Model |   .15054776          5   .03010955      0.32    0.8964
                         |
                    twin |   .09122261          2   .04561131      0.49    0.6140
                   woman |   .06089443          3   .02029814      0.22    0.8831
                         |
                Residual |   5.8464905         63   .09280144
              -----------+-----------------------------------------------------
                   Total |   5.9970382         68   .08819174

Again, anova keeps terms from left to right; here it kept only three of the six woman dummies.

. anova y twin twin#woman

                           Number of obs =         69   R-squared     =  0.0251
                           Root MSE      =    .304633   Adj R-squared = -0.0523

                  Source |  Partial SS         df         MS        F     Prob>F
              -----------+-----------------------------------------------------
                   Model |   .15054776          5   .03010955      0.32    0.8964
                         |
                    twin |    .0872024          2    .0436012      0.47    0.6273
              twin#woman |   .06089443          3   .02029814      0.22    0.8831
                         |
                Residual |   5.8464905         63   .09280144
              -----------+-----------------------------------------------------
                   Total |   5.9970382         68   .08819174

Below, we do the equivalent regression.

. regress y t1 t2 t1w1 t2w3 t3w5

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(5, 63)        =      0.32
       Model |  .150547762         5  .030109552   Prob > F        =    0.8964
    Residual |  5.84649045        63  .092801436   R-squared       =    0.0251
-------------+----------------------------------   Adj R-squared   =   -0.0523
       Total |  5.99703821        68  .088191738   Root MSE        =    .30463

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          t1 |   .0500115   .0889338     0.56   0.576    -.1277084    .2277315
          t2 |  -.0376941   .0899165    -0.42   0.676    -.2173779    .1419896
        t1w1 |   .0258171   .0635806     0.41   0.686    -.1012385    .1528727
        t2w3 |  -.0179817    .064948    -0.28   0.783      -.14777    .1118066
        t3w5 |   .0400415    .062183     0.64   0.522    -.0842213    .1643044
       _cons |   .4964937    .062183     7.98   0.000     .3722309    .6207565
------------------------------------------------------------------------------

. test t1 t2

 ( 1)  t1 = 0
 ( 2)  t2 = 0

       F(  2,    63) =    0.47
            Prob > F =    0.6273

I made the interactions orthogonal, which is essentially what anova does.

. test t1w1 t2w3 t3w5

 ( 1)  t1w1 = 0
 ( 2)  t2w3 = 0
 ( 3)  t3w5 = 0

       F(  3,    63) =    0.22
            Prob > F =    0.8831

Hopefully, you understand the above Wald tests. If not, note that the partial SS in the anova output and their F tests are equivalent to these Wald tests. I call them “added-last” tests.

The test of t1 = t2 = 0 is a test of

y = t1w1 t2w3 t3w5 t1 t2
vs.
y = t1w1 t2w3 t3w5
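
If it helps, this added-last test can also be reproduced by hand from the residual sums of squares of the full and reduced regressions. Below is a minimal sketch using the e(rss) and e(df_r) results that regress stores (these commands are illustrative and not part of the original output):

. quietly regress y t1w1 t2w3 t3w5 t1 t2

. scalar rss_full = e(rss)

. scalar df_resid = e(df_r)

. quietly regress y t1w1 t2w3 t3w5

. scalar rss_red = e(rss)

. display "added-last F(2,63) = " ((rss_red - rss_full)/2) / (rss_full/df_resid)

The displayed value should match the F(2, 63) = 0.47 reported by test t1 t2 above.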

The following explains sequential SS:

. anova y twin twin#woman, seq

                           Number of obs =          69    R-squared     =  0.0251
                           Root MSE      =     .304633    Adj R-squared = -0.0523

                  Source |     Seq. SS         df         MS        F     Prob>F
              -----------+-----------------------------------------------------
                   Model |   .15054776          5   .03010955      0.32    0.8964
                         |
                    twin |   .08965333          2   .04482667      0.48    0.6192
              twin#woman |   .06089443          3   .02029814      0.22    0.8831
                         |
                Residual |   5.8464905         63   .09280144
              -----------+-----------------------------------------------------
                   Total |   5.9970382         68   .08819174
. anova y twin

                           Number of obs =          69   R-squared     =  0.0149
                           Root MSE      =     .299175   Adj R-squared = -0.0149

                  Source |  Partial SS         df         MS        F     Prob>F
              -----------+-----------------------------------------------------
                   Model |   .08965333          2   .04482667      0.50    0.6083
                         |
                    twin |   .08965333          2   .04482667      0.50    0.6083
                         |
                Residual |   5.9073849         66   .08950583
              -----------+-----------------------------------------------------
                   Total |   5.9970382         68   .08819174

The twin SS is the same in the two preceding anovas. The difference between the tests lies in the denominator of the F statistic: the residual mean squares are obviously different. I (and my professors) prefer the second for testing “main effects”.
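
To make the difference concrete, the two F statistics for twin can be recomputed from the mean squares shown above (the numbers are simply copied from the output; this check is not part of the original FAQ):

. display "F using the residual from the full model:      " .04482667/.09280144

. display "F using the residual from the twin-only model: " .04482667/.08950583

These should reproduce the 0.48 and 0.50 reported in the two tables.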

Clearly, I take a model-building approach to anova and think in terms of the equivalent regression.

You can type regress after running anova to view an equivalent regression.
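
For example (a sketch, not part of the original FAQ; regress typed with no arguments redisplays the current estimates as a regression table):

. anova y twin twin#woman

. regress    // shows the coefficients, standard errors, and tests behind the anova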