How does the anova command handle collinearity?

Title   The anova command and collinearity
Author William Sribney, StataCorp
Date March 1997; updated July 2011

Here is an example that illustrates what happens.

. input woman twin

         woman       twin
  1.       1     1  
  2.       2     1  
  3.       3     2  
  4.       4     2  
  5.       5     3  
  6.       6     3  
  7. end

. tab woman, gen(w)

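      woman |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          1       16.67       16.67
          2 |          1       16.67       33.33
          3 |          1       16.67       50.00
          4 |          1       16.67       66.67
          5 |          1       16.67       83.33
          6 |          1       16.67      100.00
------------+-----------------------------------
      Total |          6      100.00

. tab twin, gen(t)

       twin |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2       33.33       33.33
          2 |          2       33.33       66.67
          3 |          2       33.33      100.00
------------+-----------------------------------
      Total |          6      100.00
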
. gen t1w1 = t1*w1 - t1*w2

. gen t2w3 = t2*w3 - t2*w4

. gen t3w5 = t3*w5 - t3*w6

. list w* t*, nodisplay sep(0)

      woman  w1  w2  w3  w4  w5  w6  twin  t1  t2  t3  t1w1  t2w3  t3w5
  1.      1   1   0   0   0   0   0     1   1   0   0     1     0     0
  2.      2   0   1   0   0   0   0     1   1   0   0    -1     0     0
  3.      3   0   0   1   0   0   0     2   0   1   0     0     1     0
  4.      4   0   0   0   1   0   0     2   0   1   0     0    -1     0
  5.      5   0   0   0   0   1   0     3   0   0   1     0     0     1
  6.      6   0   0   0   0   0   1     3   0   0   1     0     0    -1

. set seed 123

. gen x = 12 - int(2*runiform())

. expand x
(63 observations created)

. gen y = runiform()

. anova y woman twin

                           Number of obs =      69     R-squared     =  0.1273
                           Root MSE      = .282929     Adj R-squared =  0.0580

                  Source |  Partial SS    df          MS          F   Prob > F
              -----------+----------------------------------------------------
                   Model |   .73544635     5   .14708927       1.84     0.1183
                         |
                   woman |   .73544635     5   .14708927       1.84     0.1183
                    twin |           0     0
                         |
                Residual |  5.04305934    63  .080048561
              -----------+----------------------------------------------------
                   Total |  5.77850569    68  .084978025

. regress y w1-w5 t1-t3
note: w1 omitted because of collinearity
note: w3 omitted because of collinearity
note: t1 omitted because of collinearity

      Source |       SS       df       MS              Number of obs =      69
-------------+------------------------------           F(  5,    63) =    1.84
       Model |   .73544635     5   .14708927           Prob > F      =  0.1183
    Residual |  5.04305934    63  .080048561           R-squared     =  0.1273
-------------+------------------------------           Adj R-squared =  0.0580
       Total |  5.77850569    68  .084978025           Root MSE      =  .28293

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          w1 |          0  (omitted)
          w2 |  -.1064115    .118101    -0.90   0.371    -.3424176    .1295946
          w3 |          0  (omitted)
          w4 |  -.0591048   .1155051    -0.51   0.611    -.2899233    .1717137
          w5 |   .3119168   .1206411     2.59   0.012     .0708348    .5529989
          t1 |          0  (omitted)
          t2 |  -.1238703    .118101    -1.05   0.298    -.3598764    .1121358
          t3 |  -.2724871   .1206411    -2.26   0.027    -.5135692    -.031405
       _cons |   .5838711   .0853062     6.84   0.000     .4134004    .7543419
------------------------------------------------------------------------------

The regress model is obviously collinear, but so was the anova model. The anova command keeps terms from left to right. Hence, it “omitted” the twin effect (i.e., all the twin dummies).
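
The source of the collinearity can be checked directly. As a small sketch, assuming the dummies generated above are still in memory: each twin dummy is the sum of the two woman dummies for that pair, so the assertions below should pass silently.

```stata
* each twin indicator equals the sum of the indicators for its two women
assert t1 == w1 + w2
assert t2 == w3 + w4
assert t3 == w5 + w6
```

Because twin is completely determined by woman, whichever of the two terms comes second has nothing (or less) left to explain.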

. anova y twin woman

                           Number of obs =      69     R-squared     =  0.1273
                           Root MSE      = .282929     Adj R-squared =  0.0580

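                  Source |  Partial SS    df          MS          F   Prob > F
              -----------+----------------------------------------------------
                   Model |   .73544635     5   .14708927       1.84     0.1183
                         |
                    twin |  .425327562     2  .212663781       2.66     0.0780
                   woman |  .621053463     3  .207017821       2.59     0.0609
                         |
                Residual |  5.04305934    63  .080048561
              -----------+----------------------------------------------------
                   Total |  5.77850569    68  .084978025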

Again, anova keeps terms from left to right. With twin entered first (2 df), it kept only three of the six woman dummies, which is why woman now has 3 df.

. anova y twin twin#woman

                           Number of obs =      69     R-squared     =  0.1273
                           Root MSE      = .282929     Adj R-squared =  0.0580

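                  Source |  Partial SS    df          MS          F   Prob > F
              -----------+----------------------------------------------------
                   Model |   .73544635     5   .14708927       1.84     0.1183
                         |
                    twin |  .120036739     2   .06001837       0.75     0.4766
              twin#woman |  .621053463     3  .207017821       2.59     0.0609
                         |
                Residual |  5.04305934    63  .080048561
              -----------+----------------------------------------------------
                   Total |  5.77850569    68  .084978025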

Below, we do the equivalent regression.

. regress y t1 t2 t1w1 t2w3 t3w5

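      Source |       SS       df       MS              Number of obs =      69
-------------+------------------------------           F(  5,    63) =    1.84
       Model |   .73544635     5   .14708927           Prob > F      =  0.1183
    Residual |  5.04305934    63  .080048561           R-squared     =  0.1273
-------------+------------------------------           Adj R-squared =  0.0580
       Total |  5.77850569    68  .084978025           Root MSE      =  .28293

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          t1 |   .0633229   .0844129     0.75   0.456    -.1053628    .2320086
          t2 |  -.0368941     .08351    -0.44   0.660    -.2037756    .1299874
        t1w1 |   .0532058   .0590505     0.90   0.371    -.0647973    .1712088
        t2w3 |   .0295524   .0577525     0.51   0.611    -.0858569    .1449617
        t3w5 |   .1559584   .0603206     2.59   0.012     .0354174    .2764995
       _cons |   .4673425   .0603206     7.75   0.000     .3468014    .5878835
------------------------------------------------------------------------------

. test t1 t2

 ( 1)  t1 = 0
 ( 2)  t2 = 0

       F(  2,    63) =    0.75
            Prob > F =    0.4766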

I made the interactions orthogonal, which is essentially what anova does.

. test t1w1 t2w3 t3w5

 ( 1)  t1w1 = 0
 ( 2)  t2w3 = 0
 ( 3)  t3w5 = 0

       F(  3,    63) =    2.59
            Prob > F =    0.0609

The Wald tests above are equivalent to the tests that anova reports from its partial SS: each term is tested as if it were added to the model last. I call them “added-last” tests.

The test of t1 = t2 = 0 is a test of

y = t1w1 t2w3 t3w5 t1 t2
vs.
y = t1w1 t2w3 t3w5
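
The same comparison can be done by hand. The sketch below assumes the variables created earlier are still in memory; the scalar names are my own. It fits both models and forms the F statistic from the change in residual SS, which should reproduce the F(2, 63) reported by test t1 t2.

```stata
* full model: interactions plus the twin dummies
quietly regress y t1w1 t2w3 t3w5 t1 t2
scalar rss_full = e(rss)
scalar df_full  = e(df_r)

* reduced model: interactions only
quietly regress y t1w1 t2w3 t3w5
scalar rss_red = e(rss)

* F = [(RSS_reduced - RSS_full)/2] / [RSS_full/df_full]
display ((rss_red - rss_full)/2) / (rss_full/df_full)
```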

The following explains sequential SS:

. anova y twin twin#woman, seq

                           Number of obs =      69     R-squared     =  0.1273
                           Root MSE      = .282929     Adj R-squared =  0.0580

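                  Source |     Seq. SS    df          MS          F   Prob > F
              -----------+----------------------------------------------------
                   Model |   .73544635     5   .14708927       1.84     0.1183
                         |
                    twin |  .114392887     2  .057196444       0.71     0.4933
              twin#woman |  .621053463     3  .207017821       2.59     0.0609
                         |
                Residual |  5.04305934    63  .080048561
              -----------+----------------------------------------------------
                   Total |  5.77850569    68  .084978025

. anova y twin

                           Number of obs =      69     R-squared     =  0.0198
                           Root MSE      =  .29295     Adj R-squared = -0.0099

                  Source |  Partial SS    df          MS          F   Prob > F
              -----------+----------------------------------------------------
                   Model |  .114392887     2  .057196444       0.67     0.5169
                         |
                    twin |  .114392887     2  .057196444       0.67     0.5169
                         |
                Residual |   5.6641128    66  .085819891
              -----------+----------------------------------------------------
                   Total |  5.77850569    68  .084978025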

The twin SS are the same in the two preceding anovas (.114392887). The tests differ only in the denominator of the F: the residual MS is .080048561 on 63 df in the first and .085819891 on 66 df in the second. I (and my profs) prefer the second for testing “main effects”.
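
To see the arithmetic, the two F statistics can be recomputed from the numbers in the tables. This is only a sketch using display and the Ftail() function; the results should agree with the reported F values of 0.71 and 0.67 and with their p-values, up to rounding.

```stata
* twin MS is the same in both tables: .114392887/2 = .057196444

* F against the residual MS of the twin twin#woman model (63 df)
display (.114392887/2) / .080048561

* F against the residual MS of the twin-only model (66 df)
display (.114392887/2) / .085819891

* p-values from the upper tail of the F distribution
display Ftail(2, 63, (.114392887/2) / .080048561)
display Ftail(2, 66, (.114392887/2) / .085819891)
```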

Clearly, I take a model-building approach to anova and think in terms of the equivalent regression.

You can type regress after running anova to view an equivalent regression.
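
For example, with the data above still in memory:

```stata
anova y twin twin#woman

* regress with no arguments redisplays the last fit as a regression table
regress
```

The coefficients shown this way need not match my hand-built regression term by term, because anova chooses its own parameterization, but the fit (model SS, residual SS, and R-squared) should be the same.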
