| Title | The anova command and collinearity |
|---|---|
| Author | William Sribney, StataCorp |
| Date | March 1997; updated July 2011 |

Here is an example that illustrates what happens.

    . input woman twin

             woman       twin
      1. 1 1
      2. 2 1
      3. 3 2
      4. 4 2
      5. 5 3
      6. 6 3
      7. end

    . tab woman, gen(w)

| woman | Freq. | Percent | Cum. |
|---|---|---|---|
| 1 | 1 | 16.67 | 16.67 |
| 2 | 1 | 16.67 | 33.33 |
| 3 | 1 | 16.67 | 50.00 |
| 4 | 1 | 16.67 | 66.67 |
| 5 | 1 | 16.67 | 83.33 |
| 6 | 1 | 16.67 | 100.00 |
| Total | 6 | 100.00 | |

    . tab twin, gen(t)

| twin | Freq. | Percent | Cum. |
|---|---|---|---|
| 1 | 2 | 33.33 | 33.33 |
| 2 | 2 | 33.33 | 66.67 |
| 3 | 2 | 33.33 | 100.00 |
| Total | 6 | 100.00 | |

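| | woman | w1 | w2 | w3 | w4 | w5 | w6 | twin | t1 | t2 | t3 | t1w1 | t2w3 | t3w5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1. | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2. | 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | -1 | 0 | 0 |
| 3. | 3 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4. | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 1 | 0 | 0 | -1 | 0 |
| 5. | 5 | 0 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 0 | 1 | 0 | 0 | 1 |
| 6. | 6 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 0 | 0 | 1 | 0 | 0 | -1 |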
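The commands that created the last three columns of the listing are not shown. The listed values are consistent with within-pair differences of the woman dummies, so the following is a minimal sketch under that assumption, not necessarily the commands originally used. It rebuilds the six-row toy dataset so that it can be run on its own.

```stata
* Rebuild the six-row toy dataset from the listing
clear
input woman twin
1 1
2 1
3 2
4 2
5 3
6 3
end

* Indicator (dummy) variables w1-w6 and t1-t3
quietly tabulate woman, generate(w)
quietly tabulate twin, generate(t)

* Assumed construction: +1 for the first woman of each pair, -1 for the second
generate t1w1 = w1 - w2
generate t2w3 = w3 - w4
generate t3w5 = w5 - w6

list woman twin t1w1 t2w3 t3w5
```

Each contrast sums to zero within its twin pair, which is what makes it orthogonal to the twin dummies and to the constant.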

    . anova y woman twin

| Source | Partial SS | df | MS | F | Prob > F |
|---|---|---|---|---|---|
| Model | .73544635 | 5 | .14708927 | 1.84 | 0.1183 |
| woman | .73544635 | 5 | .14708927 | 1.84 | 0.1183 |
| twin | 0 | 0 | | | |
| Residual | 5.04305934 | 63 | .080048561 | | |
| Total | 5.77850569 | 68 | .084978025 | | |

| Source | SS | df | MS |
|---|---|---|---|
| Model | .73544635 | 5 | .14708927 |
| Residual | 5.04305934 | 63 | .080048561 |
| Total | 5.77850569 | 68 | .084978025 |

    Number of obs = 69        F(5, 63)      = 1.84
    Prob > F      = 0.1183    R-squared     = 0.1273
    Adj R-squared = 0.0580    Root MSE      = .28293

| y | Coef. | Std. Err. | t | P>\|t\| | [95% Conf. Interval] |
|---|---|---|---|---|---|
| w1 | 0 | (omitted) | | | |
| w2 | -.1064115 | .118101 | -0.90 | 0.371 | -.3424176 .1295946 |
| w3 | 0 | (omitted) | | | |
| w4 | -.0591048 | .1155051 | -0.51 | 0.611 | -.2899233 .1717137 |
| w5 | .3119168 | .1206411 | 2.59 | 0.012 | .0708348 .5529989 |
| t1 | 0 | (omitted) | | | |
| t2 | -.1238703 | .118101 | -1.05 | 0.298 | -.3598764 .1121358 |
| t3 | -.2724871 | .1206411 | -2.26 | 0.027 | -.5135692 -.031405 |
| _cons | .5838711 | .0853062 | 6.84 | 0.000 | .4134004 .7543419 |

The regress model is obviously collinear, but so was the anova model. The anova command keeps terms from left to right. Hence, it “omitted” the twin effect (i.e., all the twin dummies).
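To see why the twin effect has nothing left to explain, it may help to write out the dependence among the dummies. The relations below are read directly off the listing of the toy dataset; the subscripted symbols are just shorthand for the variables w1-w6 and t1-t3.

$$
t_1 = w_1 + w_2, \qquad t_2 = w_3 + w_4, \qquad t_3 = w_5 + w_6, \qquad w_1 + w_2 + \cdots + w_6 = 1
$$

Every twin dummy is an exact sum of two woman dummies. Once the constant and the woman dummies (5 df) are in the model, the twin dummies add no new columns, which is why twin shows 0 SS on 0 df.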

    . anova y twin woman

    Number of obs = 69        R-squared     = 0.1273
    Root MSE      = .282929   Adj R-squared = 0.0580

| Source | Partial SS | df | MS | F | Prob > F |
|---|---|---|---|---|---|
| Model | .73544635 | 5 | .14708927 | 1.84 | 0.1183 |
| twin | .425327562 | 2 | .212663781 | 2.66 | 0.0780 |
| woman | .621053463 | 3 | .207017821 | 2.59 | 0.0609 |
| Residual | 5.04305934 | 63 | .080048561 | | |
| Total | 5.77850569 | 68 | .084978025 | | |

Again, anova keeps terms from left to right; here twin came first and kept its 2 df, so anova kept only three of the six woman dummies.
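The degrees of freedom in both orderings can be checked by counting linearly independent columns. The sketch below is an illustration on the six-row toy dataset, not part of the original analysis; the variable `one` and the matrix names `T` and `TW` are introduced here only for the example. It uses Mata's `rank()` function.

```stata
* Rebuild the toy dataset and its dummies
clear
input woman twin
1 1
2 1
3 2
4 2
5 3
6 3
end
quietly tabulate woman, generate(w)
quietly tabulate twin, generate(t)
generate byte one = 1          // explicit constant column (hypothetical name)

mata:
// constant + twin dummies
T  = st_data(., "one t1 t2 t3")
// constant + twin dummies + woman dummies
TW = st_data(., "one t1 t2 t3 w1 w2 w3 w4 w5 w6")
rank(T)      // 3: the constant plus 2 df for twin
rank(TW)     // 6: the woman dummies add only 3 more independent columns
end
```

The ranks 3 and 6 correspond to the 2 df for twin and the 3 df for woman in the table above, for 5 model df in total.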

    . anova y twin twin#woman

    Number of obs = 69        R-squared     = 0.1273
    Root MSE      = .282929   Adj R-squared = 0.0580

| Source | Partial SS | df | MS | F | Prob > F |
|---|---|---|---|---|---|
| Model | .73544635 | 5 | .14708927 | 1.84 | 0.1183 |
| twin | .120036739 | 2 | .06001837 | 0.75 | 0.4766 |
| twin#woman | .621053463 | 3 | .207017821 | 2.59 | 0.0609 |
| Residual | 5.04305934 | 63 | .080048561 | | |
| Total | 5.77850569 | 68 | .084978025 | | |

Below, we do the equivalent regression.

    . regress y t1 t2 t1w1 t2w3 t3w5

| Source | SS | df | MS |
|---|---|---|---|
| Model | .73544635 | 5 | .14708927 |
| Residual | 5.04305934 | 63 | .080048561 |
| Total | 5.77850569 | 68 | .084978025 |

    Number of obs = 69        F(5, 63)      = 1.84
    Prob > F      = 0.1183    R-squared     = 0.1273
    Adj R-squared = 0.0580    Root MSE      = .28293

| y | Coef. | Std. Err. | t | P>\|t\| | [95% Conf. Interval] |
|---|---|---|---|---|---|
| t1 | .0633229 | .0844129 | 0.75 | 0.456 | -.1053628 .2320086 |
| t2 | -.0368941 | .08351 | -0.44 | 0.660 | -.2037756 .1299874 |
| t1w1 | .0532058 | .0590505 | 0.90 | 0.371 | -.0647973 .1712088 |
| t2w3 | .0295524 | .0577525 | 0.51 | 0.611 | -.0858569 .1449617 |
| t3w5 | .1559584 | .0603206 | 2.59 | 0.012 | .0354174 .2764995 |
| _cons | .4673425 | .0603206 | 7.75 | 0.000 | .3468014 .5878835 |

I made the interactions orthogonal, which is essentially what anova does.

    . test t1w1 t2w3 t3w5

     ( 1)  t1w1 = 0
     ( 2)  t2w3 = 0
     ( 3)  t3w5 = 0

           F(  3,    63) =    2.59
                Prob > F =    0.0609

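The Wald test above reproduces the twin#woman line of the anova table: F(3, 63) = 2.59, Prob > F = 0.0609. The anova partial SS and their F tests are equivalent to Wald tests of this kind, each testing a term as if it were added to the model last. I call them “added-last” tests.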

The test of **t1 = t2 = 0** is a test of

**y = t1w1 t2w3 t3w5 t1 t2**

vs.

**y = t1w1 t2w3 t3w5**
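Written as a comparison of residual sums of squares, the added-last test for twin can be sketched as follows. The notation (RSS for residual sum of squares, q for the number of restrictions) is mine, not from the output; the numbers are taken from the partial-SS table for `anova y twin twin#woman`, where the partial SS for twin is the increase in the residual SS when t1 and t2 are dropped.

$$
F = \frac{(\mathrm{RSS}_{\text{reduced}} - \mathrm{RSS}_{\text{full}})/q}{\mathrm{RSS}_{\text{full}}/\mathrm{df}_{\text{resid}}}
  = \frac{0.120036739/2}{5.04305934/63}
  = \frac{0.06001837}{0.080048561}
  \approx 0.75
$$

This matches the F of 0.75 reported on the twin line of that table.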

The following explains sequential SS:

    . anova y twin twin#woman, seq

    Number of obs = 69        R-squared     = 0.1273
    Root MSE      = .282929   Adj R-squared = 0.0580

| Source | Seq. SS | df | MS | F | Prob > F |
|---|---|---|---|---|---|
| Model | .73544635 | 5 | .14708927 | 1.84 | 0.1183 |
| twin | .114392887 | 2 | .057196444 | 0.71 | 0.4933 |
| twin#woman | .621053463 | 3 | .207017821 | 2.59 | 0.0609 |
| Residual | 5.04305934 | 63 | .080048561 | | |
| Total | 5.77850569 | 68 | .084978025 | | |

    . anova y twin

| Source | Partial SS | df | MS | F | Prob > F |
|---|---|---|---|---|---|
| Model | .114392887 | 2 | .057196444 | 0.67 | 0.5169 |
| twin | .114392887 | 2 | .057196444 | 0.67 | 0.5169 |
| Residual | 5.6641128 | 66 | .085819891 | | |
| Total | 5.77850569 | 68 | .084978025 | | |

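The twin SS are the same in the two preceding anovas (.114392887). The tests differ only in the denominator of the F: the first uses the residual mean square from the full model (.080048561 on 63 df), and the second uses the residual mean square from the twin-only model (.085819891 on 66 df). I (and my profs) prefer the second for testing “main effects”.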
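The two F statistics for twin can be reproduced by hand from the tables above; the subscripts are labels I am adding for this illustration.

$$
F_{\text{seq}} = \frac{0.114392887/2}{5.04305934/63} = \frac{0.057196444}{0.080048561} \approx 0.71,
\qquad
F_{\text{twin only}} = \frac{0.114392887/2}{5.6641128/66} = \frac{0.057196444}{0.085819891} \approx 0.67
$$

The two residual SS differ by 5.6641128 - 5.04305934 = 0.62105346, which is the twin#woman SS: in the twin-only model that variation is left in the residual.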

Clearly, I take a model-building approach to anova and think in terms of the equivalent regression.

You can type regress after running anova to view an equivalent regression.
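As a minimal sketch of that workflow, the example below runs on made-up data, because the 69-observation dataset used above is not shown. The `expand` factor, the seed, and the simulated outcome `y` are assumptions for illustration only, so the numbers it prints will not match the tables in this FAQ.

```stata
* Toy structure: three twin pairs, two women per pair
clear
input woman twin
1 1
2 1
3 2
4 2
5 3
6 3
end

expand 10                 // ten hypothetical measurements per woman
set seed 12345
generate y = rnormal()    // made-up outcome, for illustration only

anova y twin twin#woman   // fit the ANOVA model
regress                   // redisplay the fit as a regression table
```

Typing `regress` without arguments here refits nothing; it redisplays the model that anova just estimated, in regression form.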