Title: The anova command and collinearity

Author: William Sribney, StataCorp

Date: March 1997; updated July 2011; minor revisions April 2015

Here is an example that illustrates what happens.

    . input woman twin

             woman       twin
      1. 1 1
      2. 2 1
      3. 3 2
      4. 4 2
      5. 5 3
      6. 6 3
      7. end

    . tab woman, gen(w)

          woman |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |          1       16.67       16.67
              2 |          1       16.67       33.33
              3 |          1       16.67       50.00
              4 |          1       16.67       66.67
              5 |          1       16.67       83.33
              6 |          1       16.67      100.00
    ------------+-----------------------------------
          Total |          6      100.00

           twin |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |          2       33.33       33.33
              2 |          2       33.33       66.67
              3 |          2       33.33      100.00
    ------------+-----------------------------------
          Total |          6      100.00

       woman  w1  w2  w3  w4  w5  w6  twin  t1  t2  t3  t1w1  t2w3  t3w5
    1.     1   1   0   0   0   0   0     1   1   0   0     1     0     0
    2.     2   0   1   0   0   0   0     1   1   0   0    -1     0     0
    3.     3   0   0   1   0   0   0     2   0   1   0     0     1     0
    4.     4   0   0   0   1   0   0     2   0   1   0     0    -1     0
    5.     5   0   0   0   0   1   0     3   0   0   1     0     0     1
    6.     6   0   0   0   0   0   1     3   0   0   1     0     0    -1

         Source |  Partial SS    df          MS        F     Prob>F
    ------------+--------------------------------------------------
          Model |   .15054776     5   .03010955     0.32     0.8964
                |
          woman |   .15054776     5   .03010955     0.32     0.8964
           twin |           0     0
                |
       Residual |   5.8464905    63   .09280144
    ------------+--------------------------------------------------
          Total |   5.9970382    68   .08819174

          Source |       SS       df       MS              Number of obs =      69
    -------------+------------------------------           F(5, 63)      =    0.32
           Model |  .150547762     5  .030109552           Prob > F      =  0.8964
        Residual |  5.84649045    63  .092801436           R-squared     =  0.0251
    -------------+------------------------------           Adj R-squared = -0.0523
           Total |  5.99703821    68  .088191738           Root MSE      =  .30463

    ------------------------------------------------------------------------------
               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
              w1 |   .0516343   .1271611     0.41   0.686    -.2024769    .3057455
              w2 |          0  (omitted)
              w3 |          0  (omitted)
              w4 |   .0359635   .1298961     0.28   0.783    -.2236131      .29554
              w5 |   .0800831    .124366     0.64   0.522    -.1684426    .3286087
              t1 |          0  (omitted)
              t2 |  -.0798703   .1298961    -0.61   0.541    -.3394469    .1797063
              t3 |  -.0642359   .1271611    -0.51   0.615    -.3183471    .1898752
           _cons |   .5206881   .0918504     5.67   0.000     .3371397    .7042364
    ------------------------------------------------------------------------------

The regress model is obviously collinear, but so was the anova model. The anova command keeps terms from left to right. Hence, it “omitted” the twin effect (i.e., all the twin dummies).

    . anova y twin woman

               Number of obs =      69     R-squared     =  0.0251
               Root MSE      = .304633     Adj R-squared = -0.0523

         Source |  Partial SS    df          MS        F     Prob>F
    ------------+--------------------------------------------------
          Model |   .15054776     5   .03010955     0.32     0.8964
                |
           twin |   .09122261     2   .04561131     0.49     0.6140
          woman |   .06089443     3   .02029814     0.22     0.8831
                |
       Residual |   5.8464905    63   .09280144
    ------------+--------------------------------------------------
          Total |   5.9970382    68   .08819174

Again, anova keeps terms from left to right; here twin entered first, so anova kept only three of the six woman dummies.
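One way to see why the df split as 5 and 0 in one ordering and as 2 and 3 in the other is to count how much each term raises the rank of the design matrix as terms enter from left to right. The following is a minimal Python sketch of that bookkeeping on the six-row woman/twin layout from the input above. It is only an illustration, not a description of how anova is implemented, and the helper names (`rank`, `dummies`, `hstack`, `sequential_df`) are made up for this example.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix (list of rows) by exact Gauss-Jordan elimination."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    ncols = len(m[0]) if m else 0
    for c in range(ncols):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue  # column c adds nothing new: it is collinear
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

# (woman, twin) pairs, one row per woman, as in the input block
layout = [(1, 1), (2, 1), (3, 2), (4, 2), (5, 3), (6, 3)]

def dummies(values, levels):
    """One indicator column per level."""
    return [[1 if v == k else 0 for k in levels] for v in values]

def hstack(*blocks):
    """Join column blocks side by side, row by row."""
    return [sum((b[i] for b in blocks), []) for i in range(len(blocks[0]))]

CONS = [[1] for _ in layout]
WOMAN = dummies([w for w, _ in layout], range(1, 7))
TWIN = dummies([t for _, t in layout], range(1, 4))

def sequential_df(terms):
    """Degrees of freedom each term adds when terms enter left to right."""
    design, used, out = CONS, rank(CONS), {}
    for name, block in terms:
        design = hstack(design, block)
        new = rank(design)
        out[name] = new - used
        used = new
    return out

print(sequential_df([("woman", WOMAN), ("twin", TWIN)]))  # {'woman': 5, 'twin': 0}
print(sequential_df([("twin", TWIN), ("woman", WOMAN)]))  # {'twin': 2, 'woman': 3}
```

The counts agree with the df columns of the two anova tables: whichever term comes first takes all the df it can, and the later term gets only what is left.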

    . anova y twin twin#woman

               Number of obs =      69     R-squared     =  0.0251
               Root MSE      = .304633     Adj R-squared = -0.0523

         Source |  Partial SS    df          MS        F     Prob>F
    ------------+--------------------------------------------------
          Model |   .15054776     5   .03010955     0.32     0.8964
                |
           twin |    .0872024     2    .0436012     0.47     0.6273
     twin#woman |   .06089443     3   .02029814     0.22     0.8831
                |
       Residual |   5.8464905    63   .09280144
    ------------+--------------------------------------------------
          Total |   5.9970382    68   .08819174

Below, we do the equivalent regression.

    . regress y t1 t2 t1w1 t2w3 t3w5

          Source |       SS       df       MS              Number of obs =      69
    -------------+------------------------------           F(5, 63)      =    0.32
           Model |  .150547762     5  .030109552           Prob > F      =  0.8964
        Residual |  5.84649045    63  .092801436           R-squared     =  0.0251
    -------------+------------------------------           Adj R-squared = -0.0523
           Total |  5.99703821    68  .088191738           Root MSE      =  .30463

    ------------------------------------------------------------------------------
               y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
              t1 |   .0500115   .0889338     0.56   0.576    -.1277084    .2277315
              t2 |  -.0376941   .0899165    -0.42   0.676    -.2173779    .1419896
            t1w1 |   .0258171   .0635806     0.41   0.686    -.1012385    .1528727
            t2w3 |  -.0179817    .064948    -0.28   0.783      -.14777    .1118066
            t3w5 |   .0400415    .062183     0.64   0.522    -.0842213    .1643044
           _cons |   .4964937    .062183     7.98   0.000     .3722309    .6207565
    ------------------------------------------------------------------------------

I made the interactions orthogonal, which is essentially what anova does.
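To make "orthogonal" concrete, here is a small Python sketch on the six-row layout. The listed values of t1w1, t2w3, and t3w5 are consistent with the differences w1 - w2, w3 - w4, and w5 - w6; the commands that created them are not shown, so that construction is an assumption, and the helper names are invented for the example. Note too that the zero dot products hold for this one-row-per-woman layout; with unequal numbers of observations per woman they need not be exactly zero.

```python
# (woman, twin) pairs, one row per woman
layout = [(1, 1), (2, 1), (3, 2), (4, 2), (5, 3), (6, 3)]
women = [w for w, _ in layout]
twins = [t for _, t in layout]

def indicator(values, level):
    return [1 if v == level else 0 for v in values]

def minus(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

w = {k: indicator(women, k) for k in range(1, 7)}
t = {k: indicator(twins, k) for k in range(1, 4)}

# Assumed construction: within-twin contrasts between the two women
contrasts = {
    "t1w1": minus(w[1], w[2]),
    "t2w3": minus(w[3], w[4]),
    "t3w5": minus(w[5], w[6]),
}
others = {"_cons": [1] * 6, "t1": t[1], "t2": t[2], "t3": t[3]}

# Each contrast against the constant and every twin dummy
dots = {(c, o): dot(cv, ov)
        for c, cv in contrasts.items() for o, ov in others.items()}
# Each contrast against the other contrasts
names = list(contrasts)
pairwise = [dot(contrasts[a], contrasts[b])
            for i, a in enumerate(names) for b in names[i + 1:]]

print(contrasts["t1w1"])                                  # [1, -1, 0, 0, 0, 0]
print(all(v == 0 for v in list(dots.values()) + pairwise))  # True
```

Because each contrast compares the two women within one twin pair, it carries no information about differences between twin pairs, which is why the twin effect and the woman-within-twin effect can be separated.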

    . test t1w1 t2w3 t3w5

     ( 1)  t1w1 = 0
     ( 2)  t2w3 = 0
     ( 3)  t3w5 = 0

           F(  3,    63) =    0.22
                Prob > F =    0.8831

If the Wald tests above are unfamiliar, note that the anova partial SS and their tests are equivalent to them. I call them “added-last” tests because each term is tested as if it were the last one added to the model.

The test of **t1 = t2 = 0** is a test of

**y = t1w1 t2w3 t3w5 t1 t2**

vs.

**y = t1w1 t2w3 t3w5**
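Written as a formula, this comparison of a full and a reduced model is the usual extra-sum-of-squares F test. The worked numbers below are taken from the `anova y twin twin#woman` table, on the assumption that the twin partial SS reported there (.0872024) is the drop in residual SS between the two models; the symbols RSS and q (the number of restrictions, here 2) are my notation, not Stata's.

```latex
\[
F \;=\; \frac{(\mathrm{RSS}_{\text{reduced}} - \mathrm{RSS}_{\text{full}})/q}
             {\mathrm{RSS}_{\text{full}}/\mathrm{df}_{\text{resid}}}
  \;=\; \frac{0.0872024/2}{5.8464905/63}
  \;=\; \frac{0.0436012}{0.0928014}
  \;\approx\; 0.47
\]
```

This matches the F of 0.47 on (2, 63) df shown for twin in that table.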

The following explains sequential SS:

    . anova y twin twin#woman, seq

               Number of obs =      69     R-squared     =  0.0251
               Root MSE      = .304633     Adj R-squared = -0.0523

         Source |     Seq. SS    df          MS        F     Prob>F
    ------------+--------------------------------------------------
          Model |   .15054776     5   .03010955     0.32     0.8964
                |
           twin |   .08965333     2   .04482667     0.48     0.6192
     twin#woman |   .06089443     3   .02029814     0.22     0.8831
                |
       Residual |   5.8464905    63   .09280144
    ------------+--------------------------------------------------
          Total |   5.9970382    68   .08819174

         Source |  Partial SS    df          MS        F     Prob>F
    ------------+--------------------------------------------------
          Model |   .08965333     2   .04482667     0.50     0.6083
                |
           twin |   .08965333     2   .04482667     0.50     0.6083
                |
       Residual |   5.9073849    66   .08950583
    ------------+--------------------------------------------------
          Total |   5.9970382    68   .08819174

The twin SS are the same in the two preceding anovas. The difference in the tests is in the denominator of the F: the residual MS is .0928 on 63 df in the first and .0895 on 66 df in the second, so the F statistics are 0.48 and 0.50. I (and my profs) prefer the second for testing “main effects”.
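The arithmetic behind the two twin tests can be checked by hand. This short Python sketch simply recomputes the two F statistics from the SS and df printed in the tables above; the variable names are mine, and I am assuming the second table comes from the model containing twin alone, because its command is not shown.

```python
# Twin sum of squares and df, identical in both tables
twin_ss, twin_df = 0.08965333, 2
twin_ms = twin_ss / twin_df

# Residual mean squares: the only thing that differs between the two tests
resid_ms_full = 5.8464905 / 63       # model with twin and twin#woman
resid_ms_twin_only = 5.9073849 / 66  # model assumed to contain twin alone

f_full = twin_ms / resid_ms_full
f_twin_only = twin_ms / resid_ms_twin_only

print(f"{f_full:.2f} {f_twin_only:.2f}")  # 0.48 0.50
```

The numerator is the same mean square in both cases; only the residual that the woman-within-twin term has or has not absorbed changes the result.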

Clearly, I take a model-building approach to anova and think in terms of the equivalent regression.

You can type regress after running anova to view an equivalent regression.