
## How does the anova command handle collinearity?

Title: The anova command and collinearity

Author: William Sribney, StataCorp

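Here is an example that illustrates what happens. The data contain six women in three twin pairs, so woman is nested within twin: the indicator for each twin pair is the sum of the indicators for its two women.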

. input woman twin

woman       twin
1.       1     1
2.       2     1
3.       3     2
4.       4     2
5.       5     3
6.       6     3
7. end

. tab woman, gen(w)

woman        Freq.     Percent        Cum.

1            1       16.67       16.67
2            1       16.67       33.33
3            1       16.67       50.00
4            1       16.67       66.67
5            1       16.67       83.33
6            1       16.67      100.00

Total            6      100.00

. tab twin, gen(t)

twin        Freq.     Percent        Cum.

1            2       33.33       33.33
2            2       33.33       66.67
3            2       33.33      100.00

Total            6      100.00

. gen t1w1 = t1*w1 - t1*w2

. gen t2w3 = t2*w3 - t2*w4

. gen t3w5 = t3*w5 - t3*w6

. list w* t*, nodisplay sep(0)

     woman   w1   w2   w3   w4   w5   w6   twin   t1   t2   t3   t1w1   t2w3   t3w5

1.       1    1    0    0    0    0    0      1    1    0    0      1      0      0
2.       2    0    1    0    0    0    0      1    1    0    0     -1      0      0
3.       3    0    0    1    0    0    0      2    0    1    0      0      1      0
4.       4    0    0    0    1    0    0      2    0    1    0      0     -1      0
5.       5    0    0    0    0    1    0      3    0    0    1      0      0      1
6.       6    0    0    0    0    0    1      3    0    0    1      0      0     -1

. set seed 123

. gen x = 12 - int(2*runiform())

. expand x
(63 observations created)

. gen y = runiform()

. anova y woman twin

Number of obs =           69   R-squared     =  0.0251
Root MSE      =      .304633   Adj R-squared = -0.0523

Source     Partial SS         df         MS        F    Prob>F

Model      .15054776          5   .03010955      0.32  0.8964

woman      .15054776          5   .03010955      0.32  0.8964
twin              0          0

Residual      5.8464905         63   .09280144

Total      5.9970382         68   .08819174

. regress y w1-w5 t1-t3
note: w2 omitted because of collinearity
note: w3 omitted because of collinearity
note: t1 omitted because of collinearity

Source         SS           df       MS   Number of obs  =         69
F(5, 63)       =       0.32
Model    .150547762         5  .030109552   Prob > F       =     0.8964
Residual    5.84649045        63  .092801436   R-squared      =     0.0251
Total    5.99703821        68  .088191738   Root MSE       =     .30463

y        Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

w1     .0516343   .1271611     0.41   0.686    -.2024769    .3057455
w2            0  (omitted)
w3            0  (omitted)
w4     .0359635   .1298961     0.28   0.783    -.2236131      .29554
w5     .0800831    .124366     0.64   0.522    -.1684426    .3286087
t1            0  (omitted)
t2    -.0798703   .1298961    -0.61   0.541    -.3394469    .1797063
t3    -.0642359   .1271611    -0.51   0.615    -.3183471    .1898752
_cons     .5206881   .0918504     5.67   0.000     .3371397    .7042364



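The regress model is obviously collinear, but so is the anova model. When terms are collinear, the anova command keeps them from left to right. Here woman comes first and uses all 5 model degrees of freedom; hence, anova “omitted” the twin effect (i.e., all the twin dummies), which appears with 0 df.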
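The collinearity can be confirmed directly from the indicator variables. The following is a minimal sketch that assumes w1-w6 and t1-t3 from the tabulate commands above are still in memory; assert exits with an error if any observation violates the stated relationship.

```stata
* Each twin indicator is the sum of the indicators for the two women in that pair,
* so the twin dummies carry no information beyond the woman dummies.
assert t1 == w1 + w2
assert t2 == w3 + w4
assert t3 == w5 + w6
```

If all three commands run silently, every twin dummy is an exact linear combination of woman dummies, which is why twin gets 0 degrees of freedom when it follows woman.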

. anova y twin woman

Number of obs =          69   R-squared     =  0.0251
Root MSE      =     .304633   Adj R-squared = -0.0523

Source    Partial SS         df         MS        F    Prob>F

Model     .15054776          5   .03010955      0.32  0.8964

twin     .09122261          2   .04561131      0.49  0.6140
woman     .06089443          3   .02029814      0.22  0.8831

Residual     5.8464905         63   .09280144

Total     5.9970382         68   .08819174



Again, anova keeps terms from left to right; here twin is kept in full (2 df), and only three of the six woman dummies are kept (3 df).

. anova y twin twin#woman

Number of obs =         69   R-squared     =  0.0251
Root MSE      =    .304633   Adj R-squared = -0.0523

Source    Partial SS        df         MS        F    Prob>F

Model     .15054776         5   .03010955      0.32  0.8964

twin      .0872024         2    .0436012      0.47  0.6273
twin#woman     .06089443         3   .02029814      0.22  0.8831

Residual     5.8464905        63   .09280144

Total     5.9970382        68   .08819174



Below, we do the equivalent regression.

. regress y t1 t2 t1w1 t2w3 t3w5

Source         SS           df       MS   Number of obs  =         69
F(5, 63)       =       0.32
Model    .150547762         5  .030109552   Prob > F       =     0.8964
Residual    5.84649045        63  .092801436   R-squared      =     0.0251
Total    5.99703821        68  .088191738   Root MSE       =     .30463

y        Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

t1     .0500115   .0889338     0.56   0.576    -.1277084    .2277315
t2    -.0376941   .0899165    -0.42   0.676    -.2173779    .1419896
t1w1     .0258171   .0635806     0.41   0.686    -.1012385    .1528727
t2w3    -.0179817    .064948    -0.28   0.783      -.14777    .1118066
t3w5     .0400415    .062183     0.64   0.522    -.0842213    .1643044
_cons     .4964937    .062183     7.98   0.000     .3722309    .6207565

. test t1 t2

( 1)  t1 = 0
( 2)  t2 = 0

F(  2,    63) =    0.47
Prob > F =    0.6273


I made the interactions orthogonal by coding each one (t1w1, t2w3, t3w5) as a +1/-1 contrast between the two women within a twin pair, which is essentially what anova does.

. test t1w1 t2w3 t3w5

( 1)  t1w1 = 0
( 2)  t2w3 = 0
( 3)  t3w5 = 0

F(  3,    63) =    0.22
Prob > F =    0.8831


The Wald tests above are equivalent to the tests that anova computes from the partial SS. I call them “added-last” tests: each term is tested as if it were added to the model after all the other terms.

The test of t1 = t2 = 0 is a test of

y = t1w1 t2w3 t3w5 t1 t2
vs.
y = t1w1 t2w3 t3w5
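One way to see this comparison is to fit both regressions and form the F statistic from their residual sums of squares. This is a sketch, assuming the variables generated above are in memory; the scalar names are illustrative, while e(rss) and e(df_r) are results stored by regress.

```stata
* Full model: interaction contrasts plus the twin dummies
quietly regress y t1w1 t2w3 t3w5 t1 t2
scalar rss_full = e(rss)
scalar df_full  = e(df_r)

* Reduced model: drop the twin dummies
quietly regress y t1w1 t2w3 t3w5
scalar rss_red = e(rss)
scalar df_red  = e(df_r)

* F = (increase in residual SS per dropped parameter) / (full-model residual MS)
scalar df_num = df_red - df_full
scalar Fstat  = ((rss_red - rss_full)/df_num) / (rss_full/df_full)
display "F(" df_num ", " df_full ") = " %5.2f Fstat
display "Prob > F = " %6.4f Ftail(df_num, df_full, Fstat)
```

The result should agree with the output of test t1 t2 shown earlier.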

The following illustrates sequential SS, in which each term is adjusted only for the terms listed to its left:

. anova y twin twin#woman, seq

Number of obs =          69    R-squared     =  0.0251
Root MSE      =     .304633    Adj R-squared = -0.0523

Source       Seq. SS         df         MS        F    Prob>F

Model     .15054776          5   .03010955      0.32  0.8964

twin     .08965333          2   .04482667      0.48  0.6192
twin#woman     .06089443          3   .02029814      0.22  0.8831

Residual     5.8464905         63   .09280144

Total     5.9970382         68   .08819174

. anova y twin

Number of obs =           69    R-squared     =  0.0149
Root MSE      =      .299175    Adj R-squared = -0.0149

Source     Partial SS         df         MS        F    Prob>F

Model      .08965333          2   .04482667      0.50  0.6083

twin      .08965333          2   .04482667      0.50  0.6083

Residual      5.9073849         66   .08950583

Total      5.9970382         68   .08819174



The twin SS are the same in the two preceding anovas. The tests differ only in the denominator of the F statistic, because the residuals differ: the residual mean square is .0928 on 63 df in the first and .0895 on 66 df in the second. I (and my profs) prefer the second for testing “main effects”.
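The two F statistics can be reproduced by hand from the numbers printed in the two tables. This is a short sketch with illustrative scalar names; Ftail() returns the upper-tail probability of the F distribution.

```stata
* Numerator: twin mean square (identical in both tables)
scalar ms_twin = .08965333/2

* Denominator from  anova y twin twin#woman, seq : residual MS on 63 df
scalar F_seq = ms_twin/(5.8464905/63)
display %5.2f F_seq "   " %6.4f Ftail(2, 63, F_seq)

* Denominator from  anova y twin : residual MS on 66 df
scalar F_twin = ms_twin/(5.9073849/66)
display %5.2f F_twin "   " %6.4f Ftail(2, 66, F_twin)
```

Each display line should match the F and Prob>F entries for twin in the corresponding table.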

Clearly, I take a model-building approach to anova and think in terms of the equivalent regression.

You can type regress after running anova to view an equivalent regression.
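For example, with the data above still in memory (a sketch; the model shown is just one of those fitted earlier):

```stata
* Fit the ANOVA, then replay the same fit as a regression table
anova y twin twin#woman
regress
```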