
How does the anova command handle collinearity?

Title   The anova command and collinearity
Author  William Sribney, StataCorp

Here is an example that illustrates what happens.

. input woman twin

         woman       twin
  1.       1     1  
  2.       2     1  
  3.       3     2  
  4.       4     2  
  5.       5     3  
  6.       6     3  
  7. end

. tab woman, gen(w)

      woman |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          1       16.67       16.67
          2 |          1       16.67       33.33
          3 |          1       16.67       50.00
          4 |          1       16.67       66.67
          5 |          1       16.67       83.33
          6 |          1       16.67      100.00
------------+-----------------------------------
      Total |          6      100.00
. tab twin, gen(t)

       twin |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2       33.33       33.33
          2 |          2       33.33       66.67
          3 |          2       33.33      100.00
------------+-----------------------------------
      Total |          6      100.00
. gen t1w1 = t1*w1 - t1*w2

. gen t2w3 = t2*w3 - t2*w4

. gen t3w5 = t3*w5 - t3*w6

. list w* t*, nodisplay sep(0)
       woman   w1   w2   w3   w4   w5   w6   twin   t1   t2   t3   t1w1   t2w3   t3w5
  1.       1    1    0    0    0    0    0      1    1    0    0      1      0      0
  2.       2    0    1    0    0    0    0      1    1    0    0     -1      0      0
  3.       3    0    0    1    0    0    0      2    0    1    0      0      1      0
  4.       4    0    0    0    1    0    0      2    0    1    0      0     -1      0
  5.       5    0    0    0    0    1    0      3    0    0    1      0      0      1
  6.       6    0    0    0    0    0    1      3    0    0    1      0      0     -1
. set seed 123

. gen x = 12 - int(2*runiform())

. expand x
(63 observations created)

. gen y = runiform()

. anova y woman twin

                           Number of obs =          69   R-squared     =  0.0251
                           Root MSE      =     .304633   Adj R-squared = -0.0523

                  Source |  Partial SS         df         MS        F     Prob>F
              -----------+-----------------------------------------------------
                   Model |   .15054776          5   .03010955      0.32    0.8964
                         |
                   woman |   .15054776          5   .03010955      0.32    0.8964
                    twin |           0          0
                         |
                Residual |   5.8464905         63   .09280144
              -----------+-----------------------------------------------------
                   Total |   5.9970382         68   .08819174
. regress y w1-w5 t1-t3
note: w2 omitted because of collinearity
note: w3 omitted because of collinearity
note: t1 omitted because of collinearity
      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(5, 63)        =      0.32
       Model |  .150547762         5  .030109552   Prob > F        =    0.8964
    Residual |  5.84649045        63  .092801436   R-squared       =    0.0251
-------------+----------------------------------   Adj R-squared   =   -0.0523
       Total |  5.99703821        68  .088191738   Root MSE        =    .30463

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          w1 |   .0516343   .1271611     0.41   0.686    -.2024769    .3057455
          w2 |          0  (omitted)
          w3 |          0  (omitted)
          w4 |   .0359635   .1298961     0.28   0.783    -.2236131      .29554
          w5 |   .0800831    .124366     0.64   0.522    -.1684426    .3286087
          t1 |          0  (omitted)
          t2 |  -.0798703   .1298961    -0.61   0.541    -.3394469    .1797063
          t3 |  -.0642359   .1271611    -0.51   0.615    -.3183471    .1898752
       _cons |   .5206881   .0918504     5.67   0.000     .3371397    .7042364
------------------------------------------------------------------------------

The regress model is obviously collinear, but so was the anova model. The anova command keeps terms from left to right, dropping whatever later terms are collinear with the terms it has already kept. Hence, it “omitted” the twin effect (i.e., all the twin dummies) because twin was specified after woman.
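
If you want to see exactly which columns get dropped, one option (not part of the original example) is to fit the same model with regress and factor-variable notation; because the twin indicators come second, they should be the ones flagged as collinear:

. regress y i.woman i.twin    // expect notes that the twin indicators are omitted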

. anova y twin woman

                           Number of obs =          69   R-squared     =  0.0251
                           Root MSE      =     .304633   Adj R-squared = -0.0523

                  Source |  Partial SS         df         MS        F     Prob>F
              -----------+-----------------------------------------------------
                   Model |   .15054776          5   .03010955      0.32    0.8964
                         |
                    twin |   .09122261          2   .04561131      0.49    0.6140
                   woman |   .06089443          3   .02029814      0.22    0.8831
                         |
                Residual |   5.8464905         63   .09280144
              -----------+-----------------------------------------------------
                   Total |   5.9970382         68   .08819174

Again, anova keeps terms from left to right; here it kept only three of the six woman dummies.

. anova y twin twin#woman

                           Number of obs =         69   R-squared     =  0.0251
                           Root MSE      =    .304633   Adj R-squared = -0.0523

                  Source |  Partial SS         df         MS        F     Prob>F
              -----------+-----------------------------------------------------
                   Model |   .15054776          5   .03010955      0.32    0.8964
                         |
                    twin |    .0872024          2    .0436012      0.47    0.6273
              twin#woman |   .06089443          3   .02029814      0.22    0.8831
                         |
                Residual |   5.8464905         63   .09280144
              -----------+-----------------------------------------------------
                   Total |   5.9970382         68   .08819174

Below, we do the equivalent regression.

. regress y t1 t2 t1w1 t2w3 t3w5

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(5, 63)        =      0.32
       Model |  .150547762         5  .030109552   Prob > F        =    0.8964
    Residual |  5.84649045        63  .092801436   R-squared       =    0.0251
-------------+----------------------------------   Adj R-squared   =   -0.0523
       Total |  5.99703821        68  .088191738   Root MSE        =    .30463

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          t1 |   .0500115   .0889338     0.56   0.576    -.1277084    .2277315
          t2 |  -.0376941   .0899165    -0.42   0.676    -.2173779    .1419896
        t1w1 |   .0258171   .0635806     0.41   0.686    -.1012385    .1528727
        t2w3 |  -.0179817    .064948    -0.28   0.783      -.14777    .1118066
        t3w5 |   .0400415    .062183     0.64   0.522    -.0842213    .1643044
       _cons |   .4964937    .062183     7.98   0.000     .3722309    .6207565
------------------------------------------------------------------------------

. test t1 t2

 ( 1)  t1 = 0
 ( 2)  t2 = 0

       F(  2,    63) =    0.47
            Prob > F =    0.6273

I made the interactions orthogonal, which is essentially what anova does.

. test t1w1 t2w3 t3w5

 ( 1)  t1w1 = 0
 ( 2)  t2w3 = 0
 ( 3)  t3w5 = 0

       F(  3,    63) =    0.22
            Prob > F =    0.8831

Hopefully, you understand the above Wald tests. If not, note that the partial SS in the anova output and their F tests are equivalent to these Wald tests. I call them “added-last” tests.

The test of t1 = t2 = 0 is a test of

y = t1w1 t2w3 t3w5 t1 t2
vs.
y = t1w1 t2w3 t3w5
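
If it helps, this added-last test can also be reproduced by hand from the residual sums of squares of the full and reduced regressions. Below is a minimal sketch using the e(rss) and e(df_r) results that regress stores (these commands are illustrative and not part of the original output):

. quietly regress y t1w1 t2w3 t3w5 t1 t2

. scalar rss_full = e(rss)

. scalar df_resid = e(df_r)

. quietly regress y t1w1 t2w3 t3w5

. scalar rss_red = e(rss)

. display "added-last F(2,63) = " ((rss_red - rss_full)/2) / (rss_full/df_resid)

The displayed value should match the F(2, 63) = 0.47 reported by test t1 t2 above.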

The following explains sequential SS:

. anova y twin twin#woman, seq

                           Number of obs =          69    R-squared     =  0.0251
                           Root MSE      =     .304633    Adj R-squared = -0.0523

                  Source |     Seq. SS         df         MS        F     Prob>F
              -----------+-----------------------------------------------------
                   Model |   .15054776          5   .03010955      0.32    0.8964
                         |
                    twin |   .08965333          2   .04482667      0.48    0.6192
              twin#woman |   .06089443          3   .02029814      0.22    0.8831
                         |
                Residual |   5.8464905         63   .09280144
              -----------+-----------------------------------------------------
                   Total |   5.9970382         68   .08819174
. anova y twin

                           Number of obs =          69   R-squared     =  0.0149
                           Root MSE      =     .299175   Adj R-squared = -0.0149

                  Source |  Partial SS         df         MS        F     Prob>F
              -----------+-----------------------------------------------------
                   Model |   .08965333          2   .04482667      0.50    0.6083
                         |
                    twin |   .08965333          2   .04482667      0.50    0.6083
                         |
                Residual |   5.9073849         66   .08950583
              -----------+-----------------------------------------------------
                   Total |   5.9970382         68   .08819174

The twin SS is the same in the two preceding anovas. The difference between the tests lies in the denominator of the F statistic: the residual mean squares are obviously different. I (and my professors) prefer the second for testing “main effects”.
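
To make the difference concrete, the two F statistics for twin can be recomputed from the mean squares shown above (the numbers are simply copied from the output; this check is not part of the original FAQ):

. display "F using the residual from the full model:      " .04482667/.09280144

. display "F using the residual from the twin-only model: " .04482667/.08950583

These should reproduce the 0.48 and 0.50 reported in the two tables.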

Clearly, I take a model-building approach to anova and think in terms of the equivalent regression.

You can type regress after running anova to view an equivalent regression.
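
For example (a sketch, not part of the original FAQ; regress typed with no arguments redisplays the current estimates as a regression table):

. anova y twin twin#woman

. regress    // shows the coefficients, standard errors, and tests behind the anova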