Stata: Data Analysis and Statistical Software

How does the anova command handle collinearity?

Title    The anova command and collinearity
Author   William Sribney, StataCorp
Date     March 1997; updated July 2011

Here is an example that illustrates what happens. The data contain six women in three twin pairs, so woman is nested within twin.

. input woman twin

         woman       twin
  1.       1     1  
  2.       2     1  
  3.       3     2  
  4.       4     2  
  5.       5     3  
  6.       6     3  
  7. end

. tab woman, gen(w)

      woman |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          1       16.67       16.67
          2 |          1       16.67       33.33
          3 |          1       16.67       50.00
          4 |          1       16.67       66.67
          5 |          1       16.67       83.33
          6 |          1       16.67      100.00
------------+-----------------------------------
      Total |          6      100.00

. tab twin, gen(t)

       twin |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2       33.33       33.33
          2 |          2       33.33       66.67
          3 |          2       33.33      100.00
------------+-----------------------------------
      Total |          6      100.00

. gen t1w1 = t1*w1 - t1*w2

. gen t2w3 = t2*w3 - t2*w4

. gen t3w5 = t3*w5 - t3*w6

. list w* t*, nodisplay sep(0)

     +--------------------------------------------------------------------------------+
     | woman   w1   w2   w3   w4   w5   w6   twin   t1   t2   t3   t1w1   t2w3   t3w5 |
     |--------------------------------------------------------------------------------|
  1. |     1    1    0    0    0    0    0      1    1    0    0      1      0      0 |
  2. |     2    0    1    0    0    0    0      1    1    0    0     -1      0      0 |
  3. |     3    0    0    1    0    0    0      2    0    1    0      0      1      0 |
  4. |     4    0    0    0    1    0    0      2    0    1    0      0     -1      0 |
  5. |     5    0    0    0    0    1    0      3    0    0    1      0      0      1 |
  6. |     6    0    0    0    0    0    1      3    0    0    1      0      0     -1 |
     +--------------------------------------------------------------------------------+

. set seed 123

. gen x = 12 - int(2*runiform())

. expand x
(63 observations created)

. gen y = runiform()

. anova y woman twin

                           Number of obs =      69     R-squared     =  0.1273
                           Root MSE      = .282929     Adj R-squared =  0.0580

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |   .73544635     5   .14708927       1.84     0.1183
                         |
                   woman |   .73544635     5   .14708927       1.84     0.1183
                    twin |           0     0
                         |
                Residual |  5.04305934    63  .080048561   
              -----------+----------------------------------------------------
                   Total |  5.77850569    68  .084978025   

. regress y w1-w5 t1-t3
note: w1 omitted because of collinearity
note: w3 omitted because of collinearity
note: t1 omitted because of collinearity

      Source |       SS       df       MS              Number of obs =      69
-------------+------------------------------           F(  5,    63) =    1.84
       Model |   .73544635     5   .14708927           Prob > F      =  0.1183
    Residual |  5.04305934    63  .080048561           R-squared     =  0.1273
-------------+------------------------------           Adj R-squared =  0.0580
       Total |  5.77850569    68  .084978025           Root MSE      =  .28293

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          w1 |          0  (omitted)
          w2 |  -.1064115    .118101    -0.90   0.371    -.3424176    .1295946
          w3 |          0  (omitted)
          w4 |  -.0591048   .1155051    -0.51   0.611    -.2899233    .1717137
          w5 |   .3119168   .1206411     2.59   0.012     .0708348    .5529989
          t1 |          0  (omitted)
          t2 |  -.1238703    .118101    -1.05   0.298    -.3598764    .1121358
          t3 |  -.2724871   .1206411    -2.26   0.027    -.5135692    -.031405
       _cons |   .5838711   .0853062     6.84   0.000     .4134004    .7543419
------------------------------------------------------------------------------

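The regress model is obviously collinear, but so is the anova model: every woman belongs to exactly one twin pair, so the woman dummies determine the twin dummies (t1 = w1 + w2, and so on). The anova command keeps terms from left to right. Hence, with woman listed first, it “omitted” the twin effect (i.e., all the twin dummies) and reports 0 degrees of freedom for twin.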
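The nesting can be checked directly before fitting anything. The commands below are only a sketch against the dataset built above; they were not part of the original session, so no output is shown.

    . * each woman should appear in exactly one twin column
    . tabulate woman twin

    . * assert stops with an error if any woman has more than one twin value
    . bysort woman (twin): assert twin[1] == twin[_N]

If the assert passes silently, twin is constant within woman, which is the collinearity that anova and regress are reacting to.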

. anova y twin woman

                           Number of obs =      69     R-squared     =  0.1273
                           Root MSE      = .282929     Adj R-squared =  0.0580

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |   .73544635     5   .14708927       1.84     0.1183
                         |
                    twin |  .425327562     2  .212663781       2.66     0.0780
                   woman |  .621053463     3  .207017821       2.59     0.0609
                         |
                Residual |  5.04305934    63  .080048561   
              -----------+----------------------------------------------------
                   Total |  5.77850569    68  .084978025

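Again, anova keeps terms from left to right. With twin listed first, twin keeps its 2 degrees of freedom, and anova kept only three out of the six woman dummies (one within each twin pair), so woman has 3 degrees of freedom.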

. anova y twin twin#woman

                           Number of obs =      69     R-squared     =  0.1273
                           Root MSE      = .282929     Adj R-squared =  0.0580

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |   .73544635     5   .14708927       1.84     0.1183
                         |
                    twin |  .120036739     2   .06001837       0.75     0.4766
              twin#woman |  .621053463     3  .207017821       2.59     0.0609
                         |
                Residual |  5.04305934    63  .080048561   
              -----------+----------------------------------------------------
                   Total |  5.77850569    68  .084978025 

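Below, we fit the equivalent regression, using the twin dummies t1 and t2 and the within-twin contrasts t1w1, t2w3, and t3w5 created earlier.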

. regress y t1 t2 t1w1 t2w3 t3w5

      Source |       SS       df       MS              Number of obs =      69
-------------+------------------------------           F(  5,    63) =    1.84
       Model |   .73544635     5   .14708927           Prob > F      =  0.1183
    Residual |  5.04305934    63  .080048561           R-squared     =  0.1273
-------------+------------------------------           Adj R-squared =  0.0580
       Total |  5.77850569    68  .084978025           Root MSE      =  .28293

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          t1 |   .0633229   .0844129     0.75   0.456    -.1053628    .2320086
          t2 |  -.0368941     .08351    -0.44   0.660    -.2037756    .1299874
        t1w1 |   .0532058   .0590505     0.90   0.371    -.0647973    .1712088
        t2w3 |   .0295524   .0577525     0.51   0.611    -.0858569    .1449617
        t3w5 |   .1559584   .0603206     2.59   0.012     .0354174    .2764995
       _cons |   .4673425   .0603206     7.75   0.000     .3468014    .5878835
------------------------------------------------------------------------------

. test t1 t2

 ( 1)  t1 = 0
 ( 2)  t2 = 0

       F(  2,    63) =    0.75
            Prob > F =    0.4766

I made the interactions orthogonal (each is coded +1 for one woman of a twin pair and -1 for the other), which is essentially what anova does.

. test t1w1 t2w3 t3w5

 ( 1)  t1w1 = 0
 ( 2)  t2w3 = 0
 ( 3)  t3w5 = 0

       F(  3,    63) =    2.59
            Prob > F =    0.0609

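The Wald tests above reproduce the anova results: their F statistics and p-values match the twin and twin#woman lines of the partial SS table. The anova partial SS and their tests are equivalent to these Wald tests. I call them “added-last” tests, because each term is tested as if it had been added to the model after all the others.

The test of t1 = t2 = 0 is a test of the full model

y = t1w1 t2w3 t3w5 t1 t2

vs. the reduced model

y = t1w1 t2w3 t3w5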
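The same comparison can be carried out by hand from the residual sums of squares of the two models. This is a minimal sketch, not part of the original session; the scalar names rss_full, dfr_full, and rss_red are made up for illustration.

    . quietly regress y t1 t2 t1w1 t2w3 t3w5
    . scalar rss_full = e(rss)
    . scalar dfr_full = e(df_r)

    . quietly regress y t1w1 t2w3 t3w5
    . scalar rss_red = e(rss)

    . * 2 is the number of restrictions being tested (t1 = 0 and t2 = 0)
    . display ((rss_red - rss_full)/2) / (rss_full/dfr_full)

    . * upper-tail probability of that F statistic
    . display Ftail(2, dfr_full, ((rss_red - rss_full)/2) / (rss_full/dfr_full))

The two displayed values should agree, up to rounding, with the F(2, 63) statistic and p-value reported by test t1 t2.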

The following illustrates sequential SS, in which each term is adjusted only for the terms listed before it:

. anova y twin twin#woman, seq

                           Number of obs =      69     R-squared     =  0.1273
                           Root MSE      = .282929     Adj R-squared =  0.0580

                  Source |    Seq. SS     df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |   .73544635     5   .14708927       1.84     0.1183
                         |
                    twin |  .114392887     2  .057196444       0.71     0.4933
              twin#woman |  .621053463     3  .207017821       2.59     0.0609
                         |
                Residual |  5.04305934    63  .080048561   
              -----------+----------------------------------------------------
                   Total |  5.77850569    68  .084978025   

. anova y twin

                           Number of obs =      69     R-squared     =  0.0198
                           Root MSE      =  .29295     Adj R-squared = -0.0099

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |  .114392887     2  .057196444       0.67     0.5169
                         |
                    twin |  .114392887     2  .057196444       0.67     0.5169
                         |
                Residual |   5.6641128    66  .085819891   
              -----------+----------------------------------------------------
                   Total |  5.77850569    68  .084978025

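The twin SS (.114392887) are the same in the two preceding anovas. The difference in the tests is in the denominator of the F: the sequential table divides by the residual mean square of the full model (.080048561 on 63 degrees of freedom), whereas anova y twin divides by the residual mean square of the twin-only model (.085819891 on 66 degrees of freedom). I (and my profs) prefer the second for testing “main effects”.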
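The two F statistics can be recomputed from the mean squares printed in the tables. This is just arithmetic on the reported values, shown as a sketch.

    . * sequential table: twin MS over the full-model residual MS
    . display .057196444/.080048561

    . * twin-only model: the same twin MS over its own residual MS
    . display .057196444/.085819891

The results should round to the 0.71 and 0.67 shown in the two tables.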

Clearly, I take a model-building approach to anova and think in terms of the equivalent regression.

You can type regress after running anova to view an equivalent regression.
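For example, the following sketch refits one of the models above and then replays it as a regression (no output is shown because these commands were not part of the original session):

    . anova y twin twin#woman

    . * replay the same fit as a coefficient table
    . regress

The coefficients may differ from those of the hand-built regression on t1, t2, t1w1, t2w3, and t3w5, because anova chooses its own coding of the terms, but the overall fit is the same.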

© Copyright 1996–2013 StataCorp LP