How does the anova command handle collinearity?
Title: The anova command and collinearity
Author: William Sribney, StataCorp
Date: March 1997; updated July 2011
Here is an example that illustrates what happens.
. input woman twin
woman twin
1. 1 1
2. 2 1
3. 3 2
4. 4 2
5. 5 3
6. 6 3
7. end
. tab woman, gen(w)
woman | Freq. Percent Cum.
------------+-----------------------------------
1 | 1 16.67 16.67
2 | 1 16.67 33.33
3 | 1 16.67 50.00
4 | 1 16.67 66.67
5 | 1 16.67 83.33
6 | 1 16.67 100.00
------------+-----------------------------------
Total | 6 100.00
. tab twin, gen(t)
twin | Freq. Percent Cum.
------------+-----------------------------------
1 | 2 33.33 33.33
2 | 2 33.33 66.67
3 | 2 33.33 100.00
------------+-----------------------------------
Total | 6 100.00
. gen t1w1 = t1*w1 - t1*w2
. gen t2w3 = t2*w3 - t2*w4
. gen t3w5 = t3*w5 - t3*w6
. list w* t*, nodisplay sep(0)
+--------------------------------------------------------------------------------+
| woman w1 w2 w3 w4 w5 w6 twin t1 t2 t3 t1w1 t2w3 t3w5 |
|--------------------------------------------------------------------------------|
1. | 1 1 0 0 0 0 0 1 1 0 0 1 0 0 |
2. | 2 0 1 0 0 0 0 1 1 0 0 -1 0 0 |
3. | 3 0 0 1 0 0 0 2 0 1 0 0 1 0 |
4. | 4 0 0 0 1 0 0 2 0 1 0 0 -1 0 |
5. | 5 0 0 0 0 1 0 3 0 0 1 0 0 1 |
6. | 6 0 0 0 0 0 1 3 0 0 1 0 0 -1 |
+--------------------------------------------------------------------------------+
. set seed 123
. gen x = 12 - int(2*runiform())
. expand x
(63 observations created)
. gen y = runiform()
. anova y woman twin
Number of obs = 69 R-squared = 0.1273
Root MSE = .282929 Adj R-squared = 0.0580
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
Model | .73544635 5 .14708927 1.84 0.1183
|
woman | .73544635 5 .14708927 1.84 0.1183
twin | 0 0
|
Residual | 5.04305934 63 .080048561
-----------+----------------------------------------------------
Total | 5.77850569 68 .084978025
. regress y w1-w5 t1-t3
note: w1 omitted because of collinearity
note: w3 omitted because of collinearity
note: t1 omitted because of collinearity
Source | SS df MS Number of obs = 69
-------------+------------------------------ F( 5, 63) = 1.84
Model | .73544635 5 .14708927 Prob > F = 0.1183
Residual | 5.04305934 63 .080048561 R-squared = 0.1273
-------------+------------------------------ Adj R-squared = 0.0580
Total | 5.77850569 68 .084978025 Root MSE = .28293
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
w1 | 0 (omitted)
w2 | -.1064115 .118101 -0.90 0.371 -.3424176 .1295946
w3 | 0 (omitted)
w4 | -.0591048 .1155051 -0.51 0.611 -.2899233 .1717137
w5 | .3119168 .1206411 2.59 0.012 .0708348 .5529989
t1 | 0 (omitted)
t2 | -.1238703 .118101 -1.05 0.298 -.3598764 .1121358
t3 | -.2724871 .1206411 -2.26 0.027 -.5135692 -.031405
_cons | .5838711 .0853062 6.84 0.000 .4134004 .7543419
------------------------------------------------------------------------------
The regress model is obviously collinear, but so is the anova model: each twin
dummy is the sum of two woman dummies (for example, t1 = w1 + w2). The anova
command keeps terms from left to right, so with woman listed first it “omitted”
the twin effect (i.e., all the twin dummies).
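One way to see the collinearity directly is to check that each twin dummy is the sum of the two woman dummies nested within it. This is only a sketch, and it assumes the indicator variables w1-w6 and t1-t3 created by the tab commands above are still in memory. _rmcoll is Stata's programmer's command for flagging collinear variables; the notes it prints may differ by version.

```stata
* Each twin indicator equals the sum of its two woman indicators,
* so twin adds nothing once woman is already in the model.
assert t1 == w1 + w2
assert t2 == w3 + w4
assert t3 == w5 + w6

* Ask Stata which of these variables it would drop as collinear.
_rmcoll w1-w6 t1-t3
```

If the assert commands finish silently, the identities hold in every observation.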
. anova y twin woman
Number of obs = 69 R-squared = 0.1273
Root MSE = .282929 Adj R-squared = 0.0580
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
Model | .73544635 5 .14708927 1.84 0.1183
|
twin | .425327562 2 .212663781 2.66 0.0780
woman | .621053463 3 .207017821 2.59 0.0609
|
Residual | 5.04305934 63 .080048561
-----------+----------------------------------------------------
Total | 5.77850569 68 .084978025
Again, anova keeps terms from left to right; here twin comes first and takes
2 degrees of freedom, so anova kept only three of the six woman dummies.
. anova y twin twin#woman
Number of obs = 69 R-squared = 0.1273
Root MSE = .282929 Adj R-squared = 0.0580
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
Model | .73544635 5 .14708927 1.84 0.1183
|
twin | .120036739 2 .06001837 0.75 0.4766
twin#woman | .621053463 3 .207017821 2.59 0.0609
|
Residual | 5.04305934 63 .080048561
-----------+----------------------------------------------------
Total | 5.77850569 68 .084978025
Below, we do the equivalent regression.
. regress y t1 t2 t1w1 t2w3 t3w5
Source | SS df MS Number of obs = 69
-------------+------------------------------ F( 5, 63) = 1.84
Model | .73544635 5 .14708927 Prob > F = 0.1183
Residual | 5.04305934 63 .080048561 R-squared = 0.1273
-------------+------------------------------ Adj R-squared = 0.0580
Total | 5.77850569 68 .084978025 Root MSE = .28293
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
t1 | .0633229 .0844129 0.75 0.456 -.1053628 .2320086
t2 | -.0368941 .08351 -0.44 0.660 -.2037756 .1299874
t1w1 | .0532058 .0590505 0.90 0.371 -.0647973 .1712088
t2w3 | .0295524 .0577525 0.51 0.611 -.0858569 .1449617
t3w5 | .1559584 .0603206 2.59 0.012 .0354174 .2764995
_cons | .4673425 .0603206 7.75 0.000 .3468014 .5878835
------------------------------------------------------------------------------
. test t1 t2
( 1) t1 = 0
( 2) t2 = 0
F( 2, 63) = 0.75
Prob > F = 0.4766
I made the interaction variables (t1w1, t2w3, and t3w5) orthogonal, which is
essentially what anova does.
. test t1w1 t2w3 t3w5
( 1) t1w1 = 0
( 2) t2w3 = 0
( 3) t3w5 = 0
F( 3, 63) = 2.59
Prob > F = 0.0609
If the Wald tests above are unfamiliar, note that they are equivalent to the
tests based on the anova partial SS. I call them “added-last” tests: each term
is tested as if it were the last one added to the model.
The test of t1 = t2 = 0 is a test of
y = t1w1 t2w3 t3w5 t1 t2
vs.
y = t1w1 t2w3 t3w5
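The model comparison above can be sketched by hand, assuming the variables from the example are still in memory. The scalar names (rss_full, df_full, rss_red, Fstat) are illustrative only and do not come from the original session. For a linear regression, this incremental F statistic should reproduce the result of test t1 t2.

```stata
* Full model: interaction contrasts plus the twin dummies.
quietly regress y t1w1 t2w3 t3w5 t1 t2
scalar rss_full = e(rss)
scalar df_full  = e(df_r)

* Reduced model: the twin dummies are removed.
quietly regress y t1w1 t2w3 t3w5
scalar rss_red = e(rss)

* F = (increase in residual SS / 2 restrictions) / full-model residual MS
scalar Fstat = ((rss_red - rss_full)/2) / (rss_full/df_full)
display "F(2, " df_full ") = " %5.2f Fstat
display "Prob > F = " %6.4f Ftail(2, df_full, Fstat)
```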
The following explains sequential SS:
. anova y twin twin#woman, seq
Number of obs = 69 R-squared = 0.1273
Root MSE = .282929 Adj R-squared = 0.0580
Source | Seq. SS df MS F Prob > F
-----------+----------------------------------------------------
Model | .73544635 5 .14708927 1.84 0.1183
|
twin | .114392887 2 .057196444 0.71 0.4933
twin#woman | .621053463 3 .207017821 2.59 0.0609
|
Residual | 5.04305934 63 .080048561
-----------+----------------------------------------------------
Total | 5.77850569 68 .084978025
. anova y twin
Number of obs = 69 R-squared = 0.0198
Root MSE = .29295 Adj R-squared = -0.0099
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
Model | .114392887 2 .057196444 0.67 0.5169
|
twin | .114392887 2 .057196444 0.67 0.5169
|
Residual | 5.6641128 66 .085819891
-----------+----------------------------------------------------
Total | 5.77850569 68 .084978025
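The twin SS are the same in the two preceding anovas. The tests differ only in
the denominator of the F: the sequential table divides the twin MS by the
residual MS of the full model (.080048561 on 63 df), whereas anova y twin
divides it by its own residual MS (.085819891 on 66 df). I (and my profs)
prefer the second for testing “main effects”.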
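The two F statistics can be recomputed from the numbers printed in the tables above. This is a small arithmetic check, not part of the original session; the values are copied from the output.

```stata
* Numerator: twin MS, identical in both tables.
* Denominator 1: residual MS from anova y twin twin#woman, seq (63 df).
display %4.2f .057196444/.080048561

* Denominator 2: residual MS from anova y twin (66 df).
display %4.2f .057196444/.085819891
```

The two lines display 0.71 and 0.67, matching the F columns of the respective tables.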
Clearly, I take a model-building approach to anova and think in terms of the
equivalent regression.
You can type regress
after running anova to view an equivalent regression.
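As a minimal sketch of that last point, assuming the example data are still in memory, typing regress with no arguments immediately after anova replays the fitted model as a regression. The coefficient table should show which levels were dropped because of collinearity.

```stata
* Fit the ANOVA, then replay the same fit as a regression.
anova y twin woman
regress
```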