How does the anova command handle collinearity?
Title: The anova command and collinearity
Author: William Sribney, StataCorp
Date: March 1997; minor revisions July 2007
Here’s an example that illustrates what happens.
. input woman twin
woman twin
1. 1 1
2. 2 1
3. 3 2
4. 4 2
5. 5 3
6. 6 3
7. end
. tab woman, gen(w)
woman | Freq. Percent Cum.
------------+-----------------------------------
1 | 1 16.67 16.67
2 | 1 16.67 33.33
3 | 1 16.67 50.00
4 | 1 16.67 66.67
5 | 1 16.67 83.33
6 | 1 16.67 100.00
------------+-----------------------------------
Total | 6 100.00
. tab twin, gen(t)
twin | Freq. Percent Cum.
------------+-----------------------------------
1 | 2 33.33 33.33
2 | 2 33.33 66.67
3 | 2 33.33 100.00
------------+-----------------------------------
Total | 6 100.00
. gen t1w1 = t1*w1 - t1*w2
. gen t2w3 = t2*w3 - t2*w4
. gen t3w5 = t3*w5 - t3*w6
. list w* t*, nodisplay
+--------------------------------------------------------------------------------+
| woman w1 w2 w3 w4 w5 w6 twin t1 t2 t3 t1w1 t2w3 t3w5 |
|--------------------------------------------------------------------------------|
1. | 1 1 0 0 0 0 0 1 1 0 0 1 0 0 |
2. | 2 0 1 0 0 0 0 1 1 0 0 -1 0 0 |
3. | 3 0 0 1 0 0 0 2 0 1 0 0 1 0 |
4. | 4 0 0 0 1 0 0 2 0 1 0 0 -1 0 |
5. | 5 0 0 0 0 1 0 3 0 0 1 0 0 1 |
|--------------------------------------------------------------------------------|
6. | 6 0 0 0 0 0 1 3 0 0 1 0 0 -1 |
+--------------------------------------------------------------------------------+
. gen x = 12 - int(2*uniform())
. expand x
(62 observations created)
. set seed 123
. gen y = uniform()
. anova y woman twin
Number of obs = 70 R-squared = 0.0801
Root MSE = .288572 Adj R-squared = 0.0082
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
Model | .463881941 5 .092776388 1.11 0.3618
|
woman | .463881941 5 .092776388 1.11 0.3618
twin | 0 0
|
Residual | 5.32951143 64 .083273616
-----------+----------------------------------------------------
Total | 5.79339337 69 .083962223
. regress y w1-w5 t1-t3
Source | SS df MS Number of obs = 70
---------+------------------------------ F( 5, 64) = 1.11
Model | .463881941 5 .092776388 Prob > F = 0.3618
Residual | 5.32951143 64 .083273616 R-squared = 0.0801
---------+------------------------------ Adj R-squared = 0.0082
Total | 5.79339337 69 .083962223 Root MSE = .28857
--------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+----------------------------------------------------------------
w1 | (dropped)
w2 | -.0548665 .1204566 -0.46 0.650 -.2955063 .1857732
w3 | -.0594359 .1178089 -0.50 0.616 -.2947862 .1759144
w4 | (dropped)
w5 | .209396 .1204566 1.74 0.087 -.0312437 .4500358
t1 | (dropped)
t2 | -.1140098 .1178089 -0.97 0.337 -.3493601 .1213405
t3 | -.1589105 .1178089 -1.35 0.182 -.3942608 .0764398
_cons | .5604848 .0833035 6.73 0.000 .394067 .7269026
--------------------------------------------------------------------------
The regress model is obviously collinear, but so was the anova model. When
terms are collinear, the anova command keeps them in order from left to
right. Hence, it “dropped” the twin effect (i.e., all the twin dummies).
. anova y twin woman
Number of obs = 70 R-squared = 0.0801
Root MSE = .288572 Adj R-squared = 0.0082
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
Model | .463881941 5 .092776388 1.11 0.3618
|
twin | .062313132 2 .031156566 0.37 0.6894
woman | .290114367 3 .096704789 1.16 0.3315
|
Residual | 5.32951143 64 .083273616
-----------+----------------------------------------------------
Total | 5.79339337 69 .083962223
Again, anova keeps terms from left to right; here it kept the twin effect
and only three out of the six woman dummies.
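The degrees of freedom in the two anova tables above can be reproduced by counting how much each term raises the rank of the design matrix as terms enter from left to right. The following is a minimal Python sketch of that idea, not a description of how anova is implemented internally. The helper names (rank, dummies, df_by_term) are made up for illustration, and only the six base rows from the input step are used, since duplicating rows with expand does not change which columns are collinear.

```python
from fractions import Fraction

def rank(columns):
    """Rank of the matrix with the given columns, by exact Gaussian elimination."""
    if not columns:
        return 0
    n_rows = len(columns[0])
    rows = [[Fraction(col[i]) for col in columns] for i in range(n_rows)]
    r = 0  # number of pivots found so far
    for c in range(len(columns)):
        pivot = next((i for i in range(r, n_rows) if rows[i][c] != 0), None)
        if pivot is None:
            continue  # this column depends on the columns already kept
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(n_rows):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c] / rows[r][c]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == n_rows:
            break
    return r

def dummies(values):
    """One 0/1 indicator column per distinct level, like tab ..., gen()."""
    return [[1 if v == level else 0 for v in values] for level in sorted(set(values))]

# Base data from the input step: one row per woman.
woman = [1, 2, 3, 4, 5, 6]
twin = [1, 1, 2, 2, 3, 3]

def df_by_term(terms):
    """Degrees of freedom each term adds when terms enter left to right."""
    cols = [[1] * len(terms[0][1])]  # start with the constant
    added = []
    for name, values in terms:
        before = rank(cols)
        cols = cols + dummies(values)
        added.append((name, rank(cols) - before))
    return added

print(df_by_term([("woman", woman), ("twin", twin)]))
print(df_by_term([("twin", twin), ("woman", woman)]))
```

With woman first, the woman dummies already span the twin dummies, so twin adds nothing. With twin first, twin takes 2 degrees of freedom and woman can add only 3 more.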
. anova y twin twin*woman
Number of obs = 70 R-squared = 0.0801
Root MSE = .288572 Adj R-squared = 0.0082
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
Model | .463881941 5 .092776388 1.11 0.3618
|
twin | .175133264 2 .087566632 1.05 0.3554
twin*woman | .290114367 3 .096704789 1.16 0.3315
|
Residual | 5.32951143 64 .083273616
-----------+----------------------------------------------------
Total | 5.79339337 69 .083962223
Below, we do the equivalent regression.
. regress y t1 t2 t1w1 t2w3 t3w5
Source | SS df MS Number of obs = 70
---------+------------------------------ F( 5, 64) = 1.11
Model | .463881941 5 .092776388 Prob > F = 0.3618
Residual | 5.32951143 64 .083273616 R-squared = 0.0801
---------+------------------------------ Adj R-squared = 0.0082
Total | 5.79339337 69 .083962223 Root MSE = .28857
--------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
---------+----------------------------------------------------------------
t1 | .0267792 .0851757 0.31 0.754 -.1433788 .1969372
t2 | -.0895153 .0842448 -1.06 0.292 -.2578136 .078783
t1w1 | .0274333 .0602283 0.46 0.650 -.0928866 .1477531
t2w3 | -.029718 .0589044 -0.50 0.616 -.1473931 .0879572
t3w5 | .104698 .0602283 1.74 0.087 -.0156219 .2250179
_cons | .5062724 .0602283 8.41 0.000 .3859525 .6265922
--------------------------------------------------------------------------
. test t1 t2
( 1) t1 = 0.0
( 2) t2 = 0.0
F( 2, 62) = 1.05
Prob > F = 0.3554
I made the interactions orthogonal, which is essentially what
anova does.
. test t1w1 t2w3 t3w5
( 1) t1w1 = 0.0
( 2) t2w3 = 0.0
( 3) t3w5 = 0.0
F( 3, 62) = 1.16
Prob > F = 0.3315
You understand the above Wald tests. The anova partial SS and their tests
are equivalent to them. I call them “added-last” tests because each term is
tested as if it were the last one added to the model.
The test of t1 = t2 = 0 is a test of
y = t1w1 t2w3 t3w5 t1 t2
vs.
y = t1w1 t2w3 t3w5
(Comment: It’s kind of a stupid test in this case. Obviously, partial
SS and their tests make more sense for different covariates rather than
interactions and main effects.)
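As a quick arithmetic check of that equivalence, the F statistics reported by test can be recomputed from the partial SS in the anova y twin twin*woman table. This is a small Python sketch using only numbers copied from that table; the function name added_last_f is hypothetical.

```python
def added_last_f(partial_ss, df_term, resid_ss, df_resid):
    """F for a term added last: its mean square over the residual mean square."""
    return (partial_ss / df_term) / (resid_ss / df_resid)

# Residual line of the anova y twin twin*woman table.
resid_ss, df_resid = 5.32951143, 64

f_twin = added_last_f(0.175133264, 2, resid_ss, df_resid)
f_interaction = added_last_f(0.290114367, 3, resid_ss, df_resid)

print(round(f_twin, 2), round(f_interaction, 2))
```

Rounded to two decimals, these agree with the F values shown by the two test commands.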
The following explains sequential SS:
. anova y twin twin*woman, seq
Number of obs = 70 R-squared = 0.0801
Root MSE = .288572 Adj R-squared = 0.0082
Source | Seq. SS df MS F Prob > F
-----------+----------------------------------------------------
Model | .463881941 5 .092776388 1.11 0.3618
|
twin | .173767574 2 .086883787 1.04 0.3582
twin*woman | .290114367 3 .096704789 1.16 0.3315
|
Residual | 5.32951143 64 .083273616
-----------+----------------------------------------------------
Total | 5.79339337 69 .083962223
. anova y twin
Number of obs = 70 R-squared = 0.0300
Root MSE = .289612 Adj R-squared = 0.0010
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
Model | .173767574 2 .086883787 1.04 0.3605
|
twin | .173767574 2 .086883787 1.04 0.3605
|
Residual | 5.6196258 67 .083875012
-----------+----------------------------------------------------
Total | 5.79339337 69 .083962223
The twin SS are the same in the two preceding
anovas. The difference in the tests is in the
denominator of the F. The residuals are obviously
different. I (and my profs) prefer the second for testing “main
effects”.
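The arithmetic behind this comparison can be sketched in Python from the numbers in the two tables. A sequential SS is the drop in residual SS when the term enters the model, and the two F statistics for twin share a numerator but divide by different residual mean squares. The variable names are made up for illustration; the values are copied from the output above.

```python
# Values copied from the anova tables above.
total_ss = 5.79339337
resid_ss_full = 5.32951143      # after twin and twin*woman, 64 df
resid_ss_twin_only = 5.6196258  # after twin alone, 67 df
seq_ss_twin = 0.173767574
seq_ss_interaction = 0.290114367

# Sequential SS = reduction in residual SS when the term is added.
ss_twin = total_ss - resid_ss_twin_only
ss_interaction = resid_ss_twin_only - resid_ss_full

# Same numerator (twin mean square), different denominators.
ms_twin = seq_ss_twin / 2
f_full = ms_twin / (resid_ss_full / 64)
f_twin_only = ms_twin / (resid_ss_twin_only / 67)

print(ss_twin, ss_interaction)
print(f_full, f_twin_only)
```

Both F values round to 1.04, but they are not identical, and their denominator degrees of freedom differ (64 versus 67), which is why the reported p-values differ.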
Clearly, I take a model-building approach to anova and think in terms of the
equivalent regression.
You can type regress after running anova to view an equivalent regression.
When you specify an interaction, anova always includes the main effects
for that interaction, even if you don’t list them explicitly.
. anova y twin*woman
Number of obs = 70 R-squared = 0.0801
Root MSE = .288572 Adj R-squared = 0.0082
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
Model | .463881941 5 .092776388 1.11 0.3618
|
twin*woman | .463881941 5 .092776388 1.11 0.3618
|
Residual | 5.32951143 64 .083273616
-----------+----------------------------------------------------
Total | 5.79339337 69 .083962223
The sequential option does the same.
. anova y twin*woman twin, seq
Number of obs = 70 R-squared = 0.0801
Root MSE = .288572 Adj R-squared = 0.0082
Source | Seq. SS df MS F Prob > F
-----------+----------------------------------------------------
Model | .463881941 5 .092776388 1.11 0.3618
|
twin*woman | .463881941 5 .092776388 1.11 0.3618
twin | 0 0
|
Residual | 5.32951143 64 .083273616
-----------+----------------------------------------------------
Total | 5.79339337 69 .083962223