st: Anova and Contrasts with missing cells

From: "Steichen, Thomas J."
To: "'statalist@hsphsun2.harvard.edu'"
Subject: st: Anova and Contrasts with missing cells
Date: Fri, 24 Oct 2008 17:18:33 -0400

Listmembers,

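I have a question about contrasts after ANOVA for the following example dataset (columns: round, size, the response nnn, and an adjusted copy of nnn that is described further below), which I also summarize via -table- below:

input round size nnn nnn_adjm
1 600 .532 .532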
1 600 .573 .573
1 600 .581 .581
2 600 .609 .609
2 600 .465 .465
2 600 .593 .593
3 400 .413 .5756667
3 400 .406 .5686666
3 400 .418 .5806667
3 800 .725 .5623333
3 800 .815 .6523333
3 800 .673 .5103333
4 600 .552 .552
4 600 .585 .585
4 600 .588 .588
4 600 .733 .733
4 600 .608 .608
5 600 .640 .640
5 600 .643 .643
5 600 .906 .906
5 600 .853 .853
5 600 .847 .847
end

. table round size, c(mean nnn sd nnn n nnn) row col for(%7.3f)

--------------------------------------
          |            size
    round |   400    600    800  Total
----------+---------------------------
        1 |        0.562         0.562
          |        0.026         0.026
          |            3             3
          |
        2 |        0.556         0.556
          |        0.079         0.079
          |            3             3
          |
        3 | 0.412         0.738  0.575
          | 0.006         0.072  0.184
          |     3             3      6
          |
        4 |        0.613         0.613
          |        0.070         0.070
          |            5             5
          |
        5 |        0.778         0.778
          |        0.127         0.127
          |            5             5
          |
    Total | 0.412  0.644  0.738  0.625
          | 0.006  0.125  0.072  0.142
          |     3     16      3     22
--------------------------------------

The interesting feature of this dataset is that round 3 has data at two 'size' levels (400 and 800), both of which differ from the single 'size' (600) used in all other rounds. It is also notable that the per-cell sample size in rounds 4 and 5 (n = 5) differs from that in rounds 1, 2 and 3 (n = 3).

If one runs an ANOVA on nnn followed by contrasts, any contrast not involving round 3 seems reasonable; however, those involving round 3 seem dubious. The examples below show the ANOVA and three example contrasts.

. anova nnn round size|round
             Number of obs =      22     R-squared     =  0.7465
             Root MSE      = .082094     Adj R-squared =  0.6673

    Source |  Partial SS    df       MS           F     Prob > F
-----------+----------------------------------------------------
     Model |  .317523495     5  .063504699       9.42     0.0002
           |
     round |  .158760825     4  .039690206       5.89     0.0041
size|round |   .15876267     1   .15876267      23.56     0.0002
           |
  Residual |  .107829606    16   .00673935
-----------+----------------------------------------------------
     Total |  .425353101    21   .02025491

. test _coef[round[1]] = _coef[round[2]]

( 1)  round[1] - round[2] = 0

F(  1,    16) =    0.01
Prob > F =    0.9259

. test _coef[round[1]] = _coef[round[3]]

( 1)  round[1] - round[3] = 0

F(  1,    16) =    6.87
Prob > F =    0.0185

. test _coef[round[1]] = _coef[round[4]]

( 1)  round[1] - round[4] = 0

F(  1,    16) =    0.73
Prob > F =    0.4057

What is odd about this second contrast is that the means of rounds 1 and 3 differ by only 0.013 units (those in the first contrast differ by 0.006 and in the third by 0.051, with fairly similar sd's). So why is contrast 2 significant? (In fact, any contrast involving round 3 seems wrong.)
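
One way to probe this is to rebuild the F statistics by hand from the cell means and the residual MS. The sketch below assumes, without confirmation from Stata's documentation or output, that _coef[round[3]] reflects only the round 3, size 800 cell (mean 0.738, n = 3) rather than the pooled round 3 mean. The scalar names are made up for illustration. Under that assumption the arithmetic gives F values that round to the 0.01, 6.87 and 0.73 reported above.

```stata
* Hypothetical check: rebuild the three reported F statistics from cell summaries.
* ASSUMPTION (not confirmed in the post): the round[3] coefficient corresponds
* to the round 3, size 800 cell, not to the pooled round 3 mean.
scalar mse  = .00673935
* residual MS above, then cell means computed from the raw data
scalar m1   = (.532 + .573 + .581)/3
scalar m2   = (.609 + .465 + .593)/3
scalar m3hi = (.725 + .815 + .673)/3
scalar m4   = (.552 + .585 + .588 + .733 + .608)/5

* F = (difference in means)^2 / (MSE * (1/n_a + 1/n_b)), with 1 and 16 df
scalar F12 = (m1 - m2)^2   / (mse*(1/3 + 1/3))
scalar F13 = (m1 - m3hi)^2 / (mse*(1/3 + 1/3))
scalar F14 = (m1 - m4)^2   / (mse*(1/3 + 1/5))

display "round 1 vs 2:            F = " %6.2f F12
display "round 1 vs 3 (size 800): F = " %6.2f F13
display "round 1 vs 4:            F = " %6.2f F14
```

If this reading is right, the second contrast is comparing 0.562 with 0.738 rather than with 0.575, which would account for its significance.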

As a check, I adjusted the data (the fourth data column above): for round 3 alone, the two 'size' subsets are shifted to have the same mean. (For other rounds, the values are retained as is.) In pseudocode, something like:

gen nnn_adjm(i) = nnn(i) - mean(nnn|size(j)) + mean(nnn)

That is, within round 3 we subtract from each observation i the mean of its size subset j and add back the round 3 mean taken over both sizes. This, effectively, is what ANOVA does to account for the size effect.
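
In actual Stata the adjustment could be sketched as follows. This assumes round, size and nnn are in memory. The names cellmean, roundmean and nnn_adj2 are hypothetical, chosen so as not to collide with any existing variable. Because every other round has a single size, its cell mean equals its round mean and its values are left unchanged.

```stata
* Sketch of the adjustment; cellmean, roundmean and nnn_adj2 are hypothetical names.
* Mean of nnn within each round-by-size cell
egen double cellmean  = mean(nnn), by(round size)
* Mean of nnn within each round, pooled over sizes
egen double roundmean = mean(nnn), by(round)
* Remove the cell mean and add back the round mean
generate double nnn_adj2 = nnn - cellmean + roundmean
* Show cell means of the adjusted variable
tabulate round size, summarize(nnn_adj2) means
```

The two round 3 cells should then show the same mean, as in the summary that follows.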

This gives us the following summary stats:

--------------------------------------
          |            size
    round |   400    600    800  Total
----------+---------------------------
        1 |        0.562         0.562
          |        0.026         0.026
          |            3             3
          |
        2 |        0.556         0.556
          |        0.079         0.079
          |            3             3
          |
        3 | 0.575         0.575  0.575
          | 0.006         0.072  0.046
          |     3             3      6
          |
        4 |        0.613         0.613
          |        0.070         0.070
          |            5             5
          |
        5 |        0.778         0.778
          |        0.127         0.127
          |            5             5
          |
    Total | 0.575  0.644  0.575  0.625
          | 0.006  0.125  0.072  0.113
          |     3     16      3     22
--------------------------------------

Note that the two size categories in round 3 now have the same mean but retain their sd's from before adjustment.

Now, if we repeat the ANOVA and contrasts on this adjusted variable, we get:

             Number of obs =      22     R-squared     =  0.5955
             Root MSE      = .082094     Adj R-squared =  0.4691

    Source |  Partial SS    df       MS           F     Prob > F
-----------+----------------------------------------------------
     Model |  .158760831     5  .031752166       4.71     0.0078
           |
     round |  .158760831     4  .039690208       5.89     0.0041
size|round |           0     1           0       0.00     1.0000
           |
  Residual |  .107829606    16   .00673935
-----------+----------------------------------------------------
     Total |  .266590437    21  .012694783

. test _coef[round[1]] = _coef[round[2]]

( 1)  round[1] - round[2] = 0

F(  1,    16) =    0.01
Prob > F =    0.9259

. test _coef[round[1]] = _coef[round[3]]

( 1)  round[1] - round[3] = 0

F(  1,    16) =    0.04
Prob > F =    0.8487

. test _coef[round[1]] = _coef[round[4]]

( 1)  round[1] - round[4] = 0

F(  1,    16) =    0.73
Prob > F =    0.4057

As expected, in the ANOVA the sum of squares for size|round is zero and the SS for round and residual are the same as before (less a little meaningless roundoff error).
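
The bookkeeping behind this can be checked directly from the two ANOVA tables. The sketch below uses only the reported sums of squares; the scalar names are hypothetical.

```stata
* Arithmetic check using the sums of squares printed in the two ANOVA tables.
* Unadjusted Total SS and unadjusted size|round SS
scalar tot_raw  = .425353101
scalar ss_size  = .15876267
* Adjusted Total SS and adjusted round (= Model) SS
scalar tot_adj  = .266590437
scalar ss_round = .158760831

* Removing the size|round SS from the raw total should give the adjusted total
scalar tot_diff = tot_raw - ss_size
display "raw Total SS - SS(size|round) = " %12.9f tot_diff
* With size|round at zero, R-squared is just SS(round) / Total SS
scalar r2_adj = ss_round/tot_adj
display "adjusted R-squared            = " %6.4f r2_adj
```

The first value should match the adjusted Total SS up to roundoff, and the second the adjusted R-squared of 0.5955.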

Likewise, contrasts not involving round 3 are identical to those from the unadjusted data, but the one involving round 3 has changed greatly (from p = 0.0185 to p = 0.8487). These adjusted results seem much more reasonable (as do any other contrasts involving round 3).

If one compares these contrast results to what SAS or JMP produce, those not involving round 3 are identical to those of Stata. However, both SAS and JMP produce p = 0.8256 for the second contrast above. Generally, SAS and JMP produce p's for contrasts involving round 3 that are close to, but different from, those produced by Stata using the 'adjusted' data above. Also, SAS and JMP produce identical results using the raw vs. adjusted data (whether round 3 is involved in the contrast or not).
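
A possible source of the gap between 0.8487 and 0.8256 is the standard error attached to the round 3 mean. The sketch below is only arithmetic on the reported residual MS and means, with hypothetical scalar names. Version (a) treats the round 3 mean as if it came from a single cell of 3 observations. Version (b) treats it as the unweighted average of two cells of 3 observations each. Whether SAS and JMP actually compute (b) is an assumption here, not something the output above confirms.

```stata
* Hypothetical comparison of two variances for the round 1 vs round 3 contrast.
scalar mse = .00673935
* Round 3 mean minus round 1 mean, from the adjusted table
scalar d13 = .575 - .562

* (a) round 3 treated as one cell of n = 3
scalar Fa = d13^2 / (mse*(1/3 + 1/3))
scalar pa = Ftail(1, 16, Fa)

* (b) round 3 as the unweighted average of two cells with n = 3 each,
*     so its variance is (1/4)*(mse/3 + mse/3)
scalar Fb = d13^2 / (mse*(1/3 + (1/4)*(1/3 + 1/3)))
scalar pb = Ftail(1, 16, Fb)

display "(a) F = " %6.3f Fa "   p = " %6.4f pa
display "(b) F = " %6.3f Fb "   p = " %6.4f pb
```

If (a) lands near 0.8487 and (b) near 0.8256, the difference would come down to the variance given to the round 3 mean rather than to the mean itself.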

I will speculate that the difference in answers is due to the unequal sample sizes and/or the cells with no data. But the question remains: which is correct?

Tom

-----------------------------------
Thomas J. Steichen
steicht@rjrt.com
-----------------------------------
