st: Anova and Contrasts with missing cells

From	"Steichen, Thomas J." <[email protected]>
To	"'[email protected]'" <[email protected]>
Subject	st: Anova and Contrasts with missing cells
Date	Fri, 24 Oct 2008 17:18:33 -0400
Listmembers,

I have a question about contrasts after ANOVA for the following example dataset, which I also summarize via -table- below:

input round size nnn nnn_adjm
1 600 .532 .532
1 600 .573 .573
1 600 .581 .581
2 600 .609 .609
2 600 .465 .465
2 600 .593 .593
3 400 .413 .5756667
3 400 .406 .5686666
3 400 .418 .5806667
3 800 .725 .5623333
3 800 .815 .6523333
3 800 .673 .5103333
4 600 .552 .552
4 600 .585 .585
4 600 .588 .588
4 600 .733 .733
4 600 .608 .608
5 600 .640 .640
5 600 .643 .643
5 600 .906 .906
5 600 .853 .853
5 600 .847 .847
end

. table round size, c(mean nnn sd nnn n nnn) row col for(%7.3f)

--------------------------------------
          |            size
    round |   400    600    800  Total
----------+---------------------------
        1 |        0.562         0.562
          |        0.026         0.026
          |            3             3
          |
        2 |        0.556         0.556
          |        0.079         0.079
          |            3             3
          |
        3 | 0.412         0.738  0.575
          | 0.006         0.072  0.184
          |     3             3      6
          |
        4 |        0.613         0.613
          |        0.070         0.070
          |            5             5
          |
        5 |        0.778         0.778
          |        0.127         0.127
          |            5             5
          |
    Total | 0.412  0.644  0.738  0.625
          | 0.006  0.125  0.072  0.142
          |     3     16      3     22
--------------------------------------
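
As an independent cross-check of the -table- output above, here is a minimal sketch in Python (standard library only) that recomputes the mean, sd and n for each round-by-size cell and for each round. The names DATA and summarize are illustrative only and are not part of any Stata or SAS workflow.

```python
from collections import defaultdict
from statistics import mean, stdev

# (round, size, nnn) triples copied from the -input- listing above
DATA = [
    (1, 600, .532), (1, 600, .573), (1, 600, .581),
    (2, 600, .609), (2, 600, .465), (2, 600, .593),
    (3, 400, .413), (3, 400, .406), (3, 400, .418),
    (3, 800, .725), (3, 800, .815), (3, 800, .673),
    (4, 600, .552), (4, 600, .585), (4, 600, .588),
    (4, 600, .733), (4, 600, .608),
    (5, 600, .640), (5, 600, .643), (5, 600, .906),
    (5, 600, .853), (5, 600, .847),
]


def summarize(rows, key):
    """Group nnn by key(round, size) and return {group: (mean, sd, n)}."""
    groups = defaultdict(list)
    for rnd, size, y in rows:
        groups[key(rnd, size)].append(y)
    return {g: (mean(v), stdev(v), len(v)) for g, v in groups.items()}


cells = summarize(DATA, lambda r, s: (r, s))  # one entry per non-empty cell
rounds = summarize(DATA, lambda r, s: r)      # the "Total" column, by round

for (r, s), (m, sd, n) in sorted(cells.items()):
    print(f"round {r} size {s}: mean={m:.3f} sd={sd:.3f} n={n}")
for r, (m, sd, n) in sorted(rounds.items()):
    print(f"round {r} total:    mean={m:.3f} sd={sd:.3f} n={n}")
```

Only six of the fifteen round-by-size cells contain data, which is the "missing cells" situation in the subject line.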

The interesting feature of this dataset is that round 3 has data at two 'size' levels (400 and 800), both of which differ from the single 'size' (600) used in all other rounds. It is also notable that the sample sizes differ: rounds 4 and 5 have 5 observations each, rounds 1 and 2 have 3 each, and round 3 has 3 at each of its two sizes.

If one does an ANOVA for the nnn data followed by contrasts, any contrast not involving round 3 seems reasonable; however, those involving round 3 seem dubious. The output below shows the ANOVA and three example contrasts.

. anova nnn round size|round
             Number of obs =      22     R-squared     =  0.7465
             Root MSE      = .082094     Adj R-squared =  0.6673

    Source |  Partial SS    df       MS           F     Prob > F
-----------+----------------------------------------------------
     Model |  .317523495     5  .063504699       9.42     0.0002
           |
     round |  .158760825     4  .039690206       5.89     0.0041
size|round |   .15876267     1   .15876267      23.56     0.0002
           |
  Residual |  .107829606    16   .00673935
-----------+----------------------------------------------------
     Total |  .425353101    21   .02025491

. test _coef[round[1]] = _coef[round[2]]

 ( 1)  round[1] - round[2] = 0

       F(  1,    16) =    0.01
            Prob > F =    0.9259

. test _coef[round[1]] = _coef[round[3]]

 ( 1)  round[1] - round[3] = 0

       F(  1,    16) =    6.87
            Prob > F =    0.0185

. test _coef[round[1]] = _coef[round[4]]

 ( 1)  round[1] - round[4] = 0

       F(  1,    16) =    0.73
            Prob > F =    0.4057

What is odd about this second contrast is that the means of rounds 1 and 3 differ by only 0.013 units (those in the first contrast differ by 0.006 and in the third by 0.051, with fairly similar sd's). So why is contrast 2 significant? (In fact, any contrast involving round 3 seems wrong.)
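
The three F statistics above can be reproduced by hand from the cell means and the residual mean square. The sketch below (Python, standard library only; CELLS and contrast_f are illustrative names) does this for a general linear combination of cell means. It rests on an assumption that the output itself does not state: that the round 1 vs. round 3 test compares round 1 with the size-800 cell of round 3 alone. The last line shows, for comparison, what happens if the two round-3 cells are instead weighted equally.

```python
from statistics import mean

# nnn values by (round, size) cell, copied from the listing above
CELLS = {
    (1, 600): [.532, .573, .581],
    (2, 600): [.609, .465, .593],
    (3, 400): [.413, .406, .418],
    (3, 800): [.725, .815, .673],
    (4, 600): [.552, .585, .588, .733, .608],
    (5, 600): [.640, .643, .906, .853, .847],
}

# Pooled within-cell variance; this should match the ANOVA residual MS
sse = sum((y - mean(v)) ** 2 for v in CELLS.values() for y in v)
df = sum(len(v) - 1 for v in CELLS.values())
mse = sse / df


def contrast_f(weights):
    """F(1, df) for a linear combination of cell means; weights = {cell: w}."""
    estimate = sum(w * mean(CELLS[c]) for c, w in weights.items())
    variance = mse * sum(w ** 2 / len(CELLS[c]) for c, w in weights.items())
    return estimate ** 2 / variance


f_1v2 = contrast_f({(1, 600): 1, (2, 600): -1})
f_1v4 = contrast_f({(1, 600): 1, (4, 600): -1})
# ASSUMPTION: round 3 represented by its size-800 cell only
f_1v3_800 = contrast_f({(1, 600): 1, (3, 800): -1})
# Alternative reading: equal-weight average of the two round-3 cells
f_1v3_avg = contrast_f({(1, 600): 1, (3, 400): -.5, (3, 800): -.5})

print(f"MSE = {mse:.8f} on {df} df")
print(f"F(1 vs 2) = {f_1v2:.2f}, F(1 vs 4) = {f_1v4:.2f}")
print(f"F(1 vs 3, size 800 only) = {f_1v3_800:.2f}")
print(f"F(1 vs 3, cells averaged) = {f_1v3_avg:.2f}")
```

Under the size-800 reading the value for round 1 vs. round 3 rounds to the reported 6.87, whereas the equal-weight reading gives about 0.05. This only shows which arithmetic matches the numbers; it does not establish what -test- evaluates internally.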

To explore this further, I created a new variable nnn_adjm, where _adjm stands for adjusted mean.

The adjustment is, for round 3 alone, to shift the two 'size' subsets so that they have the same mean. (For other rounds, the values are retained as is.) In pseudocode, something like:

   gen nnn_adjm(i) = nnn(i) - mean(nnn | round 3, size j) + mean(nnn | round 3)

That is, we subtract from each round-3 observation i the mean of its size subset j and add back the round-3 mean taken over both sizes. This, effectively, is what ANOVA does to account for the size effect.
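
The adjustment can be sketched in runnable form as follows (Python, standard library only; CELLS and adjust are illustrative names). The sketch centres every cell on the mean of its round, which is an assumption made for brevity: for rounds with a single size the cell mean and the round mean coincide, so only round 3 actually changes.

```python
from statistics import mean

# nnn values by (round, size) cell, copied from the listing above
CELLS = {
    (1, 600): [.532, .573, .581],
    (2, 600): [.609, .465, .593],
    (3, 400): [.413, .406, .418],
    (3, 800): [.725, .815, .673],
    (4, 600): [.552, .585, .588, .733, .608],
    (5, 600): [.640, .643, .906, .853, .847],
}


def adjust(cells):
    """Shift each cell so its mean equals its round's mean; spread is kept."""
    by_round = {}
    for (rnd, _size), values in cells.items():
        by_round.setdefault(rnd, []).extend(values)
    return {
        cell: [y - mean(values) + mean(by_round[cell[0]]) for y in values]
        for cell, values in cells.items()
    }


adjusted = adjust(CELLS)
for cell in [(3, 400), (3, 800)]:
    print(cell, [f"{y:.7f}" for y in adjusted[cell]])
```

The round-3 values printed here should agree with the nnn_adjm column in the -input- listing.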

This gives us the following summary stats:

. table round size, c(mean nnn_adjm sd nnn_adjm n nnn_adjm) row col for(%7.3f)

--------------------------------------
          |            size
    round |   400    600    800  Total
----------+---------------------------
        1 |        0.562         0.562
          |        0.026         0.026
          |            3             3
          |
        2 |        0.556         0.556
          |        0.079         0.079
          |            3             3
          |
        3 | 0.575         0.575  0.575
          | 0.006         0.072  0.046
          |     3             3      6
          |
        4 |        0.613         0.613
          |        0.070         0.070
          |            5             5
          |
        5 |        0.778         0.778
          |        0.127         0.127
          |            5             5
          |
    Total | 0.575  0.644  0.575  0.625
          | 0.006  0.125  0.072  0.113
          |     3     16      3     22
--------------------------------------

Note that the two size categories in round 3 now have the same mean but retain their sd's from before adjustment.

Now, if we repeat the ANOVA and contrasts on this adjusted variable, we get:

. anova nnn_adjm round size|round

             Number of obs =      22     R-squared     =  0.5955
             Root MSE      = .082094     Adj R-squared =  0.4691

    Source |  Partial SS    df       MS           F     Prob > F
-----------+----------------------------------------------------
     Model |  .158760831     5  .031752166       4.71     0.0078
           |
     round |  .158760831     4  .039690208       5.89     0.0041
size|round |           0     1           0       0.00     1.0000
           |
  Residual |  .107829606    16   .00673935
-----------+----------------------------------------------------
     Total |  .266590437    21  .012694783

. test _coef[round[1]] = _coef[round[2]]

 ( 1)  round[1] - round[2] = 0

       F(  1,    16) =    0.01
            Prob > F =    0.9259

. test _coef[round[1]] = _coef[round[3]]

 ( 1)  round[1] - round[3] = 0

       F(  1,    16) =    0.04
            Prob > F =    0.8487

. test _coef[round[1]] = _coef[round[4]]

 ( 1)  round[1] - round[4] = 0

       F(  1,    16) =    0.73
            Prob > F =    0.4057

As expected, in the ANOVA the sum of squares for size|round is zero and the SS for round and residual are the same as before (apart from negligible roundoff error).
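
The sums of squares in both ANOVA tables can be checked directly from the data. The following is a minimal sketch (Python, standard library only; decompose, CELLS and ADJ are illustrative names) that splits the total SS into a between-round part, a size-within-round part and a within-cell residual. Treating these three pieces as the quantities reported in the Stata table is an assumption, although the numbers agree.

```python
from statistics import mean

# nnn values by (round, size) cell, copied from the listing above
CELLS = {
    (1, 600): [.532, .573, .581],
    (2, 600): [.609, .465, .593],
    (3, 400): [.413, .406, .418],
    (3, 800): [.725, .815, .673],
    (4, 600): [.552, .585, .588, .733, .608],
    (5, 600): [.640, .643, .906, .853, .847],
}


def decompose(cells):
    """Return (SS round, SS size|round, SS residual)."""
    by_round = {}
    for (rnd, _size), values in cells.items():
        by_round.setdefault(rnd, []).extend(values)
    grand = mean(y for values in cells.values() for y in values)
    ss_round = sum(len(v) * (mean(v) - grand) ** 2 for v in by_round.values())
    ss_size = sum(
        len(v) * (mean(v) - mean(by_round[c[0]])) ** 2 for c, v in cells.items()
    )
    ss_resid = sum((y - mean(v)) ** 2 for v in cells.values() for y in v)
    return ss_round, ss_size, ss_resid


# Adjusted data: move both round-3 cells onto the round-3 mean
round3 = CELLS[(3, 400)] + CELLS[(3, 800)]
ADJ = dict(CELLS)
for cell in [(3, 400), (3, 800)]:
    ADJ[cell] = [y - mean(CELLS[cell]) + mean(round3) for y in CELLS[cell]]

raw = decompose(CELLS)
adj = decompose(ADJ)
print("raw:      round=%.9f size|round=%.9f resid=%.9f" % raw)
print("adjusted: round=%.9f size|round=%.9f resid=%.9f" % adj)
```

The adjustment removes the size|round SS and leaves the other two pieces untouched, which is why the Root MSE is .082094 in both tables.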

Likewise, contrasts not involving round 3 are identical to those for the unadjusted data, but the one involving round 3 has greatly changed (from p = 0.0185 to p = 0.8487). These adjusted results seem much more reasonable (as do any other contrasts involving round 3).

If one compares these contrast results to what SAS or JMP produce, those not involving round 3 are identical to those of Stata. However, both SAS and JMP produce p = 0.8256 for the second contrast above. Generally, SAS and JMP produce p's for contrasts involving round 3 that are close to, but different from, those produced by Stata using the 'adjusted' data above. Also, SAS and JMP produce identical results using the raw vs. adjusted data (whether round 3 is involved in the contrast or not).
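
One way to see where a p-value near 0.8256 could come from is to work from summary numbers already shown: the residual MS of .00673935 on 16 df, the round 1 mean of 0.562 (n = 3), and the two round-3 cell means of 0.575 (n = 3 each) in the adjusted data. The sketch below (Python, standard library only) compares two readings of the round 1 vs. round 3 contrast. The helper f_pvalue_1df is illustrative and integrates the Student t density numerically. Which reading each package actually uses is an assumption here, not something the output confirms.

```python
import math

MSE, DF = 0.00673935, 16          # residual MS and df from the ANOVA table
MEAN_R1, N_R1 = 0.562, 3          # round 1
MEAN_R3_CELL, N_R3_CELL = 0.575, 3  # each round-3 cell, adjusted data


def f_pvalue_1df(f_stat, df2, steps=2000):
    """Upper-tail p for F(1, df2), computed as a two-sided t test on df2 df.

    The t density is integrated from 0 to sqrt(F) with Simpson's rule
    (steps must be even).
    """
    t = math.sqrt(f_stat)
    const = math.gamma((df2 + 1) / 2) / (
        math.sqrt(df2 * math.pi) * math.gamma(df2 / 2)
    )

    def pdf(x):
        return const * (1 + x * x / df2) ** (-(df2 + 1) / 2)

    h = t / steps
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 1 - 2 * (total * h / 3)


diff_sq = (MEAN_R1 - MEAN_R3_CELL) ** 2

# Reading A (assumed): round 1 vs. a single round-3 cell
f_one_cell = diff_sq / (MSE * (1 / N_R1 + 1 / N_R3_CELL))
# Reading B (assumed): round 1 vs. the equal-weight average of both cells
f_average = diff_sq / (MSE * (1 / N_R1 + 0.25 / N_R3_CELL + 0.25 / N_R3_CELL))

p_one_cell = f_pvalue_1df(f_one_cell, DF)
p_average = f_pvalue_1df(f_average, DF)
print(f"one cell:       F = {f_one_cell:.4f}, p = {p_one_cell:.4f}")
print(f"cells averaged: F = {f_average:.4f}, p = {p_average:.4f}")
```

Under these assumptions the one-cell reading gives a p-value of about 0.849 and the equal-weight reading about 0.826, close to the 0.8487 and 0.8256 quoted above. The equal-weight average of the two round-3 cell means is 0.575 in both the raw and the adjusted data, which would also be consistent with SAS and JMP giving the same answer for both.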

I will speculate that the difference in answers is due to the unequal sample sizes and/or the cells with no data. But the question remains: which is correct?

Tom


-----------------------------------
Thomas J. Steichen
[email protected]
-----------------------------------


