Stata The Stata listserver
[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: Order of factors in anova affects results


From   Michael Ingre <[email protected]>
To   <[email protected]>
Subject   Re: st: Order of factors in anova affects results
Date   Thu, 02 Oct 2003 15:28:37 +0200

Thanks

I would like to thank Joseph Coveney and Ken Higbee for your quick reply and
your very appreciated help in understanding ANOVA in Stata. I also would
like to apologize for my late reply, our e-mail is very slow at the moment
(sometimes over 24h late!!!).

A few comments.

1) Yes, you were both correct in assuming that the subject factor was nested
within groups. The error term for groups should of course be subjects within
groups. Your suggestion works fine and yields the expected results.

This was a mistake of mine, and I have to blame (at least partly) my
previous work with statistical packages (SPSS and StatView) that to some
extent hides the underlying model when you specify it. You don't need to
explicitly specify the error term to use. And when using the GUI-menus all
you need to know is which factors are within and which are between.  This is
of course a way to make the statistics more user friendly. However, as a
consequence, it might widen the gap between theory and practice. It has for
me. I had to review my old literature on ANOVA in this case. Using Stata
seems to be a good way of closing the gap between theory and practice.

2) Yes, there is one missing case at one of the time-points. The design is
unbalanced. This is also a new situation for me. Typically, when using the
"wide" format (at least in SPSS type III SS) you lose the case (listwise
deletion). And this is how we usually treat unbalanced data.

The use of listwise deletion has of course it's own problems and in our
group it is sometimes handled with imputing data or averaging over several
time points (typically when there are few subjects and many time points).

Unbalanced designs are controversial. At least where I come from. I will
take a close look at the xt-procedures in Stata. They seem to bee very
useful in many of our studies. In this particular case though, I will use
the "old" model like this:

reshape wide dv ,i(sub) j(time)
generate dv_both = dv1 + dv2
reshape long dv ,i(sub) j(time)
anova dv grp / sub|grp time grp*time if dv_both < 11

3) Thanks to Stata corp and Ken Higbee for taking the time to explain in
detail how Stata handles the factors.  I at least,  will think both once and
twice before I specify an incomplete "vague" ANOVA and at least "thrice"
when trying to interpret the result.

Thanks again

Michael Ingre




Den 03-09-30 15.46, skrev "[email protected]" <[email protected]>:

> Joseph Coveney <[email protected]> correctly answers the
> ANOVA question asked by Michael Ingre <[email protected]>.
> 
>> Michael's description of the experimental design didn't
>> explicitly say so, but the behavior of Stata's -anova- upon
>> switching the variable order of the subjects and groups factors
>> in the command line indicates that subjects are nested within
>> groups.
>> 
>> If that's so, then the model specification would be groups,
>> subjects-within-groups as the error (denominator) mean square for
>> groups, then time, and finally the interaction of groups and
>> time:
>> 
>>   -anova dv grp / sub|grp time grp*time-.
>> 
>> There is a description of the syntax (including what forward
>> slashes, pipes and stars are for) in the user's manual for the
>> -anova- entry.  There's also a FAQ about repeated-measures anova
>> on StataCorp's website,
>> www.stata.com/support/faqs/stat/anova2.html .  I think that both
>> the user's manual and the FAQ have examples that use datasets
>> analogous to Michael's.
>> 
>> It appears that Michael's dataset has at least one missing value.
>> I think that this topic is broached in the user's manual in the
>> entry for -anova-.  If there is a concern over the imbalance,
>> there are alternatives to consider in this situation, such as
>> -xtreg , re-.
> 
> I would like to explain how Stata's -anova- handles the factors
> that are presented to it.  Assuming the default -partial-
> sums-of-squares, then *USUALLY* it makes no difference what order
> you specify the terms in the -anova-.
> 
> -anova- uses the sweep operator when computing the X'X inverse.
> Most ANOVA models are overparameterized, meaning that some rows
> and columns of the X'X inverse matrix will be zeroed out (for
> instance the X'X matrix might be of dimension 48, but there are
> only 20 estimable degrees of freedom in the model).  Which ones
> are zeroed out depends on what permutation (the order of the
> columns and rows) of the X'X matrix that is presented to the
> sweep operation.
> 
> What Stata does is place the factors in the X'X matrix with lower
> order terms appearing before higher order terms.  This guarantees
> that single terms get swept before 2-way interactions which get
> swept before 3-way interactions.
> 
> So, if I said
> 
>   . anova y a*b*c c a a*b b b*c a*c
> 
> (a strange way of ordering the terms) then Stata would (behind
> the scenes) create the X'X matrix with columns and rows in the
> order
> 
>   _cons c a b a*b b*c a*c a*b*c
> 
> It first takes the constant then each single term in the order
> found, then each two-way interaction in the order found, then
> each three-way interaction in the order found (only one in this
> case).  After computation, Stata would present the ANOVA table
> back in the order you specified.
> 
> In well specified ANOVA models, it does not matter how you order
> the terms.  Whether the single terms appear as
> 
>   c a b     or     a b c
> 
> or any other permutation does not matter.  The degrees of freedom
> for each of these single terms do not overlap.
> 
> Assuming Joseph's guess is correct concerning subj being nested
> in grp for Michael's ANOVA (and I think it is), then
> 
>   anova dv grp subj time time*grp
> 
> versus
> 
>   anova dv subj grp time time*grp
> 
> produces different results for subj and grp because in the first
> case Stata first sweeps grp then subject allowing grp to obtain
> all of its potential degrees of freedom before subject does.  In
> the second case, however, subj is swept first and since the
> degrees of freedom for grp are really a subset of the full
> potential number of degrees of freedom for subj, subj obtains all
> the degrees of freedom leaving none for grp.
> 
> Joseph shows that the correct way to specify nesting is with
> "subj|grp" which means "subj nested in grp"
> 
> With this specification, Stata treats "subj|grp" like a two-way
> interaction (i.e., it will be swept after the single terms).
> So, even if you did something strange like
> 
>   anova dv subj|grp grp ...
> 
> instead of
> 
>   anova dv grp subj|grp ...
> 
> Stata knows that grp gets swept first either way.
> 
> 
> I am about to make a long posting even longer.  Bail out now
> if not interested.
> 
> Many years ago I investigated what appeared to be some strange
> results that someone sent in.  They had specified a strange ANOVA
> in a couple of different packages.  All the packages reported
> something a little different.  Here is a slightly modified
> version of the answer I gave them.  I have obscured the identity
> of the other packages involved because my intent here is not to
> criticize other implementations of ANOVA, only to instruct on
> an ambiguity in the modeling language of ANOVA.
> 
> ... one of the simplest models that demonstrates the differences
> is
> 
>               anova y u*v u*w
> 
> If instead we used a model like:
> 
>               anova y u v w u*v u*w
> 
> then all stat packages will agree on the answer.
> 
> Before I dive into the example let me say the following:
> 
>    1) Not all anova models make sense in the real world of
>       research.  For instance what hypotheses are we trying to
>       test when we give an anova model like -anova y u*v u*w-?
>       Honest statisticians and researchers can give different
>       interpretations to this kind of model.  Who is right?  It
>       really points out a weakness in the modeling language of
>       ANOVA that these disagreements arise.
> 
>    2) <stat package X> has set some arbitrary rules that it
>       uses when presented with one of these "vague" anova model
>       statements.  Stata's output may or may not agree with
>       <stat package X> all of these cases --- but again we come
>       back to asking what is the real question the researcher
>       is trying to answer?  Are any of these strange anova
>       models getting to the "right" answer?  I would say that
>       it is hard to say what is the right answer since the
>       question is vague (i.e., the anova model statement has
>       built in vagueness).  The researcher needs to specify a
>       more exact anova model to state their research question.
>       i.e., give a fuller model and then jointly test those
>       terms that should be jointly tested given the underlying
>       research question.
> 
> This example will illustrate:
> 
> I start by listing the 16 observations I will use:
> 
>       y     u     v     w
> 1.   21     1     1     1
> 2.   56     1     1     1
> 3.   26     1     1     0
> 4.   95     1     1     0
> 5.   28     1     0     1
> 6.   12     1     0     1
> 7.   41     1     0     0
> 8.   72     1     0     0
> 9.   87     0     1     1
> 10.   46     0     1     1
> 11.   42     0     1     0
> 12.   89     0     1     0
> 13.    6     0     0     1
> 14.   68     0     0     1
> 15.   72     0     0     0
> 16.   70     0     0     0
> 
> The y is arbitrary --- I just threw in some numbers so we could
> compare.
> 
> So what does Stata do with this data and a model like -anova y
> u*v u*w-
> 
> . anova y u*v u*w
> 
>              Number of obs =      16     R-squared     =  0.3191
>              Root MSE      = 28.6149     Adj R-squared = -0.0214
> 
>     Source |  Partial SS    df       MS           F     Prob > F
> -----------+----------------------------------------------------
>      Model |   3836.8125     5    767.3625       0.94     0.4972
>            |
>        u*v |    1553.625     3     517.875       0.63     0.6107
>        u*w |    2255.625     2   1127.8125       1.38     0.2962
>            |
>   Residual |    8188.125    10    818.8125
> -----------+----------------------------------------------------
>      Total |  12024.9375    15    801.6625
> 
> and then notice what happens when I switch u*v and u*w around.
> 
> . anova y u*w u*v
> 
>              Number of obs =      16     R-squared     =  0.3191
>              Root MSE      = 28.6149     Adj R-squared = -0.0214
> 
>     Source |  Partial SS    df       MS           F     Prob > F
> -----------+----------------------------------------------------
>      Model |   3836.8125     5    767.3625       0.94     0.4972
>            |
>        u*w |    2800.125     3     933.375       1.14     0.3796
>        u*v |     541.125     2    270.5625       0.33     0.7262
>            |
>   Residual |    8188.125    10    818.8125
> -----------+----------------------------------------------------
>      Total |  12024.9375    15    801.6625
> 
> So Stata decided in one case to give u*v 3 df and u*w only 2 df
> and then in the other case switched it around.  Why would Stata
> do this?  Before I answer that question let me show what <stat
> package X> produces [output edited to just show important part]
> 
>   Source  DF    Type I SS
>   U*V     3     1581.1875
>   U*W     2     2255.625
> 
>   Source  DF    Type III SS
>   U*V     2      541.125
>   U*W     2     2255.625
> 
> Notice that they lose a degree of freedom (and correspondign SS)
> for their type III SS.  Where did that degree of freedom go?
> 
> Here is the answer: With this model it is as if behind the scenes
> they added the term "U" into the model and then compute that
> model instead of the one specified directly by the user.  For
> instance it is exactly like they ran -anova y u u*v u*w- (using
> Stata notation) and then simply delete the line corresponding to
> "u" from the output for the type III sums of squares.
> 
> [output from <stat package X> demonstrating this omitted]
> 
> So what is Stata doing?  It assigns 1 df to the constant.  It
> then sees that it can assign 3 df to U*V.  What is left is 2 dfs
> for U*W.
> 
> [more detailed Stata output omitted -- in particular pointing out
> that -anova, regress- will replay the results and show the
> underlying regression]
> 
> Okay, so if I don't like what Stata does with one of these
> "vague" anova models and I want to get the results that <stat
> package X> gets how would I do it?  We run a more complete model
> and then delete/ignore some of the output lines.
> 
>   . anova y u u*v u*w
> 
> [stata output deleted]
> 
> Notice that from this more complete anova model we can produce
> the results that Stata decided to produce when given the "vague"
> anova model.
> 
>   . test u u*v
> 
> [stata output deleted]
> 
> Likewise we could have asked for  -test u u*w-.
> 
> So what answers are right and what is wrong?  There is no right
> or wrong in this case.  There are lots of ways to interpret
> -anova y u*v u*w-.  We (Stata) decided that dropping the degree
> of freedom (and associated SS) like <stat package X> was not what
> we wanted to do.  Is what we do better? No.  Just different.
> 
> 
> Ken Higbee    [email protected]
> StataCorp     1-800-STATAPC
> 
> *
> *   For searches and help try:
> *   http://www.stata.com/support/faqs/res/findit.html
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
> 

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2024 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index