Stata The Stata listserver
[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: Re: st: Order of factors in anova affects results


From   khigbee@stata.com
To   statalist@hsphsun2.harvard.edu
Subject   Re: Re: st: Order of factors in anova affects results
Date   Tue, 30 Sep 2003 08:46:44 -0500

Joseph Coveney <jcoveney@bigplanet.com> correctly answers the
ANOVA question asked by Michael Ingre <Michael.Ingre@ipm.ki.se>.

> Michael's description of the experimental design didn't
> explicitly say so, but the behavior of Stata's -anova- upon
> switching the variable order of the subjects and groups factors
> in the command line indicates that subjects are nested within
> groups.
> 
> If that's so, then the model specification would be groups,
> subjects-within-groups as the error (denominator) mean square for
> groups, then time, and finally the interaction of groups and
> time:
> 
>   -anova dv grp / sub|grp time grp*time-.
> 
> There is a description of the syntax (including what forward
> slashes, pipes and stars are for) in the user's manual for the
> -anova- entry.  There's also a FAQ about repeated-measures anova
> on StataCorp's website,
> www.stata.com/support/faqs/stat/anova2.html .  I think that both
> the user's manual and the FAQ have examples that use datasets
> analogous to Michael's.
> 
> It appears that Michael's dataset has at least one missing value.
> I think that this topic is broached in the user's manual in the
> entry for -anova-.  If there is a concern over the imbalance,
> there are alternatives to consider in this situation, such as
> -xtreg , re-.

I would like to explain how Stata's -anova- handles the factors
that are presented to it.  Assuming the default -partial-
sums-of-squares, then *USUALLY* it makes no difference what order
you specify the terms in the -anova-.

-anova- uses the sweep operator when computing the X'X inverse.
Most ANOVA models are overparameterized, meaning that some rows
and columns of the X'X inverse matrix will be zeroed out (for
instance the X'X matrix might be of dimension 48, but there are
only 20 estimable degrees of freedom in the model).  Which ones
are zeroed out depends on what permutation (the order of the
columns and rows) of the X'X matrix that is presented to the
sweep operation.

What Stata does is place the factors in the X'X matrix with lower
order terms appearing before higher order terms.  This guarantees
that single terms get swept before 2-way interactions which get
swept before 3-way interactions.

So, if I said

    . anova y a*b*c c a a*b b b*c a*c

(a strange way of ordering the terms) then Stata would (behind
the scenes) create the X'X matrix with columns and rows in the
order

    _cons c a b a*b b*c a*c a*b*c

It first takes the constant then each single term in the order
found, then each two-way interaction in the order found, then
each three-way interaction in the order found (only one in this
case).  After computation, Stata would present the ANOVA table
back in the order you specified.

In well specified ANOVA models, it does not matter how you order
the terms.  Whether the single terms appear as

    c a b     or     a b c

or any other permutation does not matter.  The degrees of freedom
for each of these single terms do not overlap.

Assuming Joseph's guess is correct concerning subj being nested
in grp for Michael's ANOVA (and I think it is), then

    anova dv grp subj time time*grp

versus

    anova dv subj grp time time*grp

produces different results for subj and grp because in the first
case Stata first sweeps grp then subject allowing grp to obtain
all of its potential degrees of freedom before subject does.  In
the second case, however, subj is swept first and since the
degrees of freedom for grp are really a subset of the full
potential number of degrees of freedom for subj, subj obtains all
the degrees of freedom leaving none for grp.

Joseph shows that the correct way to specify nesting is with
"subj|grp" which means "subj nested in grp"

With this specification, Stata treats "subj|grp" like a two-way
interaction (i.e., it will be swept after the single terms).
So, even if you did something strange like

    anova dv subj|grp grp ...

instead of

    anova dv grp subj|grp ...

Stata knows that grp gets swept first either way.


I am about to make a long posting even longer.  Bail out now
if not interested.

Many years ago I investigated what appeared to be some strange
results that someone sent in.  They had specified a strange ANOVA
in a couple of different packages.  All the packages reported
something a little different.  Here is a slightly modified
version of the answer I gave them.  I have obscured the identity
of the other packages involved because my intent here is not to
criticize other implementations of ANOVA, only to instruct on
an ambiguity in the modeling language of ANOVA.

... one of the simplest models that demonstrates the differences
is

                anova y u*v u*w

If instead we used a model like:

                anova y u v w u*v u*w

then all stat packages will agree on the answer.

Before I dive into the example let me say the following:

     1) Not all anova models make sense in the real world of
        research.  For instance what hypotheses are we trying to
        test when we give an anova model like -anova y u*v u*w-?
        Honest statisticians and researchers can give different
        interpretations to this kind of model.  Who is right?  It
        really points out a weakness in the modeling language of
        ANOVA that these disagreements arise.

     2) <stat package X> has set some arbitrary rules that it
        uses when presented with one of these "vague" anova model
        statements.  Stata's output may or may not agree with
        <stat package X> all of these cases --- but again we come
        back to asking what is the real question the researcher
        is trying to answer?  Are any of these strange anova
        models getting to the "right" answer?  I would say that
        it is hard to say what is the right answer since the
        question is vague (i.e., the anova model statement has
        built in vagueness).  The researcher needs to specify a
        more exact anova model to state their research question.
        i.e., give a fuller model and then jointly test those
        terms that should be jointly tested given the underlying
        research question.

This example will illustrate:

I start by listing the 16 observations I will use:

        y     u     v     w
  1.   21     1     1     1
  2.   56     1     1     1
  3.   26     1     1     0
  4.   95     1     1     0
  5.   28     1     0     1
  6.   12     1     0     1
  7.   41     1     0     0
  8.   72     1     0     0
  9.   87     0     1     1
 10.   46     0     1     1
 11.   42     0     1     0
 12.   89     0     1     0
 13.    6     0     0     1
 14.   68     0     0     1
 15.   72     0     0     0
 16.   70     0     0     0

The y is arbitrary --- I just threw in some numbers so we could
compare.

So what does Stata do with this data and a model like -anova y
u*v u*w-

. anova y u*v u*w

               Number of obs =      16     R-squared     =  0.3191
               Root MSE      = 28.6149     Adj R-squared = -0.0214

      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |   3836.8125     5    767.3625       0.94     0.4972
             |
         u*v |    1553.625     3     517.875       0.63     0.6107
         u*w |    2255.625     2   1127.8125       1.38     0.2962
             |
    Residual |    8188.125    10    818.8125
  -----------+----------------------------------------------------
       Total |  12024.9375    15    801.6625

and then notice what happens when I switch u*v and u*w around.

. anova y u*w u*v

               Number of obs =      16     R-squared     =  0.3191
               Root MSE      = 28.6149     Adj R-squared = -0.0214

      Source |  Partial SS    df       MS           F     Prob > F
  -----------+----------------------------------------------------
       Model |   3836.8125     5    767.3625       0.94     0.4972
             |
         u*w |    2800.125     3     933.375       1.14     0.3796
         u*v |     541.125     2    270.5625       0.33     0.7262
             |
    Residual |    8188.125    10    818.8125
  -----------+----------------------------------------------------
       Total |  12024.9375    15    801.6625

So Stata decided in one case to give u*v 3 df and u*w only 2 df
and then in the other case switched it around.  Why would Stata
do this?  Before I answer that question let me show what <stat
package X> produces [output edited to just show important part]

    Source  DF    Type I SS
    U*V     3     1581.1875
    U*W     2     2255.625
 
    Source  DF    Type III SS
    U*V     2      541.125
    U*W     2     2255.625
 
Notice that they lose a degree of freedom (and correspondign SS)
for their type III SS.  Where did that degree of freedom go?

Here is the answer: With this model it is as if behind the scenes
they added the term "U" into the model and then compute that
model instead of the one specified directly by the user.  For
instance it is exactly like they ran -anova y u u*v u*w- (using
Stata notation) and then simply delete the line corresponding to
"u" from the output for the type III sums of squares.

[output from <stat package X> demonstrating this omitted]

So what is Stata doing?  It assigns 1 df to the constant.  It
then sees that it can assign 3 df to U*V.  What is left is 2 dfs
for U*W.

[more detailed Stata output omitted -- in particular pointing out
that -anova, regress- will replay the results and show the
underlying regression]

Okay, so if I don't like what Stata does with one of these
"vague" anova models and I want to get the results that <stat
package X> gets how would I do it?  We run a more complete model
and then delete/ignore some of the output lines.

    . anova y u u*v u*w

[stata output deleted]

Notice that from this more complete anova model we can produce
the results that Stata decided to produce when given the "vague"
anova model.

    . test u u*v

[stata output deleted]

Likewise we could have asked for  -test u u*w-.

So what answers are right and what is wrong?  There is no right
or wrong in this case.  There are lots of ways to interpret
-anova y u*v u*w-.  We (Stata) decided that dropping the degree
of freedom (and associated SS) like <stat package X> was not what
we wanted to do.  Is what we do better? No.  Just different.


Ken Higbee    khigbee@stata.com
StataCorp     1-800-STATAPC

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2022 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index