How can I form various tests comparing the different levels of a
categorical variable after anova or regress?

Title
Tests comparing levels of a categorical variable after anova or regress

Author
Kenneth Higbee, StataCorp
Wesley Eddings, StataCorp

Date
October 1999; updated August 2009

Note: This FAQ is for Stata 11. In Stata 12, you can use contrast and
pwcompare to compare the levels of categorical variables.
Introduction
Often researchers want to test for differences between levels of a factor
(categorical variable) or factors after running an
anova or
regress
command. For instance, with one factor the questions might be
- Is level one different from level two?
- Is level one different from level three?
There are often other kinds of tests between levels of a factor that are
also of interest. For instance, the questions might be of the form:
- Is level one different from the average of levels two through four?
- Is level two different from the average of levels three and four?
- Is level three different from level four?
Many other interesting questions like these can be asked after an estimation
command involving a categorical variable (factor).
The test
command is one tool to use in answering these questions. There are several
variations of the test syntax, depending on whether you wish to test
coefficients, expressions, terms (after anova), or several
coefficients at the same time. One form of the test command that we
will use is
test [exp = exp] [, accumulate notest ]
Details can be found in the Base Reference Manual ([R] test and, in
the case of anova, the sections in [R] anova postestimation
that discuss testing).
One categorical factor
How you parameterize your model makes a difference in what you do to perform
your tests. We will examine a couple of ways of parameterizing a simple
one-way ANOVA model. We will use the following three approaches:
- regress with the noconstant option
- regress leaving the constant in the model
- anova
With each of these approaches, we will show how to use the test
command to obtain tests of interest. In particular, we will
- test level one against level two;
- test the average of levels one and two against level three; and
- test that five times level one plus four times level two minus three
times level three equals two times level four (a strange linear
combination just to show that it can be done).
Here is a 20-observation dataset with two variables—the outcome y
and a categorical variable x with four levels.
. list, noobs sepby(x)
+-------+
| x y |
|-------|
| 1 7 |
| 1 5 |
| 1 3 |
| 1 4 |
| 1 3 |
|-------|
| 2 5 |
| 2 3 |
| 2 5 |
| 2 3 |
| 2 1 |
|-------|
| 3 6 |
| 3 8 |
| 3 6 |
| 3 4 |
| 3 4 |
|-------|
| 4 5 |
| 4 8 |
| 4 6 |
| 4 8 |
| 4 5 |
+-------+
. table x, c(mean y)
----------------------
x | mean(y)
----------+-----------
1 | 4.4
2 | 3.4
3 | 5.6
4 | 6.4
----------------------
Regression excluding the intercept
Here is the regression excluding the intercept. We specify ibn.x so
there will be no base category: all four levels of x will be
included in the model. (See
help fvvarlist.)
. regress y ibn.x, noconstant
Source | SS df MS Number of obs = 20
-------------+------------------------------ F( 4, 16) = 48.24
Model | 516.2 4 129.05 Prob > F = 0.0000
Residual | 42.8 16 2.675 R-squared = 0.9234
-------------+------------------------------ Adj R-squared = 0.9043
Total | 559 20 27.95 Root MSE = 1.6355
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x |
1 | 4.4 .7314369 6.02 0.000 2.849423 5.950577
2 | 3.4 .7314369 4.65 0.000 1.849423 4.950577
3 | 5.6 .7314369 7.66 0.000 4.049423 7.150577
4 | 6.4 .7314369 8.75 0.000 4.849423 7.950577
------------------------------------------------------------------------------
The coefficients agree with the means reported by table. These
coefficients are easily interpreted and easily tested. Here are the three
tests after this regression:
- Test level one against level two
. test i1.x == i2.x
( 1) 1bn.x - 2.x = 0
F( 1, 16) = 0.93
Prob > F = 0.3481
- Test the average of levels one and two against level three
. test 0.5*i1.x + 0.5*i2.x == i3.x
( 1) .5*1bn.x + .5*2.x - 3.x = 0
F( 1, 16) = 3.60
Prob > F = 0.0759
- Test that five times level one plus four times level two minus three
times level three equals two times level four
. test 5*i1.x + 4*i2.x - 3*i3.x == 2*i4.x
( 1) 5*1bn.x + 4*2.x - 3*3.x - 2*4.x = 0
F( 1, 16) = 1.25
Prob > F = 0.2808
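Because the no-constant coefficients are just the group means, each of these
tests is a linear contrast of means. As a cross-check outside Stata, here is
a small NumPy sketch (an illustration of the arithmetic, not Stata's code)
that reproduces the first and third F statistics:

```python
import numpy as np

# The 20-observation dataset from the FAQ: outcome y, factor x with 4 levels.
y = np.array([7, 5, 3, 4, 3,   # x == 1
              5, 3, 5, 3, 1,   # x == 2
              6, 8, 6, 4, 4,   # x == 3
              5, 8, 6, 8, 5])  # x == 4
x = np.repeat([1, 2, 3, 4], 5)

# Design matrix of the no-constant regression: one dummy per level.
X = (x[:, None] == np.array([1, 2, 3, 4])).astype(float)

b = np.linalg.lstsq(X, y, rcond=None)[0]   # coefficients = group means
resid = y - X @ b
df_r = len(y) - X.shape[1]                 # 16 residual degrees of freedom
s2 = resid @ resid / df_r                  # residual MS = 2.675

def contrast_F(c):
    """Single degree-of-freedom F test of c'b = 0."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return (c @ b) ** 2 / (c @ XtX_inv @ c * s2)

F1 = contrast_F(np.array([1, -1, 0, 0]))   # level 1 vs level 2:  F = 0.93
F3 = contrast_F(np.array([5, 4, -3, -2]))  # the strange combo:   F = 1.25
```

The same function reproduces any of the single degree-of-freedom tests in
this section once you write the hypothesis as a weight vector on the means.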
Any of several other strange and/or wonderful tests could be performed. If
you wish to test a nonlinear expression, you will want to look at
testnl (see
[R] testnl). If you want to jointly test two or more of these single
degree-of-freedom tests, you can use the accumulate option of
test. Another
useful command is
lincom (see
[R] lincom). It can test many of the same hypotheses as the
test command, with the benefit of reporting not only the test result
but also the estimate of the linear combination, its standard error, and a
confidence interval. Just as an example,
here is the third test done using lincom.
. lincom 5*i1.x + 4*i2.x - 3*i3.x - 2*i4.x
( 1) 5*1bn.x + 4*2.x - 3*3.x - 2*4.x = 0
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
(1) | 6 5.374942 1.12 0.281 -5.394368 17.39437
------------------------------------------------------------------------------
Regression with the intercept
To fit a model with a constant, we need to remove the noconstant option from
our regress
command. We also need to specify a base category, because we cannot
include all four levels of x and an intercept, too. It would be
possible to specify any level of x as the base category, but
we will just change ibn.x to i.x and let Stata select the
first level of x as the default base.
. regress y i.x
Source | SS df MS Number of obs = 20
-------------+------------------------------ F( 3, 16) = 3.26
Model | 26.15 3 8.71666667 Prob > F = 0.0492
Residual | 42.8 16 2.675 R-squared = 0.3793
-------------+------------------------------ Adj R-squared = 0.2629
Total | 68.95 19 3.62894737 Root MSE = 1.6355
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x |
2 | -1 1.034408 -0.97 0.348 -3.192847 1.192847
3 | 1.2 1.034408 1.16 0.263 -.9928471 3.392847
4 | 2 1.034408 1.93 0.071 -.1928471 4.192847
|
_cons | 4.4 .7314369 6.02 0.000 2.849423 5.950577
------------------------------------------------------------------------------
If you compare the coefficients from this regression with those from the
regression without a constant (and with the table showing the mean for each
category), you see that the mean of the base category (one) corresponds
to the coefficient for the constant and that adding the coefficient for the
constant to each of the other category coefficients gives the mean for that
category. The coefficients for the included levels are relative to the
base level.
This result provides a clue on how to obtain our three example tests of
interest when the constant is in the model. It is a matter of doing some
simple algebra to take a test from the first regression model and produce an
equivalent test in this regression model. For reference, here is the
correspondence between the two regression models:
  No constant    With constant
  ----------------------------
  i1.x           _cons
  i2.x           _cons + i2.x
  i3.x           _cons + i3.x
  i4.x           _cons + i4.x
After our first regression (without a constant), the first test was test
i1.x == i2.x. After this latest regression (with a constant), the same
test is test _cons = _cons + i2.x. This test can be simplified to
test i2.x. The equivalent second and third tests can be similarly
determined. Here are the three tests after regress with the constant
included:
- Test level one against level two
. test i2.x
( 1) 2.x = 0
F( 1, 16) = 0.93
Prob > F = 0.3481
- Test the average of levels one and two against level three
. test 0.5*i2.x == i3.x
( 1) .5*2.x - 3.x = 0
F( 1, 16) = 3.60
Prob > F = 0.0759
- Test that five times level one plus four times level two minus three
times level three equals two times level four
. test 4*_cons + 4*i2.x - 3*i3.x == 2*i4.x
( 1) 4*2.x - 3*3.x - 2*4.x + 4*_cons = 0
F( 1, 16) = 1.25
Prob > F = 0.2808
When you have the constant in the model (and in more complicated designs),
it is important to understand how to interpret the coefficients from the
regression model so that you can form the correct tests. The formation of
the tests above is not very intuitive, but it becomes clearer after you do
some algebra, starting from the meaning of each coefficient and how the
coefficients relate to the quantities you wish to test.
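That algebra is a matrix product. In the sketch below (assuming the
coefficient order _cons, 2.x, 3.x, 4.x), each group mean is a fixed linear
combination of the with-constant coefficients, so a contrast c on the means
becomes c @ A on the coefficients:

```python
import numpy as np

# Means as functions of theta = (_cons, 2.x, 3.x, 4.x): m_i = row_i(A) @ theta.
A = np.array([[1, 0, 0, 0],    # mean of level 1 = _cons
              [1, 1, 0, 0],    # mean of level 2 = _cons + 2.x
              [1, 0, 1, 0],    # mean of level 3 = _cons + 3.x
              [1, 0, 0, 1]])   # mean of level 4 = _cons + 4.x

# First test, written on the means: m1 - m2 = 0.
c1_theta = np.array([1, -1, 0, 0]) @ A
# -> (0, -1, 0, 0): testing -1*2.x = 0, i.e., "test i2.x".

# Third test, written on the means: 5*m1 + 4*m2 - 3*m3 - 2*m4 = 0.
c3_theta = np.array([5, 4, -3, -2]) @ A
# -> (4, 4, -3, -2): 4*_cons + 4*2.x - 3*3.x - 2*4.x, as tested above.
```

This is exactly the substitution carried out by hand in the correspondence
table: replace each mean by its expression in the with-constant coefficients
and collect terms.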
ANOVA
The anova command is a natural choice for analyzing this same dataset.
. anova y x
Number of obs = 20 R-squared = 0.3793
Root MSE = 1.63554 Adj R-squared = 0.2629
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
Model | 26.15 3 8.71666667 3.26 0.0492
|
x | 26.15 3 8.71666667 3.26 0.0492
|
Residual | 42.8 16 2.675
-----------+----------------------------------------------------
Total | 68.95 19 3.62894737
We will examine individual degree-of-freedom tests shortly.
First, we will take a look at the underlying regression for this
anova.
. regress
Source | SS df MS Number of obs = 20
-------------+------------------------------ F( 3, 16) = 3.26
Model | 26.15 3 8.71666667 Prob > F = 0.0492
Residual | 42.8 16 2.675 R-squared = 0.3793
-------------+------------------------------ Adj R-squared = 0.2629
Total | 68.95 19 3.62894737 Root MSE = 1.6355
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x |
2 | -1 1.034408 -0.97 0.348 -3.192847 1.192847
3 | 1.2 1.034408 1.16 0.263 -.9928471 3.392847
4 | 2 1.034408 1.93 0.071 -.1928471 4.192847
|
_cons | 4.4 .7314369 6.02 0.000 2.849423 5.950577
------------------------------------------------------------------------------
anova used the first level of x as the base category, so the
output matches the output of our command regress y i.x. All the
same test commands will work after anova just as they worked after
regress. The lincom command can also be used after
anova.
The test() option of test after anova provides a
convenient shorthand for specifying these kinds of tests with a matrix.
test, showorder lists the order of the terms,
and hence the order of columns for the matrix.
. test, showorder
Order of columns in the design matrix
1: (x==1)
2: (x==2)
3: (x==3)
4: (x==4)
5: _cons
Here are the same three tests using the test() option.
- Test level one against level two
. mat c1 = (0,1,0,0,0)
. test, test(c1)
( 1) 2.x = 0
F( 1, 16) = 0.93
Prob > F = 0.3481
- Test the average of levels one and two against level three
. mat c2 = (0,0.5,-1,0,0)
. test, test(c2)
( 1) .5*2.x - 3.x = 0
F( 1, 16) = 3.60
Prob > F = 0.0759
- Test that five times level one plus four times level two minus three
times level three equals two times level four
. mat c3 = (0,4,-3,-2,4)
. test, test(c3)
( 1) 4*2.x - 3*3.x - 2*4.x + 4*_cons = 0
F( 1, 16) = 1.25
Prob > F = 0.2808
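Each row of these matrices is just a set of weights on the group means. A
short Python sketch of the second test (an illustration of the arithmetic,
not Stata's implementation):

```python
# Group means and residual MS from the output above.
means = [4.4, 3.4, 5.6, 6.4]
w = [0.5, 0.5, -1.0, 0.0]      # weights: average of levels 1-2 vs level 3
n_per_group, s2 = 5, 2.675     # cell size and residual MS (df = 16)

# Estimated contrast and its F statistic: Var(sum w_i*mbar_i) = sum(w_i^2)*s2/n.
est = sum(wi * mi for wi, mi in zip(w, means))                 # = -1.7
F = est ** 2 / (sum(wi ** 2 for wi in w) * s2 / n_per_group)   # = 3.60
```

The variance formula assumes a balanced design (equal cell sizes), which
holds for this dataset.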
Two categorical factors
When there are two (or more) categorical factors in our model, we again may
want to test various single degrees-of-freedom hypotheses that compare
various levels of the two (or more) factors in the model. As with the
one-way ANOVA model, how you parameterize your two-way (or higher) model
affects how you go about performing individual tests.
To demonstrate how to obtain single degrees-of-freedom tests after a two-way
ANOVA, we will use the following 24-observation dataset where the variables
a and b are categorical variables with 4 and 3 levels,
respectively, and there is a response variable, y.
. list, noobs sepby(a)
+------------+
| a b y |
|------------|
| 1 1 26 |
| 1 1 30 |
| 1 2 54 |
| 1 2 50 |
| 1 3 34 |
| 1 3 46 |
|------------|
| 2 1 16 |
| 2 1 20 |
| 2 2 36 |
| 2 2 24 |
| 2 3 50 |
| 2 3 34 |
|------------|
| 3 1 48 |
| 3 1 28 |
| 3 2 28 |
| 3 2 28 |
| 3 3 50 |
| 3 3 46 |
|------------|
| 4 1 50 |
| 4 1 46 |
| 4 2 48 |
| 4 2 44 |
| 4 3 48 |
| 4 3 28 |
+------------+
The following table shows the mean of y for each cell of a by
b as well as the means for each level of a and b (the
column and row titled “Total”):
. table a b, c(mean y) row col
--------------------------------------
| b
a | 1 2 3 Total
----------+---------------------------
1 | 28 52 40 40
2 | 18 30 42 30
3 | 38 28 48 38
4 | 48 46 38 44
|
Total | 33 39 42 38
--------------------------------------
This dataset is balanced (two observations per cell). If you are dealing
with unbalanced data (including the case where you have missing cells), you
will also want to read the technical notes in the section titled Two-way
ANOVA in the [R] anova manual entry.
The standard way of performing an ANOVA on this dataset is with
. anova y a b a#b
Number of obs = 24 R-squared = 0.7606
Root MSE = 7.74597 Adj R-squared = 0.5412
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
Model | 2288 11 208 3.47 0.0214
|
a | 624 3 208 3.47 0.0509
b | 336 2 168 2.80 0.1005
a#b | 1328 6 221.333333 3.69 0.0259
|
Residual | 720 12 60
-----------+----------------------------------------------------
Total | 3008 23 130.782609
This is the overparameterized two-way ANOVA model (which we will discuss in
more detail later). When it comes to single degrees-of-freedom tests, some
people prefer to use a different parameterization—the cell means
model.
Cell means ANOVA model
In the cell means ANOVA model, we first create one categorical variable that
corresponds to the cells in the two-way table (or higher-order table if more
than two categorical variables are involved). The
egen
group() function is useful for creating the single categorical
variable.
. egen c = group(a b)
. table a b, c(mean c)
----------------------------
| b
a | 1 2 3
----------+-----------------
1 | 1 2 3
2 | 4 5 6
3 | 7 8 9
4 | 10 11 12
----------------------------
The table above reminds us how the c variable relates to the original
a and b variables. For instance, when c is 8, it means
that a is 3 and b is 2. (If a and b had been
reversed in the egen group() option, then the table above
would show a different relationship.)
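For this dataset, where a and b are coded 1, 2, ... with no gaps and every
combination occurs, the index that egen assigns can be sketched directly
(egen's group() function is more general; it ranks the distinct
combinations in sort order):

```python
# A minimal sketch of what egen c = group(a b) computes for consecutively
# coded factors with a running slowest and b (3 levels) running fastest.
def group_index(a, b, b_levels=3):
    return (a - 1) * b_levels + b

# group_index(3, 2) returns 8: when c is 8, a is 3 and b is 2,
# matching the table above.
```

Reversing a and b in the call would number the cells differently, just as
reversing them in the egen group() option would.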
The cell means ANOVA model is then obtained by using the noconstant
option of anova and the newly created c variable in place of
a and b.
(You could also fit the model by typing regress y ibn.c, noconstant.)
. anova y ibn.c, noconstant
Number of obs = 24 R-squared = 0.9809
Root MSE = 7.74597 Adj R-squared = 0.9618
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
Model | 36944 12 3078.66667 51.31 0.0000
|
c | 36944 12 3078.66667 51.31 0.0000
|
Residual | 720 12 60
-----------+----------------------------------------------------
Total | 37664 24 1569.33333
. regress
Source | SS df MS Number of obs = 24
-------------+------------------------------ F( 12, 12) = 51.31
Model | 36944 12 3078.66667 Prob > F = 0.0000
Residual | 720 12 60 R-squared = 0.9809
-------------+------------------------------ Adj R-squared = 0.9618
Total | 37664 24 1569.33333 Root MSE = 7.746
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
c |
1 | 28 5.477226 5.11 0.000 16.06615 39.93385
2 | 52 5.477226 9.49 0.000 40.06615 63.93385
3 | 40 5.477226 7.30 0.000 28.06615 51.93385
4 | 18 5.477226 3.29 0.007 6.066151 29.93385
5 | 30 5.477226 5.48 0.000 18.06615 41.93385
6 | 42 5.477226 7.67 0.000 30.06615 53.93385
7 | 38 5.477226 6.94 0.000 26.06615 49.93385
8 | 28 5.477226 5.11 0.000 16.06615 39.93385
9 | 48 5.477226 8.76 0.000 36.06615 59.93385
10 | 48 5.477226 8.76 0.000 36.06615 59.93385
11 | 46 5.477226 8.40 0.000 34.06615 57.93385
12 | 38 5.477226 6.94 0.000 26.06615 49.93385
------------------------------------------------------------------------------
Compare the 12 coefficients for c in the table above with the table
of means presented earlier. The coefficients from the cell means ANOVA
model are the cell means from the two-way table. This correspondence makes
creating meaningful test statements easy.
You can recreate an F test from the overparameterized ANOVA model by
using appropriate combinations of single degree-of-freedom tests after the
cell means ANOVA model. For example, the test for the term a with 3
degrees of freedom can be obtained by accumulating 3 single
degree-of-freedom tests. Below I combine the tests of level 1 versus 2,
level 1 versus 3, and level 1 versus 4 of the a variable. (Remember
that c 1, 2, and 3 correspond to level 1 of a; c 4, 5,
and 6 correspond to level 2 of a; and so on.)
. mat amat = (1,1,1,-1,-1,-1,0,0,0,0,0,0 \
> 1,1,1,0,0,0,-1,-1,-1,0,0,0 \
> 1,1,1,0,0,0,0,0,0,-1,-1,-1)
. test, test(amat)
( 1) 1bn.c + 2.c + 3.c - 4.c - 5.c - 6.c = 0
( 2) 1bn.c + 2.c + 3.c - 7.c - 8.c - 9.c = 0
( 3) 1bn.c + 2.c + 3.c - 10.c - 11.c - 12.c = 0
F( 3, 12) = 3.47
Prob > F = 0.0509
This F of 3.47 agrees with the F test for the term a in
the overparameterized ANOVA presented earlier. Of course, it is easier to
obtain the test of a term like a by running the overparameterized
model. I just wanted to show that you could also obtain the result with a
little work starting from the cell means model.
For the sake of completeness, here is the test command to produce the
F test for the b term.
. mat bmat = (1,-1,0,1,-1,0,1,-1,0,1,-1,0 \
> 1,0,-1,1,0,-1,1,0,-1,1,0,-1)
. test, test(bmat)
( 1) 1bn.c - 2.c + 4.c - 5.c + 7.c - 8.c + 10.c - 11.c = 0
( 2) 1bn.c - 3.c + 4.c - 6.c + 7.c - 9.c + 10.c - 12.c = 0
F( 2, 12) = 2.80
Prob > F = 0.1005
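The accumulation behind these joint tests is ordinary matrix arithmetic. The
NumPy sketch below (an illustration, not Stata's code) reproduces both F
statistics from the cell means, using the fact that for a balanced design
with n observations per cell, Var(L m-hat) = (s2/n) L L':

```python
import numpy as np

# Cell means in c order (a runs slowest): (1,1),(1,2),(1,3),(2,1),...,(4,3).
m = np.array([28, 52, 40, 18, 30, 42, 38, 28, 48, 48, 46, 38], dtype=float)
n, s2 = 2, 60.0        # observations per cell; residual MS (df = 12)

def joint_F(L):
    """Joint F for L @ m = 0 with q = rows(L) degrees of freedom."""
    L = np.asarray(L, dtype=float)
    v = L @ m                      # estimated contrasts
    V = L @ L.T * s2 / n           # their covariance matrix
    return v @ np.linalg.solve(V, v) / L.shape[0]

amat = [[1, 1, 1, -1, -1, -1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 0, 0, 0, -1, -1, -1, 0, 0, 0],
        [1, 1, 1, 0, 0, 0, 0, 0, 0, -1, -1, -1]]
bmat = [[1, -1, 0, 1, -1, 0, 1, -1, 0, 1, -1, 0],
        [1, 0, -1, 1, 0, -1, 1, 0, -1, 1, 0, -1]]

F_a = joint_F(amat)    # 3.47, the F for term a
F_b = joint_F(bmat)    # 2.80, the F for term b
```

The balanced-design variance formula is what makes this short; with
unbalanced data you would need the full (X'X) inverse instead.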
Here is the test command to produce the F test for the
a by b interaction term.
. forvalues i = 1/3 {
2. forvalues j = 1/2 {
3. mat abmat = nullmat(abmat) \ vecdiag(amat[`i',1...]'*bmat[`j',1...])
4. }
5. }
. mat list abmat
abmat[6,12]
c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12
r1 1 -1 0 -1 1 0 0 0 0 0 0 0
r2 1 0 -1 -1 0 1 0 0 0 0 0 0
r3 1 -1 0 0 0 0 -1 1 0 0 0 0
r4 1 0 -1 0 0 0 -1 0 1 0 0 0
r5 1 -1 0 0 0 0 0 0 0 -1 1 0
r6 1 0 -1 0 0 0 0 0 0 -1 0 1
. test, test(abmat)
( 1) 1bn.c - 2.c - 4.c + 5.c = 0
( 2) 1bn.c - 3.c - 4.c + 6.c = 0
( 3) 1bn.c - 2.c - 7.c + 8.c = 0
( 4) 1bn.c - 3.c - 7.c + 9.c = 0
( 5) 1bn.c - 2.c - 10.c + 11.c = 0
( 6) 1bn.c - 3.c - 10.c + 12.c = 0
F( 6, 12) = 3.69
Prob > F = 0.0259
There are actually many different ways I could have obtained the
overall F tests for the a, b, and a by b
terms.
Now that we have demonstrated that you can reproduce the results from the
overparameterized model with an appropriate series of test statements
after a cell means model, let us look at a few different single
degree-of-freedom tests. (We will later see how to obtain these same single
degree-of-freedom tests after the overparameterized ANOVA.) You will want
to look back at the table showing how c
relates to a and b to see how these tests were constructed.
- Test level two against level four of factor a
. test i4.c + i5.c + i6.c = i10.c + i11.c + i12.c
( 1) 4.c + 5.c + 6.c - 10.c - 11.c - 12.c = 0
F( 1, 12) = 9.80
Prob > F = 0.0087
- Test the average of levels one and two against level three of factor
b
. test (i1.c + i4.c + i7.c + i10.c + i2.c + i5.c + i8.c + i11.c)/2
> = i3.c + i6.c + i9.c + i12.c
( 1) .5*1bn.c + .5*2.c - 3.c + .5*4.c + .5*5.c - 6.c + .5*7.c + .5*8.c - 9.c +
.5*10.c + .5*11.c - 12.c = 0
F( 1, 12) = 3.20
Prob > F = 0.0989
- Test the average of levels one and two of a, when b is
also at level one or two, against the average of levels three and four
of a, when b is at level three
. test (i1.c + i4.c + i2.c + i5.c)/2 = i9.c + i12.c
( 1) .5*1bn.c + .5*2.c + .5*4.c + .5*5.c - 9.c - 12.c = 0
F( 1, 12) = 5.38
Prob > F = 0.0388
- Test that three times a at one and b at one, minus four
times a at three and b at two, plus six times a at
four and b at three, equals a at two and b at two,
minus two times a at two and b at three
. test 3*i1.c - 4*i8.c + 6*i12.c = i5.c - 2*i6.c
( 1) 3*1bn.c - 5.c + 2*6.c - 4*8.c + 6*12.c = 0
F( 1, 12) = 32.58
Prob > F = 0.0001
Those same four tests using the test() option are
-
. mat m1 = (0,0,0,1,1,1,0,0,0,-1,-1,-1)
. test, test(m1)
( 1) 4.c + 5.c + 6.c - 10.c - 11.c - 12.c = 0
F( 1, 12) = 9.80
Prob > F = 0.0087
-
. mat m2 = (.5,.5,-1,.5,.5,-1,.5,.5,-1,.5,.5,-1)
. test, test(m2)
( 1) .5*1bn.c + .5*2.c - 3.c + .5*4.c + .5*5.c - 6.c + .5*7.c + .5*8.c - 9.c +
.5*10.c + .5*11.c - 12.c = 0
F( 1, 12) = 3.20
Prob > F = 0.0989
-
. mat m3 = (.5,.5,0,.5,.5,0,0,0,-1,0,0,-1)
. test, test(m3)
( 1) .5*1bn.c + .5*2.c + .5*4.c + .5*5.c - 9.c - 12.c = 0
F( 1, 12) = 5.38
Prob > F = 0.0388
-
. mat m4 = (3,0,0,0,-1,2,0,-4,0,0,0,6)
. test, test(m4)
( 1) 3*1bn.c - 5.c + 2*6.c - 4*8.c + 6*12.c = 0
F( 1, 12) = 32.58
Prob > F = 0.0001
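Each of these single degree-of-freedom tests is again just a weighted sum of
cell means divided by its standard error. A NumPy sketch (an illustration,
not Stata's code) reproducing the first and fourth F statistics:

```python
import numpy as np

# Cell means in c order (a runs slowest): (1,1),(1,2),(1,3),(2,1),...,(4,3).
m = np.array([28, 52, 40, 18, 30, 42, 38, 28, 48, 48, 46, 38], dtype=float)
n, s2 = 2, 60.0    # observations per cell; residual MS (df = 12)

def contrast_F(c):
    """Single degree-of-freedom F for c'm = 0; Var(c'm_hat) = c'c * s2/n."""
    c = np.asarray(c, dtype=float)
    return (c @ m) ** 2 / (c @ c * s2 / n)

# m1: level two vs level four of a (cells 4,5,6 vs cells 10,11,12).
F1 = contrast_F([0, 0, 0, 1, 1, 1, 0, 0, 0, -1, -1, -1])   # F = 9.80

# m4: 3*c1 - 1*c5 + 2*c6 - 4*c8 + 6*c12 = 0 (the fourth test above).
F4 = contrast_F([3, 0, 0, 0, -1, 2, 0, -4, 0, 0, 0, 6])    # F = 32.58
```

As before, the simple variance formula relies on the design being balanced.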
Constructing various single degree-of-freedom tests after a cell means ANOVA
model is relatively easy. You pick the appropriate linear combination of
the coefficients based on how the single categorical variable (c in
our example) relates to the original categorical variables (a and
b in our example) and based on the hypothesis of interest.
Overparameterized ANOVA model
Most people are used to the results presented by the overparameterized ANOVA
model. As we saw when we discussed the cell means ANOVA model, the F
tests for terms in the ANOVA model are obtained directly from the
overparameterized model ANOVA table. Compare this with computing an
F test for a term after the cell means ANOVA model. However, when it
comes to obtaining single degrees-of-freedom tests, most people find the cell
means model approach to be the easiest.
Here again is the overparameterized ANOVA model for our example data. Also,
I use the regress command to replay the ANOVA as a regression table.
. anova y a b a#b
Number of obs = 24 R-squared = 0.7606
Root MSE = 7.74597 Adj R-squared = 0.5412
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
Model | 2288 11 208 3.47 0.0214
|
a | 624 3 208 3.47 0.0509
b | 336 2 168 2.80 0.1005
a#b | 1328 6 221.333333 3.69 0.0259
|
Residual | 720 12 60
-----------+----------------------------------------------------
Total | 3008 23 130.782609
. regress, noheader
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
a |
2 | -10 7.745967 -1.29 0.221 -26.87701 6.877012
3 | 10 7.745967 1.29 0.221 -6.877012 26.87701
4 | 20 7.745967 2.58 0.024 3.122988 36.87701
|
b |
2 | 24 7.745967 3.10 0.009 7.122988 40.87701
3 | 12 7.745967 1.55 0.147 -4.877012 28.87701
|
a#b |
2 2 | -12 10.95445 -1.10 0.295 -35.8677 11.8677
2 3 | 12 10.95445 1.10 0.295 -11.8677 35.8677
3 2 | -34 10.95445 -3.10 0.009 -57.8677 -10.1323
3 3 | -2 10.95445 -0.18 0.858 -25.8677 21.8677
4 2 | -26 10.95445 -2.37 0.035 -49.8677 -2.132301
4 3 | -22 10.95445 -2.01 0.068 -45.8677 1.867699
|
_cons | 28 5.477226 5.11 0.000 16.06615 39.93385
------------------------------------------------------------------------------
Now the important question is how the coefficients in this model relate to
the cell means. To refresh your memory, here is the table of cell means
(and marginal means).
. table a b, c(mean y) row col
--------------------------------------
| b
a | 1 2 3 Total
----------+---------------------------
1 | 28 52 40 40
2 | 18 30 42 30
3 | 38 28 48 38
4 | 48 46 38 44
|
Total | 33 39 42 38
--------------------------------------
The cell mean for level i of a and level j of b equals
the coefficient for the constant, plus the coefficient for a at level
i, plus the coefficient for b at level j, plus the interaction
coefficient for a at level i and b at level j. Any coefficient omitted
from the regression table is zero. The table below
shows the relationship.
  a      b      Cell mean                    Cell mean (simplified)
  -----------------------------------------------------------------------
  a = 1  b = 1  _cons+i1.a+i1.b+i1.a#i1.b    _cons
  a = 1  b = 2  _cons+i1.a+i2.b+i1.a#i2.b    _cons+i2.b
  a = 1  b = 3  _cons+i1.a+i3.b+i1.a#i3.b    _cons+i3.b
  a = 2  b = 1  _cons+i2.a+i1.b+i2.a#i1.b    _cons+i2.a
  a = 2  b = 2  _cons+i2.a+i2.b+i2.a#i2.b    _cons+i2.a+i2.b+i2.a#i2.b
  a = 2  b = 3  _cons+i2.a+i3.b+i2.a#i3.b    _cons+i2.a+i3.b+i2.a#i3.b
  a = 3  b = 1  _cons+i3.a+i1.b+i3.a#i1.b    _cons+i3.a
  a = 3  b = 2  _cons+i3.a+i2.b+i3.a#i2.b    _cons+i3.a+i2.b+i3.a#i2.b
  a = 3  b = 3  _cons+i3.a+i3.b+i3.a#i3.b    _cons+i3.a+i3.b+i3.a#i3.b
  a = 4  b = 1  _cons+i4.a+i1.b+i4.a#i1.b    _cons+i4.a
  a = 4  b = 2  _cons+i4.a+i2.b+i4.a#i2.b    _cons+i4.a+i2.b+i4.a#i2.b
  a = 4  b = 3  _cons+i4.a+i3.b+i4.a#i3.b    _cons+i4.a+i3.b+i4.a#i3.b
The simplified expressions in the far-right column arise because the
coefficients omitted from the overparameterized model are zero.
The marginal means can easily be built up by averaging appropriate cell
means together.
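The build-up rule in the table can be sketched in a few lines of Python (the
coefficient values are copied from the regression output above; omitted base
coefficients are zero):

```python
# Coefficients from the overparameterized regression table.
cons = 28
a_coef  = {1: 0, 2: -10, 3: 10, 4: 20}
b_coef  = {1: 0, 2: 24, 3: 12}
ab_coef = {(2, 2): -12, (2, 3): 12, (3, 2): -34, (3, 3): -2,
           (4, 2): -26, (4, 3): -22}

def cell_mean(i, j):
    """_cons + a_i + b_j + (a#b)_ij, with omitted coefficients = 0."""
    return cons + a_coef[i] + b_coef[j] + ab_coef.get((i, j), 0)

# cell_mean(3, 2) reproduces the table entry 28; cell_mean(4, 3) gives 38.
```

Averaging cell_mean over a row or column of the table reproduces the
marginal means as well.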
We can obtain the same four single degree-of-freedom tests as were obtained
with the cell means ANOVA model by examining the relationship (shown in the
table above) between the coefficients of the overparameterized model and the
quantities of real interest—the cell means. We could plug the full
cell-mean expression for each cell involved in a test into
test and let Stata’s test command do the algebra, or we
could do some of the simplifying ourselves. For the same four tests that
were performed after the cell means model, I will show the results of
plugging the simplified expressions (the far-right column of the table
above) into test and letting Stata’s test command do the
algebra. (It would also work to plug the unsimplified cell-mean expressions
into test.)
- Test level two against level four of factor a
. test (_cons+i2.a)
> + (_cons+i2.a+i2.b+i2.a#i2.b)
> + (_cons+i2.a+i3.b+i2.a#i3.b)
> = (_cons+i4.a)
> + (_cons+i4.a+i2.b+i4.a#i2.b)
> + (_cons+i4.a+i3.b+i4.a#i3.b)
( 1) 3*2.a - 3*4.a + 2.a#2.b + 2.a#3.b - 4.a#2.b - 4.a#3.b = 0
F( 1, 12) = 9.80
Prob > F = 0.0087
- Test the average of levels one and two against level three of factor b
. test ((_cons)
> + (_cons+i2.a)
> + (_cons+i3.a)
> + (_cons+i4.a)
> + (_cons+i2.b)
> + (_cons+i2.a+i2.b+i2.a#i2.b)
> + (_cons+i3.a+i2.b+i3.a#i2.b)
> + (_cons+i4.a+i2.b+i4.a#i2.b))
> /2
> = (_cons+i3.b)
> + (_cons+i2.a+i3.b+i2.a#i3.b)
> + (_cons+i3.a+i3.b+i3.a#i3.b)
> + (_cons+i4.a+i3.b+i4.a#i3.b)
( 1) 2*2.b - 4*3.b + .5*2.a#2.b - 2.a#3.b + .5*3.a#2.b - 3.a#3.b + .5*4.a#2.b
- 4.a#3.b = 0
F( 1, 12) = 3.20
Prob > F = 0.0989
- Test the average of levels one and two of a, when b is
also at level one or two, against the average of levels three and four
of a, when b is at level three
. test ((_cons)
> + (_cons+i2.a)
> + (_cons+i2.b)
> + (_cons+i2.a+i2.b+i2.a#i2.b))/2
> = (_cons+i3.a+i3.b+i3.a#i3.b)
> + (_cons+i4.a+i3.b+i4.a#i3.b)
( 1) 2.a - 3.a - 4.a + 2.b - 2*3.b + .5*2.a#2.b - 3.a#3.b - 4.a#3.b = 0
F( 1, 12) = 5.38
Prob > F = 0.0388
- Test that three times a at one and b at one, minus four
times a at three and b at two, plus six times a at
four and b at three, equals a at two and b at two,
minus two times a at two and b at three
. test 3*(_cons)
> - 4*(_cons+i3.a+i2.b+i3.a#i2.b)
> + 6*(_cons+i4.a+i3.b+i4.a#i3.b)
> = (_cons+i2.a+i2.b+i2.a#i2.b)
> - 2*(_cons+i2.a+i3.b+i2.a#i3.b)
( 1) 2.a - 4*3.a + 6*4.a - 5*2.b + 8*3.b - 2.a#2.b + 2*2.a#3.b - 4*3.a#2.b +
6*4.a#3.b + 6*_cons = 0
F( 1, 12) = 32.58
Prob > F = 0.0001
You can compare these results with those obtained after the cell means ANOVA
model to see that they are the same.
The test() option of test may also be used. First we check
the order of the columns.
. test, showorder
Order of columns in the design matrix
1: (a==1)
2: (a==2)
3: (a==3)
4: (a==4)
5: (b==1)
6: (b==2)
7: (b==3)
8: (a==1)*(b==1)
9: (a==1)*(b==2)
10: (a==1)*(b==3)
11: (a==2)*(b==1)
12: (a==2)*(b==2)
13: (a==2)*(b==3)
14: (a==3)*(b==1)
15: (a==3)*(b==2)
16: (a==3)*(b==3)
17: (a==4)*(b==1)
18: (a==4)*(b==2)
19: (a==4)*(b==3)
20: _cons
Now here are the same four tests.
-
. mat x1 = (0,3,0,-3,0,0,0,0,0,0,1,1,1,0,0,0,-1,-1,-1,0)
. test, test(x1)
( 1) 3*2.a - 3*4.a + 2o.a#1b.b + 2.a#2.b + 2.a#3.b - 4o.a#1b.b - 4.a#2.b -
4.a#3.b = 0
F( 1, 12) = 9.80
Prob > F = 0.0087
-
. mat x2 = (0,0,0,0,2,2,-4,.5,.5,-1,.5,.5,-1,.5,.5,-1,.5,.5,-1,0)
. test, test(x2)
( 1) 2*1b.b + 2*2.b - 4*3.b + .5*1b.a#1b.b + .5*1b.a#2o.b - 1b.a#3o.b +
.5*2o.a#1b.b + .5*2.a#2.b - 2.a#3.b + .5*3o.a#1b.b + .5*3.a#2.b - 3.a#3.b
+ .5*4o.a#1b.b + .5*4.a#2.b - 4.a#3.b = 0
F( 1, 12) = 3.20
Prob > F = 0.0989
-
. mat x3 = (1,1,-1,-1,1,1,-2,.5,.5,0,.5,.5,0,0,0,-1,0,0,-1,0)
. test, test(x3)
( 1) 1b.a + 2.a - 3.a - 4.a + 1b.b + 2.b - 2*3.b + .5*1b.a#1b.b + .5*1b.a#2o.b
+ .5*2o.a#1b.b + .5*2.a#2.b - 3.a#3.b - 4.a#3.b = 0
F( 1, 12) = 5.38
Prob > F = 0.0388
-
. mat x4 = (3,1,-4,6,3,-5,8,3,0,0,0,-1,2,0,-4,0,0,0,6,6)
. test, test(x4)
( 1) 3*1b.a + 2.a - 4*3.a + 6*4.a + 3*1b.b - 5*2.b + 8*3.b + 3*1b.a#1b.b -
2.a#2.b + 2*2.a#3.b - 4*3.a#2.b + 6*4.a#3.b + 6*_cons = 0
F( 1, 12) = 32.58
Prob > F = 0.0001
How did we come up with matrices x1–x4? Let me
illustrate with x4.
 column   definition   3*a[1]b[1]  -4*a[3]b[2]  6*a[4]b[3]  -1*a[2]b[2]  2*a[2]b[3]   Sum
 ----------------------------------------------------------------------------------------
    1     a==1                  3            0           0            0           0     3
    2     a==2                  0            0           0           -1           2     1
    3     a==3                  0           -4           0            0           0    -4
    4     a==4                  0            0           6            0           0     6
    5     b==1                  3            0           0            0           0     3
    6     b==2                  0           -4           0           -1           0    -5
    7     b==3                  0            0           6            0           2     8
    8     a==1,b==1             3            0           0            0           0     3
    9     a==1,b==2             0            0           0            0           0     0
   10     a==1,b==3             0            0           0            0           0     0
   11     a==2,b==1             0            0           0            0           0     0
   12     a==2,b==2             0            0           0           -1           0    -1
   13     a==2,b==3             0            0           0            0           2     2
   14     a==3,b==1             0            0           0            0           0     0
   15     a==3,b==2             0           -4           0            0           0    -4
   16     a==3,b==3             0            0           0            0           0     0
   17     a==4,b==1             0            0           0            0           0     0
   18     a==4,b==2             0            0           0            0           0     0
   19     a==4,b==3             0            0           6            0           0     6
   20     _cons                 3           -4           6           -1           2     6
The column titled “Sum” is the sum of the previous five
columns and provides the elements of the row vector x4.
Sometimes it is easiest to specify the equation, and other times it is easier
to specify the corresponding matrix in the test() option. Either
method works.
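The construction of x4 can also be done programmatically: for each weighted
cell, add the weight to its a column, its b column, its interaction column,
and the _cons column. A Python sketch, assuming the showorder column layout
listed above:

```python
import numpy as np

# Columns in showorder order: a==1..4, b==1..3, then (a==i)*(b==j) with
# b running fastest, then _cons: 20 columns in all.
def contrast_row(weights, n_a=4, n_b=3):
    """Turn {(i, j): weight} cell weights into a test() matrix row."""
    row = np.zeros(n_a + n_b + n_a * n_b + 1)
    for (i, j), w in weights.items():
        row[i - 1] += w                                 # a == i column
        row[n_a + j - 1] += w                           # b == j column
        row[n_a + n_b + (i - 1) * n_b + j - 1] += w     # interaction column
        row[-1] += w                                    # _cons column
    return row

# The fourth test: 3*cell(1,1) - 4*cell(3,2) + 6*cell(4,3)
#                  - 1*cell(2,2) + 2*cell(2,3) = 0.
x4 = contrast_row({(1, 1): 3, (3, 2): -4, (4, 3): 6, (2, 2): -1, (2, 3): 2})
```

The resulting vector matches the “Sum” column of the table row by row.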