Why do I see different p-values, etc., when I change the base
level for a factor in my regression?
Why does the p-value for a term in my ANOVA not agree with the
p-value for the coefficient for that term in the corresponding
regression?
Title:  Interpreting coefficients when interactions are in your model
Author: Kenneth Higbee, StataCorp
Date:   May 2010; minor revisions July 2011
I will illustrate what is happening with a simple example using
regress.
We will explore the hypotheses being tested as we change the base (omitted)
level when we have an interaction in a simple two-factor model.
For this simple example, each factor has only two levels.
The key conclusion is that, despite what some may believe, the test of a
single coefficient in a regression containing interactions depends on the
choice of base levels. Changing from one base to another changes the
hypothesis being tested. Furthermore, the hypothesis for a test involving a
single regression coefficient is generally not the same as the hypothesis
tested by an ANOVA F test of a main effect of a factor. This may be
counterintuitive at first glance, but it is true.
Take the following data:
. use http://www.stata.com/support/faqs/dta/anoregcoef.dta, clear
. list, sepby(A B)
+------------+
| y A B |
|------------|
1. | 13 1 1 |
2. | 34 1 1 |
3. | 25 1 1 |
4. | 30 1 1 |
|------------|
5. | 28 1 2 |
6. | 10 1 2 |
7. | 41 1 2 |
|------------|
8. | 11 2 1 |
9. | 55 2 1 |
|------------|
10. | 87 2 2 |
11. | 25 2 2 |
12. | 14 2 2 |
13. | 42 2 2 |
14. | 89 2 2 |
15. | 52 2 2 |
16. | 38 2 2 |
17. | 45 2 2 |
+------------+
. tabulate A B, summarize(y) means obs
Means and Number of Observations of y
| B
A | 1 2 | Total
-----------+----------------------+----------
1 | 25.5 26.333333 | 25.857143
| 4 3 | 7
-----------+----------------------+----------
2 | 33 49 | 45.8
| 2 8 | 10
-----------+----------------------+----------
Total | 28 42.818182 | 37.588235
| 6 11 | 17
We have a 2 × 2 table with unbalanced data—that is, different
sample sizes (4, 3, 2, and 8) in each cell.
We will refer back to this 2 × 2 table and compare its cell means with the
coefficients in the regression tables that follow. These comparisons help
make clear exactly which hypotheses are being tested.
Let’s start by thinking of the overparameterized design matrix X:
                               |   A#B   |
       |  A  |     |  B  |     | 1 1 2 2 |
       | 1 2 |     | 1 2 |     | 1 2 1 2 |   | _cons |
       +-----+     +-----+     +---------+   +-------+
         1 0         1 0         1 0 0 0         1
         1 0         1 0         1 0 0 0         1
         1 0         1 0         1 0 0 0         1
         1 0         1 0         1 0 0 0         1
         1 0         0 1         0 1 0 0         1
         1 0         0 1         0 1 0 0         1
         1 0         0 1         0 1 0 0         1
         0 1         1 0         0 0 1 0         1
   X =   0 1         1 0         0 0 1 0         1
         0 1         0 1         0 0 0 1         1
         0 1         0 1         0 0 0 1         1
         0 1         0 1         0 0 0 1         1
         0 1         0 1         0 0 0 1         1
         0 1         0 1         0 0 0 1         1
         0 1         0 1         0 0 0 1         1
         0 1         0 1         0 0 0 1         1
         0 1         0 1         0 0 0 1         1
We want to compute regression coefficients b = inv(X'X)*(X'y), but because of
the collinearities in X (A1 + A2 = _cons, B1 + B2 = _cons, ...), many of the
columns of X must be omitted to have a matrix of full rank that we
can invert.
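If you would like to see these collinearities for yourself, here is a sketch
that builds the dummy columns of X by hand (the variable names Ad1, Bd1,
AB11, and so on are my own invention) and lets regress report which columns
it omits:
. * construct the indicator columns of X
. tabulate A, generate(Ad)
. tabulate B, generate(Bd)
. * the four A#B columns
. generate AB11 = Ad1*Bd1
. generate AB12 = Ad1*Bd2
. generate AB21 = Ad2*Bd1
. generate AB22 = Ad2*Bd2
. * Ad1+Ad2, Bd1+Bd2, and AB11+AB12+AB21+AB22 each equal 1 in every
. * observation (they reproduce _cons), so regress omits columns to
. * obtain a full-rank matrix:
. regress y Ad1 Ad2 Bd1 Bd2 AB11 AB12 AB21 AB22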
Either the A1 or the A2 column needs to be omitted (or possibly the _cons,
but let’s not explore that right now). The column we omit corresponds to
what we call the base level for that factor. Likewise for B1 and
B2—one of them must be omitted to avoid collinearity with the
constant. Of the four columns of X for the A by B interaction, three of them
must be omitted (given that we are keeping one of the A columns, one of the
B columns, and _cons).
We could choose to omit the first level of both A and B (the A1 and B1
columns of X) and the columns corresponding to A#B that match up with those
selections (in this case, the first 3 columns of the part of X for A#B).
. regress y b1.A b1.B A#B
The above command is equivalent to Stata’s default of picking the first
level as the base, which is what you get when you simply type
. regress y i.A i.B A#B
or even more succinctly,
. regress y A##B
With any regress command in this FAQ, you can add the allbaselevels
option to get a more verbose regression table that indicates exactly which
columns of the X matrix were omitted. Once the concept is perfectly clear,
you may choose to drop the allbaselevels option, because the extra rows it
adds can seem overly verbose.
Instead of choosing A at level 1 and B at level 1 for the base, we could make three
other choices for base:
A at level 1, B at level 2
A at level 2, B at level 1
A at level 2, B at level 2
You can get these three other choices with these commands:
. regress y b1.A b2.B A#B
. regress y b2.A b1.B A#B
. regress y b2.A b2.B A#B
Run those four regressions, examine the coefficients, and compare them with
the means shown in the table above.
Let’s start with the default base levels. Just to be clear on which
columns are dropped from the X matrix we showed above, first type
the command:
. regress y b1.A b1.B A#B, allbaselevels
Then for the sake of brevity here, we look at a condensed version of
the same regression table.
. regress y b1.A b1.B A#B, noheader
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
2.A | 7.5 19.72162 0.38 0.710 -35.10597 50.10597
2.B | .8333333 17.39283 0.05 0.963 -36.7416 38.40827
|
A#B |
2 2 | 15.16667 25.03256 0.61 0.555 -38.9129 69.24623
|
_cons | 25.5 11.38628 2.24 0.043 .9014315 50.09857
------------------------------------------------------------------------------
The _cons coefficient, 25.5, corresponds to the mean of the A1,B1
cell in our 2 × 2 table. In other words, the constant in the
regression corresponds to the cell in our 2 × 2 table for our chosen base
levels (A at 1 and B at 1).
We get the mean of the A1,B2 cell in our 2 × 2 table, 26.33333, by
adding the _cons coefficient to the 2.B coefficient (25.5 + 0.833333).
We get the mean of the A2,B1 cell in our 2 × 2 table, 33, by adding the
_cons coefficient to the 2.A coefficient (25.5 + 7.5).
We get the mean of the A2,B2 cell in our 2 × 2 table, 49, by adding the
_cons coefficient to the 2.A coefficient, the 2.B
coefficient, and the 2.A#2.B coefficient (25.5 + 7.5 + 0.8333 +
15.1667).
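Instead of adding coefficients by hand, you can have Stata do the
arithmetic. As a sketch, margins should reproduce all four cell means after
this regression, and lincom recovers any single cell:
. regress y b1.A b1.B A#B
. margins A#B
. * the A2,B2 cell as a linear combination of coefficients (should be 49)
. lincom _cons + 2.A + 2.B + 2.A#2.B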
Let’s focus on the 2.A coefficient, which equals 7.5. What
does it correspond to? It corresponds to the A2,B1 cell minus the A1,B1
cell. Looking back at our 2 × 2 table, that would be 33 − 25.5.
When you look at the test for that single regression coefficient, you are
testing this hypothesis: with B set to 1, is there a
difference between level 2 of A and level 1 of A?
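You can also carry out that single-coefficient test as a Wald test. The
following sketch should reproduce the p-value of 0.710 shown above, with an
F statistic equal to the square of the t statistic:
. * H0: no difference between level 2 and level 1 of A when B is 1
. test 2.A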
Now pick one of the other three regressions that uses a different combination
of bases for the two factors. We pick the last one.
Just to be sure you are clear on what has been omitted from the X matrix,
type the command:
. regress y b2.A b2.B A#B, allbaselevels
Then for brevity, here is the same regression shown more compactly:
. regress y b2.A b2.B A#B, noheader
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1.A | -22.66667 15.4171 -1.47 0.165 -55.97329 10.63995
1.B | -16 18.00329 -0.89 0.390 -54.89375 22.89375
|
A#B |
1 1 | 15.16667 25.03256 0.61 0.555 -38.9129 69.24623
|
_cons | 49 8.051318 6.09 0.000 31.60619 66.39381
------------------------------------------------------------------------------
Here the _cons coefficient, 49, equals the mean for the A2,B2 cell of
our 2 × 2 table. This corresponds to our choice of level 2 as our base
level for both A and B.
We get the mean of the A1,B2 cell, 26.3333, by adding the _cons coefficient
to the 1.A coefficient (49 − 22.6667).
We get the mean of the A2,B1 cell, 33, by adding the _cons coefficient to
the 1.B coefficient (49 − 16).
We get the mean of the A1,B1 cell, 25.5, by adding all four of the
coefficients (49 − 22.6667 − 16 + 15.1667).
Let’s look closely at the 1.A coefficient, which is -22.6667. That
coefficient corresponds to the A1,B2 cell minus the A2,B2 cell. From our
2 × 2 table, that would be 26.3333 − 49. When you look at the test for
that single regression coefficient, you are testing the hypothesis: with B
set to 2, is there a difference between level 1 of A
and level 2 of A?
The hypothesis for the test of the 1.A coefficient in this model is
not equivalent to the hypothesis for the test of the 2.A coefficient
in the previous regression model. They are both testing A, but in
the first case it is a test of A with B set to 1. In this
second case, it is a test of A with B set to 2.
In the first test, the p-value was 0.710. In the second, the p-value is
0.165. These are very different p-values for this dataset, but this is not shocking
because they are testing different hypotheses.
I could illustrate what the coefficients represent in the other two
regressions (where we pick other combinations of the levels of A and B to be
the base), but I will refrain because it would make a long FAQ even longer.
The ANOVA test of the main effect of A is a different test from both of the
coefficient tests shown above.
. anova y A B A#B
Number of obs = 17 R-squared = 0.2330
Root MSE = 22.7726 Adj R-squared = 0.0560
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
Model | 2048.45098 3 682.816993 1.32 0.3112
|
A | 753.126437 1 753.126437 1.45 0.2496
B | 234.505747 1 234.505747 0.45 0.5131
A#B | 190.367816 1 190.367816 0.37 0.5550
|
Residual | 6741.66667 13 518.589744
-----------+----------------------------------------------------
Total | 8790.11765 16 549.382353
The test of the main effect of A gives a p-value of 0.2496.
You get the same p-value for the main effect of A regardless
of whether you type the anova command as shown above or pick
different base levels. The following commands all give the same F
tests:
. anova y b1.A b1.B A#B
. anova y b1.A b2.B A#B
. anova y b2.A b1.B A#B
. anova y b2.A b2.B A#B
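If you would like to verify that, one sketch is to loop over the four
base-level choices and reproduce the F test for A after each fit:
. foreach a in 1 2 {
.     foreach b in 1 2 {
.         quietly anova y b`a'.A b`b'.B A#B
.         display "base levels: A = `a', B = `b'"
.         test A
.     }
. }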
How would you get the ANOVA main-effect F test for term A from
the underlying regression coefficients? Take a look at the
symbolic option of test after anova.
. quietly anova y A B A#B
. test A
Source | Partial SS df MS F Prob > F
-----------+----------------------------------------------------
A | 753.126437 1 753.126437 1.45 0.2496
Residual | 6741.66667 13 518.589744
. test A, symbolic
A
1 -r2
2 r2
B
1 0
2 0
A#B
1 1 -1/2 r2
1 2 -1/2 r2
2 1 1/2 r2
2 2 1/2 r2
_cons 0
For each of the regressions, we can get the same F test for the main effect
of A as shown by the ANOVA above. Type the following commands:
. regress y b1.A b1.B A#B
. test _b[2.A] + 0.5*_b[2.A#2.B] = 0
. regress y b1.A##b2.B
. test _b[2.A] + 0.5*_b[2.A#1.B] = 0
. regress y b2.A##b1.B
. test _b[1.A] + 0.5*_b[1.A#2.B] = 0
. regress y b2.A##b2.B
. test _b[1.A] + 0.5*_b[1.A#1.B] = 0
Refer back to the test A, symbolic table to see why the tests above
are set up the way they are. If you are not sure how I knew to type
_b[2.A#2.B] etc., use the coeflegend option of regress.
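For example, the following relists the first regression with each
coefficient’s _b[] name shown in place of the usual statistics:
. regress y b1.A b1.B A#B, coeflegend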
I admit that using the linear combination of regression coefficients _b[2.A] +
0.5*_b[2.A#2.B] (picking the first regression as an example) to produce the
F test for term A’s main effect is not obvious or intuitive.
Let’s look at the algebra when the first levels of A and B are the
base levels for our regression:
2 x 2
cell = linear combination of coefficients
--------------------------------------------------------
A1,B1 = _b[_cons]
A1,B2 = _b[_cons] + _b[2.B]
A2,B1 = _b[_cons] + _b[2.A]
A2,B2 = _b[_cons] + _b[2.A] + _b[2.B] + _b[2.A#2.B]
You find that 0.5*(A2,B1 + A2,B2) − 0.5*(A1,B1 + A1,B2) equals
_b[2.A] + 0.5*_b[2.A#2.B].
The F test in ANOVA for the main effect of A is testing the following
hypothesis: the average of the cell means when A is 2 minus the average of
the cell means when A is 1 equals zero.
A similar demonstration could be shown for the other three regression models where other base
levels were selected.
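As a final aside, recent versions of Stata offer a shortcut: because
contrast treats the factors as balanced by default, it should reproduce the
ANOVA main-effect F test directly after any of these fits. Here is a sketch
worth checking against the ANOVA output above:
. quietly regress y A##B
. * should match the ANOVA result for A: F(1,13) = 1.45, p = 0.2496
. contrast A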