Title | Interpreting coefficients when interactions are in your model

Author | Kenneth Higbee, StataCorp

I will illustrate how single coefficients should be interpreted when a model
contains an interaction, using a simple example with **regress**.
We will explore the hypotheses being tested as we change the base (omitted)
level when we have an interaction in a simple two-factor model.
For this simple example, each factor has only two levels.

The key conclusion is that, despite what some may believe, the test of a
single coefficient in a regression model when interactions are in the model
depends on the choice of base levels. Changing from one base to another
changes the hypothesis. Furthermore, the hypothesis for a test involving a
single regression coefficient is generally not the same as the hypothesis
tested by an ANOVA *F* test of the main effect of a factor. This may be
counterintuitive at first glance, but it is true.

Take the following data:

. use http://www.stata.com/support/faqs/dta/anoregcoef.dta, clear

. list, sepby(A B)

         +------------+
         |  y   A   B |
         |------------|
      1. | 13   1   1 |
      2. | 34   1   1 |
      3. | 25   1   1 |
      4. | 30   1   1 |
         |------------|
      5. | 28   1   2 |
      6. | 10   1   2 |
      7. | 41   1   2 |
         |------------|
      8. | 11   2   1 |
      9. | 55   2   1 |
         |------------|
     10. | 87   2   2 |
     11. | 25   2   2 |
     12. | 14   2   2 |
     13. | 42   2   2 |
     14. | 89   2   2 |
     15. | 52   2   2 |
     16. | 38   2   2 |
     17. | 45   2   2 |
         +------------+

              |          B
            A |         1          2 |     Total
    ----------+----------------------+----------
            1 |      25.5  26.333333 | 25.857143
              |         4          3 |         7
              |                      |
            2 |        33         49 |      45.8
              |         2          8 |        10
              |                      |
        Total |        28  42.818182 | 37.588235
              |         6         11 |        17

The table above shows, for each combination of **A** and **B**, the mean of **y** with the number of observations beneath it. It is a 2 × 2 table with unbalanced data; that is, the cells have different sample sizes (4, 3, 2, and 8). We will refer back to this table and compare its means with the coefficients in the regression tables that follow. These comparisons help clarify which hypotheses are being tested.
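The command that produced the table of means is not shown above. As one possibility, and only as a sketch, a similar table of cell means and counts can be requested with **tabulate** and its **summarize()** option. The layout of the output may differ from the table shown here.

```stata
use http://www.stata.com/support/faqs/dta/anoregcoef.dta, clear

* Mean of y and number of observations for each A-by-B cell,
* with row and column totals; nostandard suppresses the standard deviations
tabulate A B, summarize(y) nostandard
```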

Let’s start by thinking of the overparameterized design matrix **X**:

                                |   A#B   |
              |  A  |  |  B  |  | 1 1 2 2 |
              | 1 2 |  | 1 2 |  | 1 2 1 2 |  | _cons |
              +-----+  +-----+  +---------+  +-------+
                1 0      1 0      1 0 0 0        1
                1 0      1 0      1 0 0 0        1
                1 0      1 0      1 0 0 0        1
                1 0      1 0      1 0 0 0        1
                1 0      0 1      0 1 0 0        1
                1 0      0 1      0 1 0 0        1
                1 0      0 1      0 1 0 0        1
                0 1      1 0      0 0 1 0        1
        X =     0 1      1 0      0 0 1 0        1
                0 1      0 1      0 0 0 1        1
                0 1      0 1      0 0 0 1        1
                0 1      0 1      0 0 0 1        1
                0 1      0 1      0 0 0 1        1
                0 1      0 1      0 0 0 1        1
                0 1      0 1      0 0 0 1        1
                0 1      0 1      0 0 0 1        1
                0 1      0 1      0 0 0 1        1

We want to compute regression coefficients b = inv(X'X)*(X'y), but because of
the collinearities in **X** (A1 + A2 = **_cons**, B1 + B2 = **_cons**, ...), many of the
columns of **X** must be omitted to have a matrix of full rank that we
can invert.

Either the A1 or the A2 column needs to be omitted (or possibly the **_cons**,
but let’s not explore that right now). The column we omit corresponds to
what we call the *base level* for that factor. Likewise for B1 and
B2—one of them must be omitted to avoid collinearity with the
constant. Of the four columns of **X** for the A by B interaction, three of them
must be omitted (given that we are keeping one of the **A** columns, one of the
**B** columns, and **_cons**).

We could choose to omit the first level of both **A** and **B** (the A1 and B1
columns of **X**) and the columns corresponding to **A#B** that match up with those
selections (in this case, the first 3 columns of the part of **X** for **A#B**).

. regress y b1.A b1.B A#B

The above command is equivalent to Stata’s default of picking the first level to be the base when you simply type

. regress y i.A i.B A#B

or even more succinctly,

. regress y A##B

In all cases of **regress** in this FAQ, add the **allbaselevels**
option to get a more verbose regression table that indicates exactly which
columns of the **X** matrix were omitted. After the concept is
perfectly clear, you may choose not to use the **allbaselevels**
option because it seems overly verbose.
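To connect the formula b = inv(X'X)*(X'y) with the default choice of base levels, here is a minimal sketch that builds by hand the columns of **X** that are kept when level 1 is the base for both factors, and then solves for b. The variable names a2, b2, and a2b2 and the matrix names XX, yX, and b are illustrative choices, not part of the dataset or of the output above.

```stata
use http://www.stata.com/support/faqs/dta/anoregcoef.dta, clear

* Columns of X that remain when level 1 of A and of B is the base
* (a2, b2, and a2b2 are illustrative names)
generate byte a2   = (A == 2)
generate byte b2   = (B == 2)
generate byte a2b2 = a2*b2

* X'X and y'X; both commands append the constant as the last column
matrix accum XX = a2 b2 a2b2
matrix vecaccum yX = y a2 b2 a2b2

* b = inv(X'X)*(X'y)
matrix b = invsym(XX)*yX'
matrix list b
```

If the sketch is set up correctly, the four entries of b should agree with the coefficients that **regress y b1.A b1.B A#B** reports for **2.A**, **2.B**, **2.A#2.B**, and **_cons**.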

Instead of choosing **A** at level 1 and **B** at level 1 for the base, we could make three
other choices for base:

**A** at level 1, **B** at level 2

**A** at level 2, **B** at level 1

**A** at level 2, **B** at level 2

You can get these three other choices with these commands:

. regress y b1.A b2.B A#B

. regress y b2.A b1.B A#B

. regress y b2.A b2.B A#B

Run those four regressions, examine the coefficients, and compare them with the means shown in the table above.
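As a convenience, the four regressions can be run in one pass from a do-file. This is only a sketch: the loop macros a and b are illustrative names, and the loop relies on ordinary macro substitution to build **b1.A**, **b2.A**, and so on.

```stata
use http://www.stata.com/support/faqs/dta/anoregcoef.dta, clear

* a is the base level of A and b is the base level of B
foreach a in 1 2 {
    foreach b in 1 2 {
        display _newline "Base levels: A = `a', B = `b'"
        regress y b`a'.A b`b'.B A#B, noheader allbaselevels
    }
}
```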

Let’s start with the default base levels. Just to be clear on which
columns are dropped from the **X** matrix we showed above, first type
the command:

. regress y b1.A b1.B A#B, allbaselevels

Then for the sake of brevity here, we look at a condensed version of the same regression table.

. regress y b1.A b1.B A#B, noheader

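    ------------------------------------------------------------------------------
               y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
             2.A |        7.5   19.72162     0.38   0.710    -35.10597    50.10597
             2.B |   .8333333   17.39283     0.05   0.963     -36.7416    38.40827
                 |
             A#B |
            2 2  |   15.16667   25.03256     0.61   0.555     -38.9129    69.24623
                 |
           _cons |       25.5   11.38628     2.24   0.043     .9014315    50.09857
    ------------------------------------------------------------------------------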

The **_cons** coefficient, 25.5, corresponds to the mean of the A1,B1
cell in our 2 × 2 table. In other words, the constant in the
regression corresponds to the cell in our 2 × 2 table for our chosen base
levels (**A** at 1 and **B** at 1).

We get the mean of the A1,B2 cell in our 2 × 2 table, 26.33333, by
adding the **_cons** coefficient to the **2.B** coefficient (25.5 + 0.833333).

We get the mean of the A2,B1 cell in our 2 × 2 table, 33, by adding the
**_cons** coefficient to the **2.A** coefficient (25.5 + 7.5).

We get the mean of the A2,B2 cell in our 2 × 2 table, 49, by adding the
**_cons** coefficient to the **2.A** coefficient, the **2.B**
coefficient, and the **2.A#2.B** coefficient (25.5 + 7.5 + 0.8333 +
15.1667).
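The same sums can be left to Stata. The sketch below uses **lincom** after the regression with level 1 as the base for both factors; each call should return the corresponding cell mean from the 2 × 2 table, along with a standard error and confidence interval.

```stata
use http://www.stata.com/support/faqs/dta/anoregcoef.dta, clear
quietly regress y b1.A b1.B A#B

* A1,B1 cell
lincom _b[_cons]
* A1,B2 cell
lincom _b[_cons] + _b[2.B]
* A2,B1 cell
lincom _b[_cons] + _b[2.A]
* A2,B2 cell
lincom _b[_cons] + _b[2.A] + _b[2.B] + _b[2.A#2.B]
```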

Let’s focus on the **2.A** coefficient, which equals 7.5. What
does it correspond to? It corresponds to the A2,B1 cell minus the A1,B1
cell. Looking back at our 2 × 2 table, that would be 33 − 25.5.
When you look at the test for that single regression coefficient, you are
testing this hypothesis: *with B set to 1*, is there a
difference between level 2 of **A** and level 1 of **A**?
Now pick one of the other three regressions that uses a different combination of bases for the two factors. We pick the last one.

Just to be sure you are clear on what has been omitted from the **X** matrix,
type the command:

. regress y b2.A b2.B A#B, allbaselevels

Then for brevity, here is the same regression shown more compactly:

. regress y b2.A b2.B A#B, noheader

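    ------------------------------------------------------------------------------
               y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
             1.A |  -22.66667    15.4171    -1.47   0.165    -55.97329    10.63995
             1.B |        -16   18.00329    -0.89   0.390    -54.89375    22.89375
                 |
             A#B |
            1 1  |   15.16667   25.03256     0.61   0.555     -38.9129    69.24623
                 |
           _cons |         49   8.051318     6.09   0.000     31.60619    66.39381
    ------------------------------------------------------------------------------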

Here the **_cons** coefficient, 49, equals the mean for the A2,B2 cell of
our 2 × 2 table. This corresponds to our choice of level 2 as our base
level for both **A** and **B**.

We get the mean of the A1,B2 cell, 26.3333, by adding the **_cons** coefficient
to the **1.A** coefficient (49 + (-22.6667)).

We get the mean of the A2,B1 cell, 33, by adding the **_cons** coefficient to
the **1.B** coefficient (49 + (-16)).

We get the mean of the A1,B1 cell, 25.5, by adding all four of the coefficients (49 + (-22.6667) + (-16) + 15.1667).

Let’s look closely at the **1.A** coefficient, which is -22.6667. That
coefficient corresponds to the A1,B2 cell minus the A2,B2 cell. From our
2 × 2 table, that would be 26.3333 − 49. When you look at the test for
that single regression coefficient, you are testing the hypothesis: *with B
set to 2*, is there a difference between level 1 of **A** and level 2 of **A**?

The hypothesis for the test of the **1.A** coefficient in this model is
not equivalent to the hypothesis for the test of the **2.A** coefficient
in the previous regression model. They are both testing **A**, but in
the first case it is a test of **A** with **B** set to 1. In this
second case, it is a test of **A** with **B** set to 2.

In the first test, the *p*-value was 0.710. In the second, the *p*-value is
0.165. These are very different *p*-values for this dataset, but this is not shocking
because they are testing different hypotheses.
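Both of these tests can also be requested from a single fitted model. The following sketch uses **contrast** with the **r.** operator (differences from the reference level) and the **@** operator (effects within each level of another factor). It is offered as an alternative route, not as the method used above.

```stata
use http://www.stata.com/support/faqs/dta/anoregcoef.dta, clear
quietly regress y A##B

* Difference between level 2 and level 1 of A, separately at B = 1 and B = 2
contrast r.A@B, effects
```

The contrast at B = 1 should correspond to the **2.A** coefficient from the first regression, and the contrast at B = 2 should correspond to the **1.A** coefficient from the second regression with its sign reversed, because here level 1 is the reference in both rows.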

I could illustrate what the coefficients represent in the other two
regressions (where we pick other combinations of the levels of **A** and **B** to be
the base), but I will refrain because it would make a long FAQ even longer.

The ANOVA test of the main effect of **A** is a different test from both of the
coefficient tests shown above.

. anova y A B A#B

                      Number of obs =         17    R-squared     =  0.2330
                      Root MSE      =    22.7726    Adj R-squared =  0.0560

          Source |  Partial SS    df           MS       F  Prob > F
    -------------+-------------------------------------------------
           Model |  2048.45098     3   682.816993    1.32    0.3112
                 |
               A |  753.126437     1   753.126437    1.45    0.2496
               B |  234.505747     1   234.505747    0.45    0.5131
             A#B |  190.367816     1   190.367816    0.37    0.5550
                 |
        Residual |  6741.66667    13   518.589744
    -------------+-------------------------------------------------
           Total |  8790.11765    16   549.382353

The test of the main effect of **A** gives a *p*-value of 0.2496.
You get the same *p*-value for the main effect of **A** regardless
of whether you type the **anova** command as shown above or pick
different base levels. The following commands all give the same *F*
tests:

. anova y b1.A b1.B A#B

. anova y b1.A b2.B A#B

. anova y b2.A b1.B A#B

. anova y b2.A b2.B A#B

How would you get the ANOVA main-effect *F* test for term **A** from
the underlying regression coefficients? Take a look at the
**symbolic** option of **test** after **anova**.

. quietly anova y A B A#B

. test A

          Source |  Partial SS    df           MS       F  Prob > F
    -------------+-------------------------------------------------
               A |  753.126437     1   753.126437    1.45    0.2496
        Residual |  6741.66667    13   518.589744
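The symbolic form of this test is not reproduced here. As a sketch, it can be requested by adding the **symbolic** option to the same **test** command; the output lists how each column of the overparameterized design matrix enters the hypothesis.

```stata
use http://www.stata.com/support/faqs/dta/anoregcoef.dta, clear
quietly anova y A B A#B

* Show the hypothesis for the main effect of A in symbolic form
test A, symbolic
```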

For each of the regressions, we can get the same *F* test for the main effect
of **A** as shown by the ANOVA above. Type the following commands:

. regress y b1.A b1.B A#B

. test _b[2.A] + 0.5*_b[2.A#2.B] = 0

. regress y b1.A##b2.B

. test _b[2.A] + 0.5*_b[2.A#1.B] = 0

. regress y b2.A##b1.B

. test _b[1.A] + 0.5*_b[1.A#2.B] = 0

. regress y b2.A##b2.B

. test _b[1.A] + 0.5*_b[1.A#1.B] = 0

Refer to the output of **test A, symbolic** (run after **anova**) to see why
the tests above are set up the way they are. If you are not sure how I knew to
type **_b[2.A#2.B]** etc., use the **coeflegend** option of **regress**.
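For example, the following replaces the usual columns of statistics with a legend showing how to refer to each coefficient:

```stata
use http://www.stata.com/support/faqs/dta/anoregcoef.dta, clear

* The legend column shows names such as _b[2.A] and _b[2.A#2.B]
regress y b1.A b1.B A#B, coeflegend
```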

I admit that using the linear combination of regression coefficients **_b[2.A] +
0.5*_b[2.A#2.B]** (picking the first regression as an example) to produce the
*F* test for term **A**’s main effect is not obvious or intuitive.
Let’s look at the algebra when the first levels of **A** and **B** are the
base levels for our regression:

    2 x 2 cell  =  linear combination of coefficients
    -------------------------------------------------
    A1,B1       =  _b[_cons]
    A1,B2       =  _b[_cons] + _b[2.B]
    A2,B1       =  _b[_cons] + _b[2.A]
    A2,B2       =  _b[_cons] + _b[2.A] + _b[2.B] + _b[2.A#2.B]

You find that 0.5*(A2,B1 + A2,B2) − 0.5*(A1,B1 + A1,B2) equals
**_b[2.A] + 0.5*_b[2.A#2.B]**.
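The intermediate step can be written out as follows. The symbols are shorthand introduced only for this display: β₀ for **_b[_cons]**, β_A for **_b[2.A]**, β_B for **_b[2.B]**, and β_AB for **_b[2.A#2.B]**.

```latex
\[
\begin{aligned}
&\tfrac{1}{2}\bigl[(\beta_0+\beta_A) + (\beta_0+\beta_A+\beta_B+\beta_{AB})\bigr]
 - \tfrac{1}{2}\bigl[\beta_0 + (\beta_0+\beta_B)\bigr] \\
&\qquad= \tfrac{1}{2}\,(2\beta_0 + 2\beta_A + \beta_B + \beta_{AB})
       - \tfrac{1}{2}\,(2\beta_0 + \beta_B) \\
&\qquad= \beta_A + \tfrac{1}{2}\,\beta_{AB}
\end{aligned}
\]
```

The constant and the **2.B** coefficient cancel, which is why neither appears in the **test** command.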

The *F* test in ANOVA for the main effect of **A** is testing the following
hypothesis: the average of the cell means when A is 2 − the average
of the cell means when A is 1 = 0.
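To see the estimate behind this hypothesis as well as its test, one option is **lincom** with the same linear combination. This is a sketch using the first regression.

```stata
use http://www.stata.com/support/faqs/dta/anoregcoef.dta, clear
quietly regress y b1.A b1.B A#B

* Average of the A = 2 cell means minus the average of the A = 1 cell means
lincom _b[2.A] + 0.5*_b[2.A#2.B]
```

From the coefficients reported above, the estimate should be 7.5 + 0.5 × 15.16667, or about 15.08. **lincom** reports a *t* statistic for this single-degree-of-freedom combination; its square should equal the ANOVA *F* statistic for **A**, with the same *p*-value.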

A similar demonstration could be shown for the other three regression models where other base levels were selected.