
# Re: st: indicator variable and interaction term different signs but both significant

From: David Hoaglin
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: indicator variable and interaction term different signs but both significant
Date: Mon, 8 Apr 2013 14:29:16 -0400

```Thanks, Richard.

I don't have examples of "terrible harms," but the common phrasing can
easily mislead, by giving the impression that one can hold the other
predictors constant when the data do not support such a statement and
(perhaps) by not keeping in view the other predictors, which have been
adjusted for (not, in general, "controlled for").

Rather than continue to torment the poor indicator variable in the
initial example, let's look at two aspects of a generic multiple
regression: the regression equation and the data.

The regression equation, containing all the predictors, does not
specify an order in which the predictors entered the model.  At that
stage they are all included.  We could make a partial regression plot
for each of them, so the interpretation of the coefficient of each
predictor in the model should reflect the fact of the adjustment for
the contributions of the other predictors.  In other words, all the
predictors are in the equation together, and the role of each takes
into account the contributions of all the others.  Holding the values
of the other predictors constant is simply not part of the description
of the role of an individual predictor.

Once the fitted regression equation is in hand, the analyst must
decide how to use it.  If the data support predictions that change one
variable and hold other variables constant at specified values, well
and good.  It should usually be possible to make such predictions to
at least a limited extent, because we have tacitly assumed that we
have a good model.  Some sets of data are designed to support such
predictions over a sizable region of "predictor space."

My point is that the data determine the extent to which an analyst can
make such predictions.  Thus, the analyst has the obligation to
explain which predictions are well supported by the data and which are
extrapolations (and, if so, by how much).  The correct general
interpretation helps to keep that obligation in view, and it avoids
the impression that one can assign arbitrary constant values to the
other predictors.  In some situations it is not hard to imagine that
an incautious policymaker would manipulate a single policy-relevant
variable and be surprised at the unintended consequences that emerge
when other variables change along with it.

I have not had time to look at the details of the example that you
mentioned.  I would have a problem with unqualified extrapolations.
It is probably not a good idea to make comparisons involving BMI
between men and women outside the interval of BMI where one has data
from both men and women. If the predictions were accompanied by
appropriate confidence intervals, the widths of the CIs might give
some warning of the extrapolation, but I would prefer careful
examination of the extent of the data.

David Hoaglin

On Mon, Apr 8, 2013 at 11:28 AM, Richard Williams
<richardwilliams.ndu@gmail.com> wrote:
> Thanks for your detailed response David. I appreciate it and I will go over
> it carefully.
>
> I still wonder when and how the "common phrasing" is often incorrect, and,
> more critically, what terrible harms result from using that phrasing. With
> the problem that started this discussion, the common phrasing seemed to
> provide a straightforward explanation of why one shouldn't get too hung up
> on the sign and significance of the OC_D dummy once interactions are in the
> model. To me, the "common phrasing" may not technically reflect how
> regression works, but it does describe the logical implications of the
> regression models once you have estimated them. Your preferred phrasing, on
> the other hand, even if it is technically correct, strikes me as being very
> difficult to understand and is not at all intuitive. But again, I will go
> over your points more carefully.
>
> I am curious, do you have objections to this example from the Stata Manuals?
>
> use http://www.stata-press.com/data/r12/nhanes2
> logistic highbp sex##agegrp##c.bmi
> margins sex, at(bmi=(10(5)65))
> marginsplot, xlabel(10(10)60)
> sum bmi if female
> sum bmi if !female
>
> It includes values of BMI that are out of range for both men and women; and,
> perhaps more critically, it includes values that women have that men do not.
> That is, the maximum BMI value for men is 53 and the maximum value for women is
> 61, and comparisons are made all the way up to 65. So, do you think it is
> only legitimate to do this across the range of 14 to 53, a range which both
> men and women have values for?
```
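The reply's suggestion to compare men and women only over the BMI interval where both have data could be sketched as follows. This adapts the Stata Manuals example quoted in the thread. The local macro names (`fmin`, `fmax`, `lo`, `hi`) and the rounding inward to whole BMI units are illustrative assumptions, not part of the original example.

```stata
* Sketch only: restrict the margins grid to the BMI range shared by both sexes
use http://www.stata-press.com/data/r12/nhanes2, clear

* Observed BMI range for women
summarize bmi if female
local fmin = r(min)
local fmax = r(max)

* Observed BMI range for men, combined with the women's range;
* round inward so the grid stays inside the overlap
summarize bmi if !female
local lo = ceil(max(`fmin', r(min)))
local hi = floor(min(`fmax', r(max)))

* Same model as in the quoted example, but with predictions
* limited to the common range
logistic highbp sex##agegrp##c.bmi
margins sex, at(bmi=(`lo'(5)`hi'))
marginsplot
```

Checking only the overall range is a coarse test. With `agegrp` in the model, the overlap within each age group may be narrower, so a per-group `summarize` (or a plot of the data) would be a reasonable further check before interpreting the margins.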