Re: st: indicator variable and interaction term different signs but both significant


From   Richard Williams <richardwilliams.ndu@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: indicator variable and interaction term different signs but both significant
Date   Mon, 08 Apr 2013 10:28:02 -0500

Thanks for your detailed response, David. I appreciate it and I will go over it carefully.

I still wonder when and how the "common phrasing" is often incorrect, and, more critically, what terrible harms result from using that phrasing. With the problem that started this discussion, the common phrasing seemed to provide a straightforward explanation of why one shouldn't get too hung up on the sign and significance of the OC_D dummy once interactions are in the model. To me, the "common phrasing" may not technically reflect how regression works, but it does describe the logical implications of the regression models once you have estimated them. Your preferred phrasing, on the other hand, even if it is technically correct, strikes me as being very difficult to understand and is not at all intuitive. But again, I will go over your points more carefully.

I am curious, do you have objections to this example from the Stata Manuals?

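* Load the NHANES II example data used in the Stata manuals
use http://www.stata-press.com/data/r12/nhanes2
* Logistic model with all interactions of sex, age group, and continuous BMI
logistic highbp sex##agegrp##c.bmi
* Predictive margins by sex at BMI = 10, 15, ..., 65
margins sex, at(bmi=(10(5)65))
marginsplot, xlabel(10(10)60)
* Observed BMI range for women and for men
sum bmi if female
sum bmi if !female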

It includes values of BMI that are out of range for both men and women; and, perhaps more critically, it includes values that women have that men do not: the maximum BMI value for men is 53 and the maximum value for women is 61, yet comparisons are made all the way up to 65. So, do you think it is only legitimate to do this across the range of 14 to 53, a range for which both men and women have values?
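
For comparison, here is a rough sketch of what the restricted version might look like. It is not from the manuals; the grid 15(5)50 is simply an illustrative choice that stays inside the 14 to 53 range mentioned above, and tabstat is used only to show the observed range within each sex.

```stata
* Sketch only: same data and model as the manual example above.
use http://www.stata-press.com/data/r12/nhanes2, clear
logistic highbp sex##agegrp##c.bmi

* Observed BMI range within each sex
tabstat bmi, by(sex) statistics(min max)

* Predictive margins restricted to BMI values observed for both sexes
* (the grid 15(5)50 is an assumed, illustrative choice)
margins sex, at(bmi=(15(5)50))
marginsplot, xlabel(15(5)50)
```

Whether the restricted grid is the only legitimate one is exactly the question being asked; the sketch only shows how the two versions would differ in practice.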

At 08:51 PM 4/7/2013, David Hoaglin wrote:
Richard,

Thanks for the thoughtful discussion.  I'm glad to elaborate.

The short answer, oversimplifying somewhat (but not a lot), is that
the "common phrasing" is incorrect, because it does not reflect the
way multiple regression works.  For reference, not tied to the present
example, one version of the common interpretation (which appears in
far too many books) is that a coefficient in a multiple regression
tells us about the change in y corresponding to an increase of 1 unit
in that predictor when the other predictors are held constant.  In
less categorical language, I usually say that, as a general
interpretation, it is oversimplified and often incorrect.  Thus, my
"preferred interpretation" is superior simply because it accurately
reflects the way multiple regression works (more below).

When you say, "According to the model ...," the phrase "when ... the
values of other variables are the same for both" is not actually
"according to the model."  The distinction may be clearer if you
consider the partial-regression plot (also called the "added-variable
plot") for a chosen predictor.  The vertical coordinate is the
residual from the regression of y on the other predictors, and the
horizontal coordinate is the residual from the regression of the
chosen predictor on the other predictors.  The slope of the regression
line through the origin of the partial-regression plot equals the
coefficient of the chosen predictor in the multiple regression (in
which the predictors are the chosen predictor and the other
predictors).  This result is straightforward mathematics, and it
motivates the interpretation that the coefficient of the chosen
predictor tells how the dependent variable changes per unit change in
that predictor after adjusting for simultaneous linear change in the
other predictors in the data at hand.  The adjustment consists of
freeing y (and the chosen predictor) of regression on the other
predictors.  The process of fitting a multiple regression model does
not hold those other predictors constant.  Cook and Weisberg (1982,
Section 2.3.2) give a proof.  I haven't tried to locate the earliest
proof, but Yule (1907, Section 9) has an elegant proof.  Mosteller and
Tukey (1977) have a chapter entitled "Woes of Regression Coefficients"
and a proof (in Section 14K).  The development of regression in the
introductory textbook by De Veaux et al. (2012) includes the correct
general interpretation.
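
The partial-regression result described above can be checked numerically. The following is a minimal sketch, not part of the original message: it assumes the auto.dta dataset shipped with Stata and treats mpg as the chosen predictor, with weight and foreign as the other predictors. Those variable choices are purely illustrative.

```stata
* Sketch only: auto.dta and the variables price, mpg, weight, foreign
* are stand-ins for illustration, not the model discussed in this thread.
sysuse auto, clear

* Full multiple regression; keep the coefficient of the chosen predictor
regress price mpg weight foreign
scalar b_full = _b[mpg]

* Vertical coordinate: residuals of y regressed on the other predictors
regress price weight foreign
predict double e_y, residuals

* Horizontal coordinate: residuals of the chosen predictor regressed
* on the other predictors
regress mpg weight foreign
predict double e_x, residuals

* Slope of the line through the origin of the partial-regression plot
regress e_y e_x, noconstant
scalar b_partial = _b[e_x]

display "full: " b_full "   partial: " b_partial

* Built-in added-variable plot for the same predictor
quietly regress price mpg weight foreign
avplot mpg
```

The two displayed values should agree up to rounding. The standard errors from the residual-on-residual regression will differ slightly from those of the full model, because that regression does not account for the degrees of freedom used in the two residualizing fits.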

My point about not extrapolating beyond the data is not moot, because
I was focusing mainly on size, leverage, litigation, private_D, and
same_D.

Multiple regression is often more complex than it appears.  To gain a
proper understanding, however, one has to come to grips with the
complexity.  The "held constant" interpretation of regression
coefficients introduces avoidable confusion and impedes proper
understanding.

I hope this discussion helps.

David Hoaglin

Cook RD, Weisberg S (1982).  Residuals and Influence in Regression.
Chapman and Hall.

De Veaux RD, Velleman PF, Bock DE (2012).  Stats: Data and Models, 3rd
ed.  Addison-Wesley.

Mosteller F, Tukey JW (1977).  Data Analysis and Regression.  Addison-Wesley.

Yule GU (1907).  On the theory of correlation for any number of
variables, treated by a new system of notation.  Proceedings of the
Royal Society of London. Series A, Containing Papers of a Mathematical
and Physical Character.  79:182-193.


On Sun, Apr 7, 2013 at 4:58 PM, Richard Williams
<richardwilliams.ndu@gmail.com> wrote:

> Thanks David, but I admit I am still confused. According to the model, it is
> the case that "The coefficient for OC_D is the predicted difference between
> an overconfident manager and a regular manager when MV = 0 and the values of
> other variables are the same for both." If MV = 0 is an uninteresting or
> impossible value, that is pretty much a worthless thing to know, but it is
> still a correct statement.
>
> Part of what I like about my phrasing (which appears to be a more or less
> common phrasing) is that I believe it helps make clear (perhaps along with
> some graphs) why you generally shouldn't make a big deal of the coefficient
> for the dummy variable, in this case OC_D. It is simply the predicted
> difference between the two groups at a specific point, MV = 0, a point that
> may not even be possible in practice. Lines go off to infinity in both
> directions, and if the lines are non-parallel (as when there are
> interactions) there will be an infinite number of possible differences
> between the two lines, most of which will be totally uninteresting. I used
> to have students making statements like "once you control for female *
> income, the effect of female switches from positive to negative" and they
> tried to come up with profound theoretical explanations for that.
>
> I agree with you about being careful about extrapolating beyond the range of
> the data, but if MV = 0 isn't even theoretically possible it is kind of a
> moot point. Testing the statistical significance of any predicted values you
> compute should also give you some protection.
>
> The main thing, though, is that I am confused by your preferred wording:
> "The appropriate general interpretation of an estimated coefficient is that
> it tells how the dependent variable changes per unit change in that
> predictor after adjusting for simultaneous linear change in the other
> predictors in the data at hand." Why exactly is that a superior wording? I'm
> not even totally sure what that means. Are you just trying to warn against
> extrapolating beyond the observed range of the data? If so I think there is
> probably a more straightforward way of phrasing it. And, I don't think it is
> clear what "simultaneous linear change in the other predictors" is supposed
> to mean. Nor do I think the wording makes it clear what substantive
> interpretation you should give to the coefficient for OC_D.
>
> I think we are in agreement on most points, i.e. we both think there is
> little point in making a big deal of the case MV = 0 when that may not be
> interesting or even possible -- but I don't understand why you think your
> preferred wording is better and other wordings are incorrect. But I'd be
> interested in hearing you elaborate.
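
The point in the quoted message, that the dummy's coefficient is just the predicted group difference where the interacted variable equals 0, can be illustrated with a small sketch. It is not from the thread: it assumes the auto.dta dataset shipped with Stata, with foreign standing in for OC_D and weight standing in for MV, and the weights 2000 and 4000 are arbitrary values inside the observed range.

```stata
* Sketch only: foreign plays the role of OC_D and weight the role of MV;
* neither the data nor the model comes from this thread.
sysuse auto, clear
regress price i.foreign##c.weight

* Coefficient of the dummy: predicted foreign-domestic difference at weight = 0
lincom 1.foreign

* The same difference at weights inside the observed range
summarize weight
lincom 1.foreign + 2000*1.foreign#c.weight
lincom 1.foreign + 4000*1.foreign#c.weight

* margins reports the same differences in one table
margins, dydx(foreign) at(weight=(0 2000 4000))
```

Because the two fitted lines are not parallel, each value of weight gives a different group difference, and the one at weight = 0 lies outside the observed data.
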
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
OFFICE: (574)631-6668, (574)631-6463
HOME:   (574)289-5227
EMAIL:  Richard.A.Williams.5@ND.Edu
WWW:    http://www.nd.edu/~rwilliam
