

From: David Hoaglin <dchoaglin@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: indicator variable and interaction term different signs but both significant
Date: Sun, 7 Apr 2013 21:51:08 -0400

Richard,

Thanks for the thoughtful discussion. I'm glad to elaborate.

The short answer, oversimplifying somewhat (but not a lot), is that the "common phrasing" is incorrect, because it does not reflect the way multiple regression works.

For reference, not tied to the present example, one version of the common interpretation (which appears in far too many books) is that a coefficient in a multiple regression tells us about the change in y corresponding to an increase of 1 unit in that predictor when the other predictors are held constant. In less categorical language, I usually say that, as a general interpretation, it is oversimplified and often incorrect. Thus, my "preferred interpretation" is superior simply because it accurately reflects the way multiple regression works (more below).

When you say, "According to the model ...," the phrase "when ... the values of other variables are the same for both" is not actually "according to the model."

The distinction may be clearer if you consider the partial-regression plot (also called the "added-variable plot") for a chosen predictor. The vertical coordinate is the residual from the regression of y on the other predictors, and the horizontal coordinate is the residual from the regression of the chosen predictor on the other predictors. The slope of the regression line through the origin of the partial-regression plot equals the coefficient of the chosen predictor in the multiple regression (in which the predictors are the chosen predictor and the other predictors). This result is straightforward mathematics, and it motivates the interpretation that the coefficient of the chosen predictor tells how the dependent variable changes per unit change in that predictor after adjusting for simultaneous linear change in the other predictors in the data at hand. The adjustment consists of freeing y (and the chosen predictor) of regression on the other predictors.
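The partial-regression result can be checked numerically. The following is a minimal sketch in Python with NumPy, not part of the original discussion: the data are simulated, and the names `x1` (the chosen predictor), `x2` and `x3` (the other predictors), and the helper `ols` are assumptions made purely for illustration.

```python
import numpy as np

# Simulated data (hypothetical): x1 is the chosen predictor, and it is
# correlated with the other predictors x2 and x3.
rng = np.random.default_rng(0)
n = 200
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = 0.6 * x2 - 0.4 * x3 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + 0.5 * x3 + rng.normal(size=n)


def ols(X, response):
    """Least-squares coefficients for response regressed on the columns of X."""
    return np.linalg.lstsq(X, response, rcond=None)[0]


ones = np.ones(n)

# Full multiple regression: y on constant, x1, x2, x3.
X_full = np.column_stack([ones, x1, x2, x3])
b_full = ols(X_full, y)

# Free y and x1 of regression on the other predictors (constant, x2, x3).
X_other = np.column_stack([ones, x2, x3])
e_y = y - X_other @ ols(X_other, y)      # vertical coordinate of the plot
e_x1 = x1 - X_other @ ols(X_other, x1)   # horizontal coordinate of the plot

# Slope of the line through the origin in the partial-regression plot.
slope_avp = (e_x1 @ e_y) / (e_x1 @ e_x1)

print("coefficient of x1 in the multiple regression:", b_full[1])
print("slope in the partial-regression plot:        ", slope_avp)
```

The two printed numbers should agree up to floating-point error. Nothing in either computation fixes `x2` or `x3` at a constant value; both `y` and `x1` are only adjusted for their linear dependence on those predictors in the data at hand.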
The process of fitting a multiple regression model does not hold those other predictors constant.

Cook and Weisberg (1982, Section 2.3.2) give a proof. I haven't tried to locate the earliest proof, but Yule (1907, Section 9) has an elegant proof. Mosteller and Tukey (1977) have a chapter entitled "Woes of Regression Coefficients" and a proof (in Section 14K). The development of regression in the introductory textbook by De Veaux et al. (2012) includes the correct general interpretation.

My point about not extrapolating beyond the data is not moot, because I was focusing mainly on size, leverage, litigation, private_D, and same_D.

Multiple regression is often more complex than it appears. To gain a proper understanding, however, one has to come to grips with the complexity. The "held constant" interpretation of regression coefficients introduces avoidable confusion and impedes proper understanding.

I hope this discussion helps.

David Hoaglin

Cook RD, Weisberg S (1982). Residuals and Influence in Regression. Chapman and Hall.

De Veaux RD, Velleman PF, Bock DE (2012). Stats: Data and Models, 3rd ed. Addison-Wesley.

Mosteller F, Tukey JW (1977). Data Analysis and Regression. Addison-Wesley.

Yule GU (1907). On the theory of correlation for any number of variables, treated by a new system of notation. Proceedings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character. 79:182-193.

On Sun, Apr 7, 2013 at 4:58 PM, Richard Williams <richardwilliams.ndu@gmail.com> wrote:
> Thanks David, but I admit I am still confused. According to the model, it is
> the case that "The coefficient for OC_D is the predicted difference between
> an overconfident manager and a regular manager when MV = 0 and the values of
> other variables are the same for both." If MV = 0 is an uninteresting or
> impossible value, that is pretty much a worthless thing to know, but it is
> still a correct statement.
>
> Part of what I like about my phrasing (which appears to be a more or less
> common phrasing) is that I believe it helps make clear (perhaps along with
> some graphs) why you generally shouldn't make a big deal of the coefficient
> for the dummy variable, in this case OC_D. It is simply the predicted
> difference between the two groups at a specific point, MV = 0, a point that
> may not even be possible in practice. Lines go off to infinity in both
> directions, and if the lines are non-parallel (as when there are
> interactions) there will be an infinite number of possible differences
> between the two lines, most of which will be totally uninteresting. I used
> to have students making statements like "once you control for female *
> income, the effect of female switches from positive to negative" and they
> tried to come up with profound theoretical explanations for that.
>
> I agree with you about being careful about extrapolating beyond the range of
> the data, but if MV = 0 isn't even theoretically possible it is kind of a
> moot point. Testing the statistical significance of any predicted values you
> compute should also give you some protection.
>
> The main thing, though, is that I am confused by your preferred wording:
> "The appropriate general interpretation of an estimated coefficient is that
> it tells how the dependent variable changes per unit change in that
> predictor after adjusting for simultaneous linear change in the other
> predictors in the data at hand." Why exactly is that a superior wording? I'm
> not even totally sure what that means. Are you just trying to warn against
> extrapolating beyond the observed range of the data? If so I think there is
> probably a more straightforward way of phrasing it. And, I don't think it is
> clear what "simultaneous linear change in the other predictors" is supposed
> to mean. Nor do I think the wording makes it clear what substantive
> interpretation you should give to the coefficient for OC_D.
>
> I think we are in agreement on most points, i.e. we both think there is
> little point in making a big deal of when MV = 0 when that may not be
> interesting or even possible -- but I don't understand why you think your
> preferred wording is better and other wordings are incorrect. But I'd be
> interested in hearing you elaborate.
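The point in the quoted message about the dummy coefficient in an interaction model can be illustrated with a small simulation. This is only a sketch in Python with NumPy under assumed conditions: the data are invented, the variable names `OC_D` and `MV` are borrowed from the thread, and the model is taken to have the form y = b0 + b1*OC_D + b2*MV + b3*OC_D*MV, which the thread does not spell out. The helpers `predict` and `group_difference` are hypothetical.

```python
import numpy as np

# Invented data: MV is observed only between 2 and 10, so MV = 0 lies
# outside the range of the data.
rng = np.random.default_rng(1)
n = 300
MV = rng.uniform(2.0, 10.0, size=n)
OC_D = rng.integers(0, 2, size=n).astype(float)
y = 1.0 + 3.0 * OC_D + 0.5 * MV - 0.6 * OC_D * MV + rng.normal(scale=0.5, size=n)

# Fit y on constant, OC_D, MV, and the interaction OC_D * MV.
X = np.column_stack([np.ones(n), OC_D, MV, OC_D * MV])
b = np.linalg.lstsq(X, y, rcond=None)[0]


def predict(oc_d, mv):
    """Fitted value for a given group indicator and value of MV."""
    return b @ np.array([1.0, oc_d, mv, oc_d * mv])


def group_difference(mv):
    """Predicted difference between the two groups at a given MV."""
    return predict(1.0, mv) - predict(0.0, mv)


print("coefficient of OC_D:       ", b[1])
print("group difference at MV = 0:", group_difference(0.0))
for mv in (2.0, 6.0, 10.0):
    print(f"group difference at MV = {mv}:", group_difference(mv))
```

The difference at MV = 0 reproduces the coefficient of OC_D, while the difference at any other value of MV is that coefficient plus MV times the interaction coefficient. In this sketch MV = 0 lies outside the simulated range, so the coefficient of OC_D by itself describes a comparison that the data do not cover.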

