Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down on April 23, and its replacement, **statalist.org**, is already up and running.


From: David Hoaglin <dchoaglin@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Interaction terms interpretation when one variable is omitted
Date: Sat, 13 Apr 2013 10:06:55 -0400

Dear Oliver,

Thank you for sharing the additional information.

With a frequency of only 5 out of over 30,000, the "disabled" category may not be viable. You may want to consider omitting those persons or combining that category with another category (perhaps "not in labour force"). If that frequency distribution combines all the waves (9 or maybe 10 waves?), those 5 observations could come from one person.

Do the definitions of the categories of treat_status ensure that the categories are mutually exclusive (and exhaustive)? For example, I would interpret "unemployed" as being in the labour force and "not in labour force" as excluding "retired."

Turning to the regression, do your data come only from men? If not, should you include an indicator for women?

I gather that the model uses age as a continuous variable. Depending on the range of ages in the data, the effect of age may not be linear. You can get information on the contribution of age by running the regression without age as a predictor and getting the residuals (the health residuals), running that regression again with age as the dependent variable (instead of health) and getting those residuals (the age residuals), and then plotting the health residuals against the age residuals (a "partial regression plot" or "added variable plot"). Alternatively, you can replace continuous age in the regression with a categorical variable (using 5-year or even narrower intervals of age) and then plot the coefficients of those categories against the age at the middle of the category. Here again the aim is to learn about the contribution of age after you have adjusted for the contributions of the other predictors. The plot in the second approach, in particular, should suggest a functional form for age if its contribution is not linear (e.g., a linear spline; it is usually a mistake to attempt to deal with nonlinearity by automatically adding a quadratic term to the model).

How many categories does married have?
Do all of those categories have large enough frequencies?

If the logarithmic scale is appropriate for income as an explanatory variable, logs base 10 are a more useful choice than natural logs.

Should the model also include any interactions?

Before you compare the coefficients for the categories of treat_status, please take note of the fact that only the coefficient for treat_status = 5 has a P-value < .05 (the next-smallest P-value is .431), and that category contains only 5 observations! Further examination of the data is needed.

Taking the P-values into account, my summary is that, except for individuals who are disabled, the health of the people who are not employed does not differ from the health of the people who are employed, after adjusting for the contributions of the various explanatory variables.

You can assess the contribution of treat_status as a whole by running the regression without it and comparing the two models.

The constant in the regression model refers to people who are employed in the first wave, have age = 0, are in the first category of married, and have ln(income) = 0. You can make it more interpretable by centering age and the log of income at suitable values (not necessarily their means).

I am surprised that the constant you got from re-running the model with treat as the only (non-constant) predictor did not differ more from the constant in the full regression. The explanation lies in a point that I made in my previous message: The definition of each regression coefficient (including the constant) includes the list of other predictors in the model. (Many textbooks do not explain this.) In the full model, the constant is adjusted for the contributions of the various explanatory variables, whereas in the second model the constant is not adjusted.

I didn't understand which individuals were in the separate regressions.
The explanation in the preceding paragraph applies to the constants in those regressions (their definitions are not the same).

Using the categorical variable treat_status seems all right, once you have dealt with the issues that I have raised above (I'm not an expert in your subject area).

The discussion in this message is probably more than you bargained for, but I hope it is helpful.

Regards,

David Hoaglin

On Fri, Apr 12, 2013 at 9:54 AM, Mirnezami, Oliver <O.Y.Mirnezami@warwick.ac.uk> wrote:
> Dear David
>
> Thank you so much for your help.
>
> Following your advice, I've made a new variable treat_status which is a categorical variable and equals 0 for the control group (anyone who is employed in the period) and then takes a value of 1 if treat_emp ==1 , 2 if treat_unemp ==1 , 3 if treat_ret ==1 etc. 4 = not in labour force, 5 = disabled.
>
> treat_status |      Freq.     Percent        Cum.
> -------------+-----------------------------------
>            0 |     29,869       97.66       97.66
>            1 |        436        1.43       99.08
>            2 |         87        0.28       99.37
>            3 |        123        0.40       99.77
>            4 |         66        0.22       99.98
>            5 |          5        0.02      100.00
> -------------+-----------------------------------
>        Total |     30,586      100.00
>
> I then ran the following regression using the factor variable notation in Stata (I've included a few explanatory variables and also time dummies)
>
> xtreg health i.treat_status age i.married ln(income) `yeareffects1994to2010', fe vce(cluster id)
>
> treat_status |
>            1 |  -.0196492   .0380383    -0.52   0.605    -.0942113    .0549129
>            2 |  -.0938826   .1191151    -0.79   0.431    -.3273705    .1396053
>            3 |  -.0601347   .1000886    -0.60   0.548    -.2563271    .1360578
>            4 |  -.0004453   .1684459    -0.00   0.998     -.330631    .3297403
>            5 |  -1.043159    .355558    -2.93   0.003    -1.740119   -.3461987
>              |
>        _cons |   5.382643   .1899976    28.33   0.000     5.010212    5.755074
>
> Can I then just compare these coefficients and say that for example, people that are unemployed following job loss (category 2) have worse health than people who regain employment following job loss (category 1) i.e.
> compare -0.093 with -0.019. And all of these labour force statuses post job loss result in worse health on average compared to my control group (category 0) who have not experienced job loss as all have a negative sign in relation to the reference group. Does the constant just refer to the value of the control group?
>
> One thing that I found confusing was that when I re-ran the regression using the original binary treatment variable (i.e. 0 = control group, 1 = job loss and any labour force status), the constant was slightly different than above when using the categorical variable (5.37 vs 5.38). Why are the constants not the same when both refer to the same control group?
>
>        treat |   -.032478   .0365249    -0.89   0.374    -.1040735    .0391176
>        _cons |   5.371252   .1897432    28.31   0.000      4.99932    5.743184
>
> To show you the construction of this variable: (i.e. 0 = same control group as categorical. 1 is the sum of all labour force statuses categories.)
>
>       treatj |      Freq.     Percent        Cum.
> -------------+-----------------------------------
>            0 |     29,869       97.66       97.66
>            1 |        717        2.34      100.00
> -------------+-----------------------------------
>        Total |     30,586      100.00
>
> One other query I had was when you mentioned about the constant term and the definitions of the predictor variables. You said that 'when the model includes treat_emp, but not treat_unemp or treat_ret, the individuals whose values on treat_unemp or treat_ret are accounted for by the constant term, and the coefficient of treat_emp would be interpreted as a comparison between the individuals for whom treat_emp = 1 and the aggregate of all other individuals.'
>
> However, originally when I did the series of separate regressions, I only had individuals that were:
> 1) either in the control group or treat_emp. The individuals in treat_unemp or treat_ret etc. were not present in the regression.
> 2) either in the control group or treat_unemp. The individuals in treat_emp or treat_ret etc. were not present in the regression.
> 3) either in the control group or treat_ret. The individuals in treat_emp or treat_unemp etc. were not present in the regression.
>
> So I thought that it would be ok because the reference point (i.e. the control group) was always the same each time. I checked this though and the constant term was different in each regression which confused me.
>
> I think I will stick with the categorical factor variable approach you suggested as this seems to work ok - I would be grateful if you could confirm that my interpretation when using this approach is correct and would appreciate any additional clarity on my other queries, particularly regarding the constant term.
>
> Thank you again. I really appreciate all your help.
>
> Kind regards
>
> Oliver

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/
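
The added-variable plot for age described in the reply could be sketched in Stata roughly as follows. This is a minimal sketch under assumptions, not the code either poster used: `lninc` (log of income) and `year` (the wave variable, standing in for the year dummies in the thread) are hypothetical names, and the data are assumed to be already declared as a panel with `xtset`.

```stata
* Sketch only. Assumed names (not from the thread): lninc = log of income,
* year = wave variable; data already declared as a panel (e.g. xtset id year).

* Step 1: health on every predictor except age; keep the residuals e_it
xtreg health i.treat_status i.married lninc i.year if !missing(health, age), fe
predict double health_res, e

* Step 2: age on the same predictors, same observations; keep those residuals
xtreg age i.treat_status i.married lninc i.year if !missing(health, age), fe
predict double age_res, e

* Step 3: health residuals against age residuals, with a lowess smooth
*         to make any curvature easier to see
twoway (scatter health_res age_res, msize(tiny))   ///
       (lowess health_res age_res),                ///
       ytitle("Health residuals") xtitle("Age residuals")
```

A roughly straight smooth would be consistent with a linear age term; visible bends would point toward something like a linear spline.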
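
The second suggestion in the reply, replacing continuous age with 5-year categories, might look something like the sketch below. The names `lninc` and `year` and the 15 to 100 range passed to `at()` are assumptions for illustration. The sketch plots adjusted means from `margins` rather than the raw coefficients; in a model without age interactions, the differences between bands are the same, so the shape of the plot should match.

```stata
* Sketch only. Assumed names (not from the thread): lninc = log of income,
* year = wave variable.

* 5-year bands; egen's cut() codes each band by its lower limit (15, 20, ...).
* The 15-100 range is a guess: ages outside it would be set to missing.
egen age_band = cut(age), at(15(5)100)

xtreg health i.treat_status i.age_band i.married lninc i.year, fe vce(cluster id)

* Adjusted mean health for each age band, then a plot of those means.
* The x-axis shows each band's lower limit; add 2.5 for the band midpoint.
margins age_band
marginsplot
```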
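
Two further points from the reply, centering the predictors so the constant is interpretable and assessing treat_status as a whole, could be sketched as follows. The centering values (age 40, income of 10^4) and the names `income` and `year` are hypothetical. Note that `testparm` gives a joint Wald test, which is a related alternative to the refit-and-compare approach the message describes, not the same procedure; a likelihood-ratio comparison is generally not appropriate with cluster-robust standard errors.

```stata
* Sketch only. Assumed: income holds raw income, year is the wave variable.
* The centering points (age 40, income 10^4) are arbitrary illustrations.
generate double age_c      = age - 40
generate double log10inc_c = log10(income) - 4

xtreg health i.treat_status age_c i.married log10inc_c i.year, fe vce(cluster id)

* Joint Wald test that all treat_status coefficients are zero
testparm i.treat_status
```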

**Follow-Ups**:
- **RE: st: Interaction terms interpretation when one variable is omitted** *From:* "Mirnezami, Oliver" <O.Y.Mirnezami@warwick.ac.uk>

**References**:
- **st: Interaction terms interpretation when one variable is omitted** *From:* "Mirnezami, Oliver" <O.Y.Mirnezami@warwick.ac.uk>
- **Re: st: Interaction terms interpretation when one variable is omitted** *From:* David Hoaglin <dchoaglin@gmail.com>
- **RE: st: Interaction terms interpretation when one variable is omitted** *From:* "Mirnezami, Oliver" <O.Y.Mirnezami@warwick.ac.uk>
