Re: st: Interaction terms interpretation when one variable is omitted

From   David Hoaglin
Subject   Re: st: Interaction terms interpretation when one variable is omitted
Date   Sat, 13 Apr 2013 10:06:55 -0400

Dear Oliver,

Thank you for sharing the additional information.

With a frequency of only 5 out of over 30,000, the "disabled" category
may not be viable.  You may want to consider omitting those persons or
combining that category with another category (perhaps "not in labour
force").  If that frequency distribution combines all the waves (9 or
maybe 10 waves?), those 5 observations could come from one person.
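
A minimal sketch of both checks, assuming id is the person identifier (as
in vce(cluster id)) and that "not in labour force" is the category to
absorb "disabled"; treat_status4 is a hypothetical name for the new variable:

```stata
* How many distinct persons contribute the 5 "disabled" observations?
tabulate id if treat_status == 5

* One option: fold "disabled" (5) into "not in labour force" (4),
* leaving the original variable untouched
recode treat_status (5 = 4), generate(treat_status4)
tabulate treat_status4
```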

Do the definitions of the categories of treat_status ensure that the
categories are mutually exclusive (and exhaustive)?  For example, I
would interpret "unemployed" as being in the labour force and "not in
labour force" as excluding "retired."

Turning to the regression, do your data come only from men?  If not,
should you include an indicator for women?

I gather that the model uses age as a continuous variable.  Depending
on the range of ages in the data, the effect of age may not be linear.
 You can get information on the contribution of age by running the
regression without age as a predictor and getting the residuals (the
health residuals), running that regression again with age as the
dependent variable (instead of health) and getting those residuals
(the age residuals), and then plotting the health residuals against
the age residuals (a "partial regression plot" or "added variable
plot").  Alternatively, you can replace continuous age in the
regression with a categorical variable (using 5-year or even narrower
intervals of age) and then plot the coefficients of those
categories against the age at the middle of the category.  Here again
the aim is to learn about the contribution of age after you have
adjusted for the contributions of the other predictors.  The plot in
the second approach, in particular, should suggest a functional form
for age if its contribution is not linear (e.g., a linear spline ---
it is usually a mistake to attempt to deal with nonlinearity by
automatically adding a quadratic term to the model).
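
The two approaches might be sketched as follows.  This is only an
illustration under assumptions: the data are already xtset, ln_income is
a hypothetical variable holding the log of income, the year-effects macro
is the one from the original command, and health_res, age_res, and
age_band are made-up names.

```stata
* Approach 1: partial regression (added-variable) plot for age
* Health residuals: the model without age
xtreg health i.treat_status i.married ln_income `yeareffects1994to2010', fe
predict double health_res, e

* Age residuals: same predictors, age as the dependent variable
xtreg age i.treat_status i.married ln_income `yeareffects1994to2010', fe
predict double age_res, e

* Plot health residuals against age residuals, with a smooth for guidance
twoway (scatter health_res age_res) (lowess health_res age_res)

* Approach 2: replace continuous age with 5-year bands
* (the band width is an assumption; narrower bands are possible)
generate age_band = 5 * floor(age / 5)
xtreg health i.treat_status i.age_band i.married ln_income ///
    `yeareffects1994to2010', fe vce(cluster id)

* Predictive margins by band differ from the band coefficients only by a
* constant shift, so this plot shows the same shape
margins age_band
marginsplot
```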

How many categories does married have?  Do all of those categories
have large enough frequencies?

If the logarithmic scale is appropriate for income as an explanatory
variable, logs base 10 are a more useful choice than natural logs.

Should the model also include any interactions?

Before you compare the coefficients for the categories of
treat_status, please take note of the fact that only the coefficient
for treat_status = 5 has a P-value < .05 (the next-smallest P-value is
.431), and that category contains only 5 observations!  Further
examination of the data is needed.  Taking the P-values into account,
my summary is that, except for individuals who are disabled, the
health of the people who are not employed does not differ from the
health of the people who are employed, after adjusting for the
contributions of the various explanatory variables.  You can assess
the contribution of treat_status as a whole by running the regression
without it and comparing the two models.
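
With cluster-robust standard errors a likelihood-ratio comparison is not
available, so one way to make that comparison is a joint Wald test of the
treat_status coefficients.  A sketch, again assuming ln_income is a
hypothetical variable holding the log of income:

```stata
* Fit the full model, then test all treat_status coefficients jointly
xtreg health i.treat_status age i.married ln_income ///
    `yeareffects1994to2010', fe vce(cluster id)
testparm i.treat_status
```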

The constant in the regression model refers to people who are employed
in the first wave, have age = 0, are in the first category of married,
and have ln(income) = 0.  You can make it more interpretable by
centering age and the log of income at suitable values (not
necessarily their means).
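
As a sketch of that centering (the centering values, age 40 and an income
of 10,000, are purely illustrative assumptions, as are the variable names
age_c and log10_income_c):

```stata
* Center age at 40 and log10(income) at 4, i.e. an income of 10,000
generate age_c = age - 40
generate log10_income_c = log10(income) - 4

* The constant now refers to age 40 and income 10,000 instead of
* age 0 and log income 0
xtreg health i.treat_status age_c i.married log10_income_c ///
    `yeareffects1994to2010', fe vce(cluster id)
```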

I am surprised that the constant you got from re-running the model
with treat as the only (non-constant) predictor did not differ more
from the constant in the full regression.  The explanation lies in a
point that I made in my previous message: The definition of each
regression coefficient (including the constant) includes the list of
other predictors in the model.  (Many textbooks do not explain this.)
In the full model, the constant is adjusted for the contributions of
the various explanatory variables, whereas in the second model the
constant is not adjusted.

I didn't understand which individuals were in the separate
regressions.  The explanation in the preceding paragraph applies to
the constants in those regressions (their definitions are not the same).

Using the categorical variable treat_status seems all right, once you
have dealt with the issues that I have raised above (I'm not an expert
in your subject area).  The discussion in this message is probably
more than you bargained for, but I hope it is helpful.


David Hoaglin

On Fri, Apr 12, 2013 at 9:54 AM, Mirnezami, Oliver wrote:
> Dear David
> Thank you so much for your help.
> Following your advice, I've made a new variable treat_status which is a categorical variable and equals 0 for the control group (anyone who is employed in the period) and then takes a value of 1 if treat_emp ==1 , 2 if treat_unemp ==1 , 3 if treat_ret ==1 etc. 4 = not in labour force, 5 = disabled.
> treat_status |      Freq.     Percent        Cum.
> -------------+-----------------------------------
>            0 |     29,869       97.66       97.66
>            1 |        436        1.43       99.08
>            2 |         87        0.28       99.37
>            3 |        123        0.40       99.77
>            4 |         66        0.22       99.98
>            5 |          5        0.02      100.00
> -------------+-----------------------------------
>        Total |     30,586      100.00
> I then ran the following regression using the factor variable notation in Stata (I've included a few explanatory variables and also time dummies)
> xtreg health i.treat_status age i.married ln(income) `yeareffects1994to2010', fe vce(cluster id)
>             health |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
> -------------------+----------------------------------------------------------------
>       treat_status |
>                 1  |  -.0196492   .0380383    -0.52   0.605    -.0942113    .0549129
>                 2  |  -.0938826   .1191151    -0.79   0.431    -.3273705    .1396053
>                 3  |  -.0601347   .1000886    -0.60   0.548    -.2563271    .1360578
>                 4  |  -.0004453   .1684459    -0.00   0.998     -.330631    .3297403
>                 5  |  -1.043159    .355558    -2.93   0.003    -1.740119   -.3461987
>                    |
>              _cons |   5.382643   .1899976    28.33   0.000     5.010212    5.755074
> Can I then just compare these coefficients and say that for example, people that are unemployed following job loss (category 2) have worse health than people who regain employment following job loss (category 1) i.e. compare -0.093 with -0.019. And all of these labour force statuses post job loss result in worse health on average compared to my control group (category 0) who have not experienced job loss as all have a negative sign in relation to the reference group. Does the constant just refer to the value of the control group?
> One thing that I found confusing was that when I re-ran the regression using the original binary treatment variable (i.e. 0 = control group, 1 = job loss and any labour force status), the constant was slightly different than above when using the categorical variable (5.37 vs 5.38). Why are the constants not the same when both refer to the same control group?
>              treat |   -.032478   .0365249    -0.89   0.374    -.1040735    .0391176
>              _cons |   5.371252   .1897432    28.31   0.000      4.99932    5.743184
> To show you the construction of this variable: (i.e. 0 = same control group as categorical. 1 is the sum of all labour force statuses categories.)
>       treatj |      Freq.     Percent        Cum.
> -------------+-----------------------------------
>            0 |     29,869       97.66       97.66
>            1 |        717        2.34      100.00
> -------------+-----------------------------------
>        Total |     30,586      100.00
> One other query I had was when you mentioned about the constant term and the definitions of the predictor variables. You said that 'when the model includes treat_emp, but not treat_unemp or treat_ret, the individuals whose values on treat_unemp or treat_ret are accounted for by the constant term, and the coefficient of treat_emp would be interpreted as a comparison between the individuals for whom treat_emp = 1 and the aggregate of all other individuals.'
> However, originally when I did the series of separate regressions, I only had individuals that were:
> 1) either in the control group or treat_emp. The individuals in treat_unemp or treat_ret etc. were not present in the regression.
> 2) either in the control group or treat_unemp. The individuals in treat_emp or treat_ret etc. were not present in the regression.
> 3) either in the control group or treat_ret. The individuals in treat_emp or treat_unemp etc. were not present in the regression.
> So I thought that it would be ok because the reference point (i.e. the control group) was always the same each time. I checked this though and the constant term was different in each regression which confused me.
> I think I will stick with the categorical factor variable approach you suggested as this seems to work ok - I would be grateful if you could confirm that my interpretation when using this approach is correct and would appreciate any additional clarity on my other queries, particularly regarding the constant term.
> Thank you again. I really appreciate all your help.
> Kind regards
> Oliver
