RE: st: Interaction terms interpretation when one variable is omitted

From   "Mirnezami, Oliver"
Subject   RE: st: Interaction terms interpretation when one variable is omitted
Date   Tue, 16 Apr 2013 19:05:47 +0000

Dear David

Thank you ever so much for the detailed reply.

Yes, I completely agree that the 'disabled' category has too few observations, so I have combined it with the 'not in labour force' category.

Apologies for any confusion regarding the treat_status variable. I can confirm that the categories are indeed mutually exclusive and exhaustive. The data have been restricted so that everyone in the regressions is either in the control group (employed) or has lost their job between the previous and current wave and so falls into one of the several treatment categories, e.g. treat_emp, treat_unemp, treat_ret etc. [Just to clarify: treat_emp means that the individual lost their job at some point between the previous and current wave but has found employment in the current wave, whereas those in treat_unemp have not found employment in the current wave following job loss.]

I do include a female dummy variable along with a selection of other variables including ethnicity, education, wealth, smoking, BMI and a series of industry and firm size dummies; I just hadn't shown these here, for simplicity. Married is a dummy variable indicating whether an individual is married in the current period, and both categories have large enough frequencies. I will investigate interaction terms as well, as you suggested. Regarding age, since the dataset is the US Health and Retirement Study, the average age of individuals in my regression is about 55. I'm going to look into what you suggested regarding the functional form - thank you in particular for this tip!

When I initially tried running my panel regression (again leaving out some variables here for simplicity) and then obtaining the residuals, I got an error:

xtreg health treat_all ln_income female y1996 y1998 y2000 y2002, fe vce(cluster id)
predict res, r
option r not allowed

It seems to work if I use reg rather than xtreg and remove the fe option, but I need to use xtreg, fe.
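
For what it is worth, the error appears to come from the option name rather than from the model: after xtreg, fe, predict does not accept the residuals option that regress uses, and offers e (the idiosyncratic residual), u (the estimated individual effect) and ue (their sum) instead. A minimal sketch, reusing the command above and assuming the panel has already been declared with xtset (the new variable names are illustrative):

```stata
xtreg health treat_all ln_income female y1996 y1998 y2000 y2002, fe vce(cluster id)

* idiosyncratic residual e_it (closest analogue of -predict, residuals- after regress)
predict double res_e, e

* estimated individual effect u_i, and the combined residual u_i + e_it
predict double res_u, u
predict double res_ue, ue
```

Which of the three is wanted depends on whether the individual effects should count as part of the residual.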

Regarding the comparison of coefficients on treat_status: after combining the disabled category with 'not in labour force' and re-running the regression with i.treat_status, all of the treat_emp, treat_unemp, treat_ret and treat_nlbrf coefficients are insignificant. I did an F test for joint significance and got a p-value of 0.85. I also tried just treat_all (all categories combined) on its own, and this had an individual p-value of 0.37.

Given these p-values, I see what you mean when you say that the health of the people who are not employed does not differ from the health of the people who are employed, after adjusting for the contributions of the various explanatory variables. However, can I make any comments on the relative size of the coefficients, even if they are not significant or significantly different from each other? For example, using the output below, could I say that the effect of job loss on health appears stronger for those who are unemployed than for those who gain re-employment (-0.095 vs -0.023), although the effect is statistically insignificant? Or that, because the sign is negative for all treatment categories, treatment appears to have a negative effect on health compared to the control group, although the effect is statistically insignificant? It's just that this makes intuitive sense. Or is it simply not worth commenting on?

You also suggested assessing the contribution of treat_status as a whole by running the regression without it and comparing the two models. I did this and the coefficients were pretty much exactly the same, so treat_status doesn't seem to alter the model much at all.

   treat_emp |  -.0229748   .0380209    -0.60   0.546    -.0975028    .0515533
 treat_unemp |   -.094682   .1218764    -0.78   0.437    -.3335827    .1442186
   treat_ret |  -.0540969   .1030903    -0.52   0.600    -.2561732    .1479795
 treat_nlbrf |  -.0968564   .1647246    -0.59   0.557    -.4197475    .2260347
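
One way to put the comparison between categories on a formal footing is to test the difference directly instead of comparing the two estimates by eye. A minimal sketch, assuming the recoded treat_status (0 = control as the base, 1 = treat_emp, 2 = treat_unemp, 3 = treat_ret, 4 = not in labour force) and the shortened covariate list used above:

```stata
xtreg health i.treat_status ln_income female y1996 y1998 y2000 y2002, fe vce(cluster id)

* joint test that all treat_status coefficients are zero
testparm i.treat_status

* difference between unemployed (2) and re-employed (1), with its own confidence interval
lincom 2.treat_status - 1.treat_status
```

The confidence interval reported by lincom shows how large or small the difference could plausibly be, which says more than the two point estimates alone.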

Another query I have is that when I run the model with fixed effects, as I showed you previously, many of the explanatory variables are insignificant, whereas with random effects more are significant and sometimes the signs change as well. Is it just the case that the fixed effects are capturing most of the variation, which is why the explanatory variables appear insignificant?
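
A common first check when fixed-effects and random-effects estimates disagree is a Hausman test. A minimal sketch, assuming the shortened covariate list above; the models are fitted without vce(cluster id) because hausman does not accept clustered standard errors, and the stored names fe_model and re_model are illustrative:

```stata
xtreg health i.treat_status ln_income female y1996 y1998 y2000 y2002, fe
estimates store fe_model

xtreg health i.treat_status ln_income female y1996 y1998 y2000 y2002, re
estimates store re_model

* compares the coefficients that the two models have in common
hausman fe_model re_model
```

Time-invariant variables such as female are absorbed by the individual effects in the fixed-effects fit, so they drop out there and are not part of the comparison.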

Another concern is that the regression samples typically contain multiple observations over time for each individual, although the actual number varies: some individuals appear only once whereas others appear several times. Is this an issue, and do you have any advice on how it can be resolved? I wondered if I should include some kind of weighting to account for this, but I am not sure of the theory behind it or how to do so in Stata.
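
Before deciding whether anything needs to be done about this, it may help to see how unbalanced the panel actually is. A minimal sketch, assuming id identifies individuals and a hypothetical variable wave identifies the survey wave:

```stata
xtset id wave

* pattern of participation: how many individuals appear in 1, 2, 3, ... waves
xtdescribe

* split the variation in health into between- and within-individual parts
xtsum health
```

xtreg, fe runs on unbalanced panels as they are; individuals observed only once contribute nothing to the within-individual estimates.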

Could you please elaborate on why logs base 10 are a more useful choice than natural logs? I logged income in order to get a distribution that looked more like a normal distribution, as a histogram showed that the raw data were skewed.
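
On the mechanics, switching the base only rescales the coefficient: since ln(x) = ln(10) * log10(x), the coefficient on log10 income equals ln(10) times the coefficient on ln income, and the fit is unchanged. One commonly given reason for preferring base 10 is readability, because a one-unit change in log10 income is a tenfold change in income. A minimal sketch, assuming a raw income variable named income with strictly positive values:

```stata
generate double log10_income = log10(income)

xtreg health i.treat_status log10_income female y1996 y1998 y2000 y2002, fe vce(cluster id)

* factor linking the two coefficients: b_log10 = ln(10) * b_ln
display ln(10)
```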

Thank you again. I really appreciate your detailed responses. 

Kind regards


-----Original Message-----
From: David Hoaglin
Sent: 13 April 2013 15:07
Subject: Re: st: Interaction terms interpretation when one variable is omitted

Dear Oliver,

Thank you for sharing the additional information.

With a frequency of only 5 out of over 30,000, the "disabled" category may not be viable.  You may want to consider omitting those persons or combining that category with another category (perhaps "not in labour force").  If that frequency distribution combines all the waves (9 or maybe 10 waves?), those 5 observations could come from one person.

Do the definitions of the categories of treat_status ensure that the categories are mutually exclusive (and exhaustive)?  For example, I would interpret "unemployed" as being in the labour force and "not in labour force" as excluding "retired."

Turning to the regression, do your data come only from men?  If not, should you include an indicator for women?

I gather that the model uses age as a continuous variable.  Depending on the range of ages in the data, the effect of age may not be linear.  You can get information on the contribution of age by running the regression without age as a predictor and getting the residuals (the health residuals), running that regression again with age as the dependent variable (instead of health) and getting those residuals (the age residuals), and then plotting the health residuals against the age residuals (a "partial regression plot" or "added variable plot").  Alternatively, you can replace continuous age in the regression with a categorical variable (using 5-year or even narrower intervals of age) and then plot the coefficients of those categories against the age at the middle of the category.  Here again the aim is to learn about the contribution of age after you have adjusted for the contributions of the other predictors.  The plot in the second approach, in particular, should suggest a functional form for age if its contribution is not linear (e.g., a linear spline; it is usually a mistake to attempt to deal with nonlinearity by automatically adding a quadratic term to the model).
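
The first approach can be sketched as follows. This is a minimal illustration that uses pooled OLS via regress for simplicity (not the fixed-effects model), and the list of other predictors and the new variable names are assumptions:

```stata
* remaining predictors (illustrative list; age is deliberately left out)
local others "i.treat_status i.married ln_income female"

* health residuals: health adjusted for the other predictors
regress health `others'
predict double health_res, residuals

* age residuals: age adjusted for the same predictors
regress age `others'
predict double age_res, residuals

* partial regression (added variable) plot with a smoother
twoway (scatter health_res age_res) (lowess health_res age_res)
```

A clear bend in the smoothed curve would point to a nonlinear contribution of age.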

How many categories does married have?  Do all of those categories have large enough frequencies?

If the logarithmic scale is appropriate for income as an explanatory variable, logs base 10 are a more useful choice than natural logs.

Should the model also include any interactions?

Before you compare the coefficients for the categories of treat_status, please take note of the fact that only the coefficient for treat_status = 5 has a P-value < .05 (the next-smallest P-value is .431), and that category contains only 5 observations!  Further examination of the data is needed.  Taking the P-values into account, my summary is that, except for individuals who are disabled, the health of the people who are not employed does not differ from the health of the people who are employed, after adjusting for the contributions of the various explanatory variables.  You can assess the contribution of treat_status as a whole by running the regression without it and comparing the two models.

The constant in the regression model refers to people who are employed in the first wave, have age = 0, are in the first category of married, and have ln(income) = 0.  You can make it more interpretable by centering age and the log of income at suitable values (not necessarily their means).
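
Centering can be sketched as follows, assuming ln_income holds the natural log of income; the reference values (age 55 and an income of 50,000) and the new variable names are illustrative, and the year effects are left out for brevity:

```stata
* centre age at 55 and log income at the log of 50,000
generate double age_c = age - 55
generate double ln_income_c = ln_income - ln(50000)

xtreg health i.treat_status age_c i.married ln_income_c, fe vce(cluster id)
```

The slope estimates are unchanged by centering; only the constant shifts, so that it refers to the chosen reference values rather than to age = 0 and ln(income) = 0.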

I am surprised that the constant you got from re-running the model with treat as the only (non-constant) predictor did not differ more from the constant in the full regression.  The explanation lies in a point that I made in my previous message: The definition of each regression coefficient (including the constant) includes the list of other predictors in the model.  (Many textbooks do not explain this.) In the full model, the constant is adjusted for the contributions of the various explanatory variables, whereas in the second model the constant is not adjusted.

I didn't understand which individuals were in the separate regressions.  The explanation in the preceding paragraph applies to the constants in those regressions (their definitions are not the same).

Using the categorical variable treat_status seems all right, once you have dealt with the issues that I have raised above (I'm not an expert in your subject area).  The discussion in this message is probably more than you bargained for, but I hope it is helpful.


David Hoaglin

On Fri, Apr 12, 2013 at 9:54 AM, Mirnezami, Oliver wrote:
> Dear David
> Thank you so much for your help.
> Following your advice, I've made a new variable treat_status, which is categorical: it equals 0 for the control group (anyone who is employed in the period), 1 if treat_emp == 1, 2 if treat_unemp == 1, 3 if treat_ret == 1, 4 if not in labour force, and 5 if disabled.
> treat_status |      Freq.     Percent        Cum.
> -------------+-----------------------------------
>            0 |     29,869       97.66       97.66
>            1 |        436        1.43       99.08
>            2 |         87        0.28       99.37
>            3 |        123        0.40       99.77
>            4 |         66        0.22       99.98
>            5 |          5        0.02      100.00
> -------------+-----------------------------------
>        Total |     30,586      100.00
> I then ran the following regression using the factor variable notation in Stata (I've included a few explanatory variables and also time dummies):
> xtreg health i.treat_status age i.married ln_income `yeareffects1994to2010', fe vce(cluster id)
> treat_status |
>            1 |  -.0196492   .0380383    -0.52   0.605    -.0942113    .0549129
>            2 |  -.0938826   .1191151    -0.79   0.431    -.3273705    .1396053
>            3 |  -.0601347   .1000886    -0.60   0.548    -.2563271    .1360578
>            4 |  -.0004453   .1684459    -0.00   0.998     -.330631    .3297403
>            5 |  -1.043159    .355558    -2.93   0.003    -1.740119   -.3461987
>              |
>        _cons |   5.382643   .1899976    28.33   0.000     5.010212    5.755074
> Can I then just compare these coefficients and say that, for example, people who are unemployed following job loss (category 2) have worse health than people who regain employment following job loss (category 1), i.e. compare -0.094 with -0.020? And that all of these labour force statuses after job loss result in worse health on average compared to my control group (category 0), who have not experienced job loss, since all have a negative sign relative to the reference group? Does the constant just refer to the value of the control group?
> One thing that I found confusing was that when I re-ran the regression using the original binary treatment variable (i.e. 0 = control group, 1 = job loss and any labour force status), the constant was slightly different than above when using the categorical variable (5.37 vs 5.38). Why are the constants not the same when both refer to the same control group?
>        treat |   -.032478   .0365249    -0.89   0.374    -.1040735    .0391176
>        _cons |   5.371252   .1897432    28.31   0.000      4.99932    5.743184
> To show you the construction of this variable (0 = same control group as in the categorical variable; 1 = the sum of all the labour force status categories):
>       treatj |      Freq.     Percent        Cum.
> -------------+-----------------------------------
>            0 |     29,869       97.66       97.66
>            1 |        717        2.34      100.00
> -------------+-----------------------------------
>        Total |     30,586      100.00
> One other query I had concerns what you said about the constant term and the definitions of the predictor variables. You said that 'when the model includes treat_emp, but not treat_unemp or treat_ret, the individuals whose values on treat_unemp or treat_ret are accounted for by the constant term, and the coefficient of treat_emp would be interpreted as a comparison between the individuals for whom treat_emp = 1 and the aggregate of all other individuals.'
> However, originally when I did the series of separate regressions, I only had individuals that were:
> 1) either in the control group or treat_emp. The individuals in treat_unemp or treat_ret etc. were not present in the regression.
> 2) either in the control group or treat_unemp. The individuals in treat_emp or treat_ret etc. were not present in the regression.
> 3) either in the control group or treat_ret. The individuals in treat_emp or treat_unemp etc. were not present in the regression.
> So I thought that it would be ok because the reference point (i.e. the control group) was always the same each time. I checked this though and the constant term was different in each regression which confused me.
> I think I will stick with the categorical factor variable approach you suggested as this seems to work ok - I would be grateful if you could confirm that my interpretation when using this approach is correct and would appreciate any additional clarity on my other queries, particularly regarding the constant term.
> Thank you again. I really appreciate all your help.
> Kind regards
> Oliver
