Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down at the end of May, and its replacement, **statalist.org**, is already up and running.


From: Michael Norman Mitchell <Michael.Norman.Mitchell@gmail.com>
Subject: Re: Re-re-post: Stata 11 - Factor variables in a regression command
Date: Sat, 01 May 2010 10:31:05 -0700

Greetings

Richard Williams wrote...

--- snip ---
As the original example shows, the fits produced by the first two syntaxes are identical.
--- snip ---

I completely agree with Richard that

. logistic y a#b

and

. logistic y a##b

are two different ways of parameterizing a model with two categorical predictors. If we let factor a have A levels and factor b have B levels, then both models will have

  (A-1) + (B-1) + (A-1)*(B-1)

parameters in the model. In fact, this illustrates how the parameters are decomposed in the traditional parameterization (i.a i.b a#b): into the "main effect of a" (A-1 df), the "main effect of b" (B-1 df), and the "a by b interaction" ((A-1)*(B-1) df).

If, instead, one specifies -a#b- alone, this single term has (A-1) + (B-1) + (A-1)*(B-1) degrees of freedom, and is no longer partitioned into a main effect of a, a main effect of b, and an interaction. The omnibus test of this term is the overall test of the null hypothesis that there is simultaneously no main effect of a, no main effect of b, and no a by b interaction. As I show below, it simply tests the equality of the means in all of the cells. I think this is rarely of research interest when one has this kind of "factorial" layout.

So, if this is what the omnibus test is doing, what about the individual parameters? Looking at Ricardo's initial example:

----------------------------------------------------------------------------
          y | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Int.]
------------+---------------------------------------------------------------
        a#b |
       0 1  |   1.567419   .2804138     2.51   0.012     1.1038     2.2256
       1 0  |   1.447424   .2588797     2.07   0.039     1.0194     2.0551
       1 1  |   1.211988   .2246236     1.04   0.300     .84283     1.7428
----------------------------------------------------------------------------

Note how this is much like a "oneway" layout of the data, with four groups, one of which is omitted (the group a=0 b=0). So, each of these parameters tests whether its "cell" differs from the omitted cell.
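Two of the claims above can be checked with a short sketch, written in Python rather than Stata, with made-up cell counts used purely for illustration: the decomposition arithmetic always gives A*B - 1 parameters (one per non-reference cell), and in the saturated -a#b- model each reported odds ratio is simply a cell's observed odds divided by the odds of the omitted cell a=0 b=0.

```python
# Degrees-of-freedom check: main effects plus interaction use exactly
# as many parameters as a oneway layout with one cell omitted.
for A in range(2, 7):
    for B in range(2, 7):
        assert (A - 1) + (B - 1) + (A - 1) * (B - 1) == A * B - 1

# In the saturated -a#b- model the fitted cell probabilities equal the
# observed proportions, so each odds ratio is just a cell's odds
# divided by the odds of the omitted cell a=0 b=0.
# Hypothetical counts: counts[(a, b)] = (failures, successes).
counts = {(0, 0): (400, 100), (0, 1): (350, 150),
          (1, 0): (360, 140), (1, 1): (380, 120)}

def odds(cell):
    failures, successes = counts[cell]
    return successes / failures

reference = odds((0, 0))
odds_ratios = {cell: odds(cell) / reference
               for cell in counts if cell != (0, 0)}
print(odds_ratios)
```

Each entry of `odds_ratios` corresponds to one row of the -a#b- table: a comparison of that cell against the reference cell.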
That is, the first parameter tests whether the cell labeled a=0 b=1 differs from the cell a=0 b=0. It is as though the design had been converted into four groups (labeled 1 2 3 4, with group 1 the omitted group corresponding to a=0 b=0). The tests then compare group 2 vs. 1, group 3 vs. 1, and group 4 vs. 1. The omnibus test of all the parameters, as noted above, tests the equality of all of the cell means.

Returning to Richard's point: as he notes, this is just an alternative parameterization of the original model, one in which each cell is compared to a reference cell. If this is the series of comparisons a researcher wants to make, it is a very useful parameterization.

I hope that is useful to Ricardo, and any other readers.

Best regards,

Michael

Michael N. Mitchell
See the Stata tidbit of the week at...
http://www.MichaelNormanMitchell.com

On 2010-05-01 8:50 AM, Richard Williams wrote:
> At 01:42 AM 5/1/2010, Michael Norman Mitchell wrote:
>> Dear Ricardo
>>
>> The command
>>
>> . logistic y a#b
>>
>> includes just the interaction of "a by b", and does not include the
>> main effect of a, nor the main effect of b. By contrast, the command
>>
>> . logistic y a##b
>>
>> includes the main effect of a, the main effect of b, as well as the
>> a by b interaction. It is equivalent to typing
>>
>> . logistic y a#b a b
>
> I don't think this is quite right. As the original example shows, the
> fits produced by the first two syntaxes are identical. So, a#b and
> a##b are different ways of parameterizing the models. a##b gives you
> the main effect of a, the main effect of b, and the interaction,
> i.e. it is the same as entering a, b, and a*b in the model. a*b = 1
> if a and b both equal 1, 0 otherwise. I believe this is equivalent to
> your 3rd syntax, except I would say i.a and i.b so Stata knows these
> are categorical variables.
>
> With a#b, there are four possible combinations of values: 0 0, 0 1,
> 1 0, and 1 1. The first gets dropped and the other three are in the
> model. These are two parameterizations of the same model; personally
> I prefer the a##b approach because it separates main effects from
> interaction effects.
>
> The following example illustrates the 3 different approaches, and
> shows the equivalence of the last 2 approaches in Michael's example:
>
> . use "http://www.indiana.edu/~jslsoc/stata/spex_data/ordwarm2.dta", clear
> (77 & 89 General Social Survey)
>
> . logit warmlt2 yr89#male, nolog
>
> Logistic regression                             Number of obs =      2293
>                                                 LR chi2(3)    =     64.74
>                                                 Prob > chi2   =    0.0000
> Log likelihood = -851.54241                     Pseudo R2     =    0.0366
>
> ------------------------------------------------------------------------------
>      warmlt2 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
> -------------+----------------------------------------------------------------
>    yr89#male |
>         0 1  |   .1816812   .1431068     1.27   0.204     -.098803    .4621655
>         1 0  |  -1.295833    .229115    -5.66   0.000     -1.74489   -.8467762
>         1 1  |   -.659902   .2022755    -3.26   0.001    -1.056355   -.2634493
>              |
>        _cons |  -1.667376   .1021154   -16.33   0.000    -1.867518   -1.467233
> ------------------------------------------------------------------------------
>
> . logit warmlt2 yr89##male, nolog
>
> Logistic regression                             Number of obs =      2293
>                                                 LR chi2(3)    =     64.74
>                                                 Prob > chi2   =    0.0000
> Log likelihood = -851.54241                     Pseudo R2     =    0.0366
>
> ------------------------------------------------------------------------------
>      warmlt2 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
> -------------+----------------------------------------------------------------
>       1.yr89 |  -1.295833    .229115    -5.66   0.000     -1.74489   -.8467762
>       1.male |   .1816812   .1431068     1.27   0.204     -.098803    .4621655
>              |
>    yr89#male |
>         1 1  |   .4542502   .3050139     1.49   0.136    -.1435661    1.052066
>              |
>        _cons |  -1.667376   .1021154   -16.33   0.000    -1.867518   -1.467233
> ------------------------------------------------------------------------------
>
> . logit warmlt2 i.yr89 i.male yr89#male, nolog
>
> Logistic regression                             Number of obs =      2293
>                                                 LR chi2(3)    =     64.74
>                                                 Prob > chi2   =    0.0000
> Log likelihood = -851.54241                     Pseudo R2     =    0.0366
>
> ------------------------------------------------------------------------------
>      warmlt2 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
> -------------+----------------------------------------------------------------
>       1.yr89 |  -1.295833    .229115    -5.66   0.000     -1.74489   -.8467762
>       1.male |   .1816812   .1431068     1.27   0.204     -.098803    .4621655
>              |
>    yr89#male |
>         1 1  |   .4542502   .3050139     1.49   0.136    -.1435661    1.052066
>              |
>        _cons |  -1.667376   .1021154   -16.33   0.000    -1.867518   -1.467233
> ------------------------------------------------------------------------------
>
> -------------------------------------------
> Richard Williams, Notre Dame Dept of Sociology
> OFFICE: (574)631-6668, (574)631-6463
> HOME: (574)289-5227
> EMAIL: Richard.A.Williams.5@ND.Edu
> WWW: http://www.nd.edu/~rwilliam
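Richard's point that a#b and a##b are two parameterizations of the same model can also be verified at the design-matrix level: for two binary factors, the cell-dummy coding and the main-effects-plus-product coding span the same column space, so any fit that depends only on that space is identical. Below is a minimal sketch in Python (not Stata) using exact rational arithmetic; the helper `rank` and the variable layout are mine, invented for illustration.

```python
from fractions import Fraction

def rank(rows):
    """Matrix rank via exact Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column, move on
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                f = m[i][col] / m[r][col]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

# One row per (a, b) cell of two binary factors.
cells = [(0, 0), (0, 1), (1, 0), (1, 1)]

# a#b style: intercept plus one dummy per non-reference cell.
X_hash = [[1,
           int(a == 0 and b == 1),
           int(a == 1 and b == 0),
           int(a == 1 and b == 1)] for a, b in cells]

# a##b style: intercept, both main effects, and the product term.
X_hashhash = [[1, a, b, a * b] for a, b in cells]

# Each design is full rank, and stacking them side by side adds no new
# directions: the column spaces coincide, so the fits must match.
combined = [r1 + r2 for r1, r2 in zip(X_hash, X_hashhash)]
assert rank(X_hash) == 4
assert rank(X_hashhash) == 4
assert rank(combined) == 4
```

Because the two designs span the same space, the predicted probabilities and the log likelihood agree exactly, which is why all three Stata runs above report the same LR chi2 and Pseudo R2; only the labels on the coefficients change.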

* * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:
- **Re: Re-re-post: Stata 11 - Factor variables in a regression command**, *From:* Richard Williams <Williams.NDA@comcast.net>

**References**:
- **Re-re-post: Stata 11 - Factor variables in a regression command**, *From:* Ricardo Basurto <ricardobasurto@gmail.com>
- **Re: Re-re-post: Stata 11 - Factor variables in a regression command**, *From:* Michael Norman Mitchell <Michael.Norman.Mitchell@gmail.com>
- **Re: Re-re-post: Stata 11 - Factor variables in a regression command**, *From:* Richard Williams <Williams.NDA@comcast.net>
