The Stata listserver

RE: st: Dummy Variables vs. Subgroup Models in Logistic Regression

From   "Hoetker, Glenn" <[email protected]>
To   <[email protected]>
Subject   RE: st: Dummy Variables vs. Subgroup Models in Logistic Regression
Date   Fri, 22 Oct 2004 10:30:31 -0500

At 01:45 PM 10/22/2004 +0000, [email protected] wrote:

>Dear Stata Users,
>      I'm creating a logistic regression model with many dichotomous 
> variables along with one term that has 8 categories coded 1,2,..8.  I
> create 7 dummy variables and have a very large model.  Would it be 
> legitimate if my sample sizes are large enough to create 8 separate 
> models with each model representing one subgroup?   Can anyone comment
> on the pros and cons of using dummy variables versus creating separate 
> "subgroup" models based on the remaining independent variables?

Comparing logit/probit coefficients across groups is actually
considerably more difficult than doing so in OLS.  This reflects the
fact that the betas are not identified in a logit model without imposing
a restriction by setting the variance of the error term to pi^2/3.  As a
result, the estimated coefficients are the underlying "true" effect
scaled by the amount of unobserved heterogeneity (a.k.a. residual
variation).  If the unobserved heterogeneity varies across groups, as it
often will, then the estimated betas will vary too, even if the "true"
effect is the same.  Allison (1999) discusses this and proposes a test
for detecting differences in unobserved heterogeneity and differences in
underlying coefficients.  Other discussions of the scale issue include
Maddala (1983:23), Long (1997:47), and Train (2004).
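
One way to write down the scaling argument is the latent-variable sketch
below. The notation (group index g, scale sigma_g) is illustrative and is
not copied from any of the cited sources.

```latex
% Latent-variable form for an observation i in group g,
% with \varepsilon_i standard logistic, so Var(\varepsilon_i) = \pi^2/3.
\begin{align*}
  y_i^{*} &= \alpha_g + \beta x_i + \sigma_g \varepsilon_i,
  \qquad y_i = 1 \text{ if } y_i^{*} > 0, \text{ else } 0 \\
  \Pr(y_i = 1 \mid x_i)
    &= \Pr\!\left(\varepsilon_i > -\frac{\alpha_g + \beta x_i}{\sigma_g}\right)
     = \Lambda\!\left(\frac{\alpha_g}{\sigma_g} + \frac{\beta}{\sigma_g}\, x_i\right),
  \qquad \Lambda(z) = \frac{1}{1 + e^{-z}}
\end{align*}
% A logit fitted within group g therefore recovers \beta / \sigma_g, not \beta.
% Two groups with the same \beta but \sigma_1 \neq \sigma_2 yield different coefficients.
```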

Hoetker (2004) uses Monte Carlo simulations to show that (a) the problem
Allison identified isn't just theoretical--it leads to misleading
inferences in common situations and (b) Allison's tests are a
significant improvement over current practice, but are not a panacea. It
also offers some alternative analytical approaches, including code in
Stata (of course) to implement them. One finding in particular is that
the use of interaction terms to detect inter-group differences in logit
equations is likely to yield misleading results if unobserved
heterogeneity differs across groups.  In some circumstances, it's
actually more likely to find significant results in the OPPOSITE
direction than in the right direction.
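
The scale problem can be illustrated with a small simulation. The sketch
below is not the design or code from Hoetker (2004); it is a minimal
Python example under assumed values (true slope 1.0 in both groups,
residual scales 1.0 and 2.0, 20,000 observations per group, one
standard-normal covariate). The helper names simulate_group and fit_logit
are hypothetical, and the logit is fitted by a hand-written
Newton-Raphson routine so that only the standard library is needed.

```python
import math
import random


def simulate_group(n, beta, sigma, rng):
    """Draw (x, y) where y = 1 if beta*x + sigma*eps > 0, eps standard logistic."""
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        u = rng.random()
        while u == 0.0:  # avoid log(0) in the inverse CDF below
            u = rng.random()
        eps = math.log(u / (1.0 - u))  # standard logistic draw
        xs.append(x)
        ys.append(1 if beta * x + sigma * eps > 0 else 0)
    return xs, ys


def fit_logit(xs, ys, iterations=25):
    """Fit P(y=1) = 1/(1+exp(-(a + b*x))) by Newton-Raphson; return (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g0 += y - p          # score for the intercept
            g1 += (y - p) * x    # score for the slope
            h00 += w             # entries of the information matrix X'WX
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


rng = random.Random(2004)  # fixed seed so the run is repeatable
# Same true effect (beta = 1.0) in both groups; only the residual scale differs.
x_lo, y_lo = simulate_group(20000, beta=1.0, sigma=1.0, rng=rng)
x_hi, y_hi = simulate_group(20000, beta=1.0, sigma=2.0, rng=rng)
_, b_low_noise = fit_logit(x_lo, y_lo)
_, b_high_noise = fit_logit(x_hi, y_hi)
print("slope, sigma = 1:", round(b_low_noise, 3))
print("slope, sigma = 2:", round(b_high_noise, 3))
```

Under these assumptions the two fitted slopes should come out near 1.0
and 0.5 respectively, even though the underlying effect is identical, so
a naive comparison would suggest a group difference that is purely one
of residual variation.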

For cross-group comparisons in general, Liao (2002) is a helpful
reference.

Sorry to actually muddy the waters rather than providing a simple
solution.  Best wishes.

Glenn Hoetker
Assistant Professor of Strategy
College of Business
University of Illinois at Urbana-Champaign
[email protected]

Allison, P.D. 1999. Comparing logit and probit coefficients across
groups. SMR/Sociological Methods & Research 28(2): 186-208.

Hoetker, G. 2004. Confounded coefficients: Extending recent
advances in the accurate comparison of logit and probit coefficients
across groups. Working paper.

Liao, T.F. 2002. Statistical group comparison. Wiley Series in
Probability and Statistics. New York: Wiley-Interscience.

Long, J.S. 1997. Regression models for categorical and limited dependent
variables. Advanced Quantitative Techniques in the Social Sciences.
Thousand Oaks, CA: Sage Publications.

Maddala, G.S. 1983. Limited-dependent and qualitative variables in
econometrics. New York: Cambridge University Press.

Train, K.E. 2004. Discrete choice methods with simulation. Cambridge:
Cambridge University Press.

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Richard
Sent: Friday, October 22, 2004 9:42 AM
To: [email protected]; [email protected]
Subject: Re: st: Dummy Variables vs. Subgroup Models in Logistic Regression

If you estimate separate models, you are allowing ALL parameters to
differ across groups, e.g. the effect of education could be different in
each group.  If you just add dummies, you are allowing the intercept to
differ in each group, but the effects of the other variables stay the same.
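
The difference in model size can be made concrete by counting
parameters. The sketch below is a minimal Python illustration; the
function names are hypothetical, and the figure of 10 covariates is an
assumption chosen for the example (the original question says only that
there are "many" dichotomous variables alongside the 8-category term).

```python
def n_params_pooled(n_groups, n_covariates):
    """One intercept, (n_groups - 1) group dummies, one shared slope per covariate."""
    return 1 + (n_groups - 1) + n_covariates


def n_params_separate(n_groups, n_covariates):
    """Each group gets its own intercept and its own slope for every covariate."""
    return n_groups * (1 + n_covariates)


# 8 groups as in the original question; 10 covariates is an assumed value.
pooled = n_params_pooled(8, 10)
separate = n_params_separate(8, 10)
print("pooled model with dummies:", pooled)
print("eight separate models:   ", separate)
```

With these assumed numbers the pooled model estimates 18 parameters and
the eight separate models estimate 88 between them.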

If you estimate separate models for each group, your models will
be much less parsimonious, i.e. you'll have a lot more parameters
around. But the real question is, what is most appropriate given your 
theory and the empirical reality?  If the effects of everything really
are different across every group, then you should estimate separate 
models.  But, if the effects do not differ across groups, then you are 
producing unnecessarily complicated models, and you are also reducing
statistical power, e.g. by not pooling groups when you should be pooling
them, you'll be more likely to conclude that effects do not differ from
zero when they really do.

These sorts of issues are discussed in

Richard Williams, Notre Dame Dept of Sociology
OFFICE: (574)631-6668, (574)631-6463
FAX:    (574)288-4373
HOME:   (574)289-5227
EMAIL:  [email protected]
WWW (personal):
WWW (department):
