
From: "David W. Harless" <dwharles@vcu.edu>

To: statalist@hsphsun2.harvard.edu

Subject: Re: st: conditional logistic

Date: Thu, 25 Oct 2007 09:26:51 -0400

Ricardo Ovaldia wrote:

> Dear all, I posted this under a different header and did not get a reply. So let me ask the question better. What is the difference between conditional logistic regression grouping on clinic and unconditional logistic regression including clinic as a dummy (indicator) variable? That is, what is the difference in model assumptions and parameter estimates? Thank you, Ricardo.

The most important difference is that logit/logistic regression with dummy variables for groups is inconsistent unless the number of observations per group is large. There is a brief discussion of this (including cites) in the manual entry for -clogit- (page 224 of Reference A-J for Release 9 Manual).

Way back in February 2000, Bill Gould and Vince Wiggins posted the note pasted below, which gives a good explanation of these issues.

Dave Harless

Jen Ireland <Jen.Ireland@bristol.ac.uk> wrote,

> I am estimating a logit model in which I have clustered the observations

> on the basis of a particular variable, not otherwise included in the

> model, as I have reason to believe that the observations may not be

> independent within the clusters.

>

> A colleague has argued that I could do just as well by simply including

> the clustering variable as an explanatory variable in my model. Why is

> it better to use clustering?

Unless there is something very odd about Jen's problem about which he is not

telling us, I assume Jen's colleague is suggesting not that Jen simply include

the cluster variable as a single variable in his model, but that Jen include a

set of dummies for each value of the cluster variable.

Assume I have data grouped into clusters and I label the clusters 1, 2, 3, and so on. If I included the cluster variable as a single variable, I would obtain a single coefficient for the cluster variable -- call it b -- and I would be saying that the effect of being in the first cluster is b, the effect

of being in the second cluster is 2*b, and so on.

But my labeling of the groups as cluster 1, 2, 3, is arbitrary, I assume. I

could just as well order the clusters, putting what is now cluster 3 into the

first position, cluster 1 in the second, and so on. Then I could call those

clusters 1, 2, 3 ..., and therein lies a problem.

So I assume that the suggestion was to include a dummy variable for the first cluster, another dummy variable for the second, and so on.
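The point about arbitrary labels can be made concrete with a few lines of arithmetic. This is only an illustrative sketch in Python (not Stata); the clinic names and the coefficient value 0.4 are hypothetical and do not come from the original note. It shows that a single coded variable forces the cluster effects to be b, 2*b, 3*b, and that the implied effect for a given cluster changes as soon as the clusters are renumbered.

```python
# Hypothetical coefficient on a cluster variable entered as a single number.
b = 0.4

# Two equally arbitrary numberings of the same three (hypothetical) clusters.
labeling_a = {"clinic_x": 1, "clinic_y": 2, "clinic_z": 3}
labeling_b = {"clinic_z": 1, "clinic_x": 2, "clinic_y": 3}


def implied_effects(labeling, coef):
    """Effect on the linear index that a single coded variable imposes."""
    return {name: coef * code for name, code in labeling.items()}


effects_a = implied_effects(labeling_a, b)
effects_b = implied_effects(labeling_b, b)

# Same clusters, same coefficient, different implied effects per cluster.
for name in sorted(labeling_a):
    print(name, effects_a[name], effects_b[name])
```

A set of dummy variables avoids this because each cluster gets its own coefficient, so the fit does not depend on how the clusters happen to be numbered.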

Given that interpretation, and with respect, I must disagree with Jen's

colleague. To make a long story short (which long story I am about to tell),

Jen's colleague perhaps wished to suggest Jen use conditional logistic

regression (clogit) as an alternative to -logit, cluster() robust-. Had he

said that, I would, in some cases, have agreed.

The basis of Jen's colleague's comment

--------------------------------------

Rather than using the clustering correction when calculating the standard

errors, one could instead model the clustering. If one does that, and if one

has the modeling (meaning the assumptions) right -- one should be able to

produce more efficient estimates than those produced by -robust cluster()-.

Within-cluster correlation can arise for any number of reasons, but one particular reason is that each cluster has its own intercept. In that case,

one is tempted to estimate those intercepts by simply including the dummy

variables.

That approach works in the case of linear regression, but it does not work in

general. Said technically, the asymptotics are violated. Call the number of

clusters n and the average number of observations within cluster T, so that

the total number of observations is N=n*T. As T->infinity, all is well. As

n->infinity, however, both the number of estimated parameters (coefficients on

the dummy variables) and the number of observations are going to infinity

together and only in strange cases does it work out that any of the estimated

parameters approach their true values.

The strange case is linear regression and that occurs because it is linear

(although the reason is not transparent).

In the case of logistic regression, however, the estimates one obtains from

including all the dummies are biased and, even as n->infinity, that bias never

goes away. Vince Wiggins <vwiggins@stata.com> and I recently simulated this

and discovered that this is not a sterile, theoretical argument -- the estimates

one obtains for the parameters are genuinely bad.

To obtain good estimates, one must develop a new estimator. Models with

separate intercepts per cluster are known as "fixed-effects models". In the

case of logistic regression, this fixed-effects estimator is conditional

logistic regression.
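For readers who want to see why conditioning removes the cluster intercepts, here is the standard textbook form for the simplest case of two observations per cluster. The notation (i for cluster, t for observation within cluster, alpha_i for the cluster intercept, beta for the slope) is an assumption added here and does not appear in the original note.

```latex
% Fixed-effects logit model: each cluster i has its own intercept \alpha_i.
\Pr(y_{it}=1 \mid x_{it}, \alpha_i) = \Lambda(\alpha_i + x_{it}\beta),
\qquad \Lambda(z) = \frac{e^{z}}{1+e^{z}}

% With T = 2, condition on exactly one positive outcome in the cluster:
\Pr(y_{i2}=1 \mid y_{i1}+y_{i2}=1,\, x_{i1}, x_{i2})
  = \frac{e^{\alpha_i + x_{i2}\beta}}
         {e^{\alpha_i + x_{i1}\beta} + e^{\alpha_i + x_{i2}\beta}}
  = \Lambda\bigl((x_{i2}-x_{i1})\beta\bigr)
```

The intercept alpha_i cancels from the ratio, so the conditional likelihood depends on beta alone and the number of parameters no longer grows with the number of clusters.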

Thus, conditional logistic regression -- Stata's -clogit- command -- is an

alternative to using -robust cluster()-. In the case where the correlation

arises because of fixed effects (different intercepts across groups), -clogit-

is better than -robust cluster()- because it produces more efficient

estimates, meaning more accurate estimates with smaller standard errors,

and it is even better than that because there is now more going on in this

model than just correlation within cluster (namely, the possibility of correlation of the fixed effects with other covariates) and -clogit- is taking that into account, too.
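The original simulation is not shown in the note, so what follows is an independent, minimal sketch in Python (standard library only, not Stata) of the kind of experiment described above. Everything specific in it is an assumption made for illustration: two observations per cluster, a true slope of 1, normally distributed cluster intercepts that are correlated with x, and hand-written Newton iterations in place of -logit- and -clogit-. With two observations per cluster, the conditional likelihood reduces to a no-intercept logit on within-cluster differences among clusters with exactly one positive outcome.

```python
import math
import random


def logistic(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def simulate(n, beta, rng):
    """n clusters of 2 observations; intercept alpha is correlated with x."""
    data = []
    for _ in range(n):
        alpha = rng.gauss(0.0, 1.0)
        xs = [0.5 * alpha + rng.gauss(0.0, 1.0) for _ in range(2)]
        ys = [1 if rng.random() < logistic(alpha + beta * x) else 0 for x in xs]
        data.append((xs, ys))
    return data


def fit_clogit_pairs(data, iters=30):
    """Conditional logit for T=2: logit of y2 on (x2 - x1), discordant pairs only."""
    pairs = [(xs[1] - xs[0], ys[1]) for xs, ys in data if sum(ys) == 1]
    b = 0.0
    for _ in range(iters):
        g = h = 0.0
        for dx, d in pairs:
            p = logistic(b * dx)
            g += (d - p) * dx
            h += p * (1.0 - p) * dx * dx
        b += g / h  # one-dimensional Newton step
    return b


def fit_dummy_logit(data, iters=30):
    """Logit with one intercept per cluster, fit by Newton's method.

    Clusters with no outcome variation are dropped because their intercepts
    would diverge. The Newton system is solved via the Schur complement,
    using the fact that the intercept block of the Hessian is diagonal.
    """
    data = [(xs, ys) for xs, ys in data if 0 < sum(ys) < len(ys)]
    b = 0.0
    a = [0.0] * len(data)
    for _ in range(iters):
        g_b = h_bb = 0.0
        g_a, h_aa, h_ab = [], [], []
        for i, (xs, ys) in enumerate(data):
            ga = haa = hab = 0.0
            for x, y in zip(xs, ys):
                p = logistic(a[i] + b * x)
                w = p * (1.0 - p)
                ga += y - p
                g_b += (y - p) * x
                haa += w
                hab += w * x
                h_bb += w * x * x
            g_a.append(ga)
            h_aa.append(haa)
            h_ab.append(hab)
        schur = h_bb - sum(c * c / d for c, d in zip(h_ab, h_aa))
        db = (g_b - sum(c * g / d for c, g, d in zip(h_ab, g_a, h_aa))) / schur
        b += db
        a = [ai + (g - c * db) / d for ai, g, c, d in zip(a, g_a, h_ab, h_aa)]
    return b


rng = random.Random(2007)          # fixed seed so the run is repeatable
data = simulate(n=4000, beta=1.0, rng=rng)
b_cond = fit_clogit_pairs(data)
b_dummy = fit_dummy_logit(data)
print("conditional logit slope:   ", round(b_cond, 3))
print("dummy-variable logit slope:", round(b_dummy, 3))
```

Under these assumptions the conditional estimate should land close to the true slope of 1, while the dummy-variable estimate should come out at about twice that, however many clusters are added. The factor of two is specific to the two-observations-per-cluster case; with more observations per cluster the bias shrinks, in line with the T->infinity argument above.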

However, correlation within group can arise for a lot of reasons. Perhaps the observations within groups are serially correlated, or perhaps two of the

observations are whoppingly correlated and, after that, there is not much

correlation at all, or perhaps the correlation structure differs across the

clusters. In that case, -clogit- will not produce correct standard errors.

Meanwhile, -robust cluster()- will continue to produce correct standard errors

for its inefficient but population-wise consistent estimates.

-- Bill -- Vince

wgould@stata.com vwiggins@stata.com
