Re: st: conditional logistic

From: "David W. Harless" <email@example.com>
Subject: Re: st: conditional logistic
Date: Thu, 25 Oct 2007 09:26:51 -0400
Ricardo Ovaldia wrote:

> I posted this under a different header and did not get
> a reply. So let me ask the question better.
>
> What is the difference between conditional logistic
> regression grouping on clinic and unconditional
> logistic regression including clinic as a dummy
> (indicator) variable? That is, what is the difference
> in model assumptions and parameter estimates?

The most important difference is that logit/logistic regression with dummy variables for groups is inconsistent unless the number of observations per group is large. There is a brief discussion of this (including cites) in the manual entry for -clogit- (page 224 of the Reference A-J manual for Release 9).
Back in February 2000, Bill Gould and Vince Wiggins posted the note pasted below, which gives a good explanation of these issues.
Jen Ireland <Jen.Ireland@bristol.ac.uk> wrote,

> I am estimating a logit model in which I have clustered the observations
> on the basis of a particular variable, not otherwise included in the
> model, as I have reason to believe that the observations may not be
> independent within the clusters.
>
> A colleague has argued that I could do just as well by simply including
> the clustering variable as an explanatory variable in my model. Why is
> it better to use clustering?

Unless there is something very odd about Jen's problem about which he is not telling us, I assume Jen's colleague is suggesting not that Jen simply include the cluster variable as a single variable in his model, but that Jen include a set of dummies for each value of the cluster variable.
Assume I have data grouped into clusters and I label the clusters 1, 2, 3, and so on. If I included the cluster variable as a single variable, I would obtain a single coefficient for the cluster variable -- call it b -- and I would be saying that the effect of being in the first cluster is b, the effect
of being in the second cluster is 2*b, and so on.
But my labeling of the groups as cluster 1, 2, 3, is arbitrary, I assume. I
could just as well order the clusters, putting what is now cluster 3 into the
first position, cluster 1 in the second, and so on. Then I could call those
clusters 1, 2, 3 ..., and therein lies a problem.
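To make the relabeling problem concrete, here is a small Python sketch (my illustration, not part of the original exchange; the coefficient and labels are made up):

```python
# Encoding the cluster id as a single numeric regressor forces the
# "effect" of cluster k to be k*b, so an arbitrary relabeling of the
# clusters changes the fitted model.

b = 0.5                 # a single hypothetical coefficient on the cluster id
labels_a = [1, 2, 3]    # one arbitrary labeling of three clinics
labels_b = [2, 3, 1]    # the same clinics, relabeled

effects_a = [b * c for c in labels_a]   # [0.5, 1.0, 1.5]
effects_b = [b * c for c in labels_b]   # [1.0, 1.5, 0.5] -- same clinics,
                                        # different implied effects

# With one dummy (indicator) per cluster, each clinic gets its own
# coefficient, and relabeling merely permutes which dummy is which:
dummies_a = [[1 if c == k else 0 for k in (1, 2, 3)] for c in labels_a]
```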
So I assume that the suggestion was to include a dummy variable for the first cluster, another dummy variable for the second, and so on.
Given that interpretation, and with respect, I must disagree with Jen's
colleague. To make a long story short (which long story I am about to tell),
Jen's colleague perhaps wished to suggest Jen use conditional logistic
regression (clogit) as an alternative to -logit, cluster() robust-. Had he
said that, I would, in some cases, have agreed.
The basis of Jen's colleague's comment is this: rather than using the
clustering correction to calculate the standard errors, one could instead
model the clustering. If one does that, and if one
has the modeling (meaning the assumptions) right -- one should be able to
produce more efficient estimates than those produced by -robust cluster()-.
Within-cluster correlation can arise for any number of reasons, but one particular reason is that each cluster has its own intercept. In that case,
one is tempted to estimate those intercepts by simply including the dummy variables.
That approach works in the case of linear regression, but it does not work in
general. Said technically, the asymptotics are violated. Call the number of
clusters n and the average number of observations within cluster T, so that
the total number of observations is N=n*T. As T->infinity, all is well. As
n->infinity, however, both the number of estimated parameters (coefficients on
the dummy variables) and the number of observations are going to infinity
together and only in strange cases does it work out that any of the estimated
parameters approach their true values.
The strange case is linear regression and that occurs because it is linear
(although the reason is not transparent).
In the case of logistic regression, however, the estimates one obtains from
including all the dummies are biased and, even as n->infinity, that bias never
goes away. Vince Wiggins <firstname.lastname@example.org> and I recently simulated this
and discovered that this is not a sterile, theoretical argument -- the estimates
one obtains for the parameters are genuinely bad.
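A sketch of that kind of simulation in Python (my reconstruction, not Bill and Vince's actual code). It uses the classic matched-pairs result that, with T = 2 observations per group, the unconditional dummy-variable MLE converges to 2*beta -- exactly twice the truth -- while the conditional estimate log(n01/n10) is consistent:

```python
import math
import random

random.seed(1)
beta = 1.0       # true coefficient on x
n = 20000        # many clusters (pairs), only T = 2 observations each

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

# Count discordant pairs: n01 = (failure at x=0, success at x=1), n10 = reverse.
n01 = n10 = 0
for _ in range(n):
    a = random.gauss(0.0, 1.0)              # pair-specific intercept
    y0 = random.random() < expit(a)         # outcome at x = 0
    y1 = random.random() < expit(a + beta)  # outcome at x = 1
    if (not y0) and y1:
        n01 += 1
    elif y0 and (not y1):
        n10 += 1

# Conditional-logit estimate for matched pairs: log of the discordant ratio.
b_cond = math.log(n01 / n10)

# For T = 2, the unconditional MLE with one dummy per pair has the known
# closed form 2*log(n01/n10) -- a classic incidental-parameters result --
# so we need not fit the n+1 parameters numerically.
b_dummy = 2 * b_cond

print("conditional estimate:", b_cond)   # close to 1.0
print("dummy-variable MLE:  ", b_dummy)  # close to 2.0, bias that never dies
```

However many pairs you add, the dummy-variable estimate stays near twice the truth; adding pairs adds intercepts to estimate just as fast as it adds data.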
To obtain good estimates, one must develop a new estimator. Models with
separate intercepts per cluster are known as "fixed-effects models". In the
case of logistic regression, this fixed-effects estimator is conditional logistic regression.
Thus, conditional logistic regression -- Stata's -clogit- command -- is an
alternative to using -robust cluster()-. In the case where the correlation
arises because of fixed effects (different intercepts across groups), -clogit-
is better than -robust cluster()- because it produces more efficient
estimates, meaning more accurate estimates with smaller standard errors.
And it is even better than that, because there is now more going on in this
model than just correlation within cluster (namely, the possibility of correlation of the fixed effects with other covariates), and -clogit- is taking that into account, too.
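Why conditioning eliminates the intercepts can be seen in the matched-pair (T = 2) case: given that a pair had exactly one success, the probability of which observation succeeded does not involve the pair's intercept at all. A quick numerical check in Python (my illustration, with an arbitrary beta):

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

beta = 0.7  # arbitrary coefficient for illustration

def cond_prob(alpha):
    # P(y = (0,1) | exactly one success in the pair), with x = (0,1)
    p0 = expit(alpha)          # success probability at x = 0
    p1 = expit(alpha + beta)   # success probability at x = 1
    return ((1 - p0) * p1) / ((1 - p0) * p1 + p0 * (1 - p1))

# The same value no matter what the pair's intercept alpha is:
vals = [cond_prob(a) for a in (-3.0, 0.0, 5.0)]
# algebraically, each equals exp(beta) / (1 + exp(beta))
```

The conditional likelihood is built from exactly these alpha-free terms, which is why the fixed effects need never be estimated.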
However, correlation within group can arise for a lot of reasons. Perhaps the observations within groups are serially correlated, or perhaps two of the
observations are whoppingly correlated and, after that, there is not much
correlation at all, or perhaps the correlation structure differs across the
clusters. In that case, -clogit- will not produce correct standard errors.
Meanwhile, -robust cluster()- will continue to produce correct standard errors
for its inefficient but population-wise consistent estimates.
-- Bill
-- Vince