Statalist



Re: st: conditional logistic


From   wgould@stata.com (William Gould, StataCorp LP)
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: conditional logistic
Date   Thu, 25 Oct 2007 09:55:55 -0500

Ricardo Ovaldia <ovaldia@yahoo.com> asks, 

> What is the difference between conditional logistic
> regression grouping on clinic and unconditional
> logistic regression including clinic as a dummy
> (indicator) variable? That is, what is the difference
> in model assumptions and parameter estimates?

The difference is that the unconditional (dummy-variable) logistic regression
estimates are inconsistent and bad.

Let's deal with inconsistent first.  Think of what happens as the number of
observations goes to infinity.  Let's denote the number of clinics as n and,
just to make things easy, let's assume the number of observations within
clinic is the same for each clinic, and is m.  Then the total number of
observations is N = n*m.

What happens as N->infinity?  Presumably, the number of clinics increases.
In this thought experiment, you are presumably imagining a replication 
of the world as we observe it, with clinics serving roughly the same 
number of patients, so as the number of patients grows, so does the number of 
clinics.  Said in our notation, we are imagining n going to infinity and 
m remaining constant.  In standard logistic regression, that means we are 
estimating n-1 coefficients for the clinics.  The number of coefficients 
is increasing at the same rate as the number of observations, with the 
result that there is no convergence to all the usual statistical properties
you are used to estimators having.

This may sound arcane, but it isn't, as you can show via simulation.  Even
easier, however, is to think about a simpler problem.  Consider standard
logistic regression with a standard problem -- no clinics, nothing odd.  We'll
assume one RHS variable, say sex.  It will not surprise you to hear that with
just 4 observations, the estimates produced by the standard logistic
regression estimator are bad.  The estimates would turn good if we added
more observations, but it turns out that with just 4, the asymptotics have not
yet kicked in and the estimates produced by the standard logistic regression
estimator are bad, not merely poor.  By poor, I mean noisy.  By bad, I mean
biased, wrong, and having no good properties.

Now let's consider the clinic.  Let's pretend we have 1,000 clinics and 
4 observations per clinic.  What running 

           . xi: logistic outcome sex i.clinic 

amounts to is running separate logistic regressions for each clinic, but with
the constraint that the coefficient on sex is the same across them.  I just
told you that with 4 observations, standard logistic is bad.  Combining 1,000
bad results does not improve them; they are still bad.  If the results were
merely poor -- noisy -- then combining them would help, but that's not our
case.

On the other hand, if by N = n*m -> infinity we meant holding n constant and
letting m->infinity, we would get good results.  With m going to infinity, you have
a world in which the number of clinics remains fixed but the number of
observations within clinic increases.  Under those circumstances, each
logistic regression would turn good once m got large enough, and combining
the results will make them even better.

So does it matter which thought experiment is in your mind?  No.  Whether you
imagine n->infinity or m->infinity, if you have m=4, you have insufficient
observations for the standard logistic regression estimator, and results will be
bad.  If you have m=20, then in most circumstances you do have sufficient
observations for the logistic estimator to work.  But if you were to get more
data and the first thought experiment is the correct one, meaning the number 
of clinics increases, the estimates will not get better, and that should
disturb you.  More data usually means better estimates.

Due to mathematical trickery, the conditional logistic estimator does not
estimate the individual coefficients for each clinic.  It therefore avoids the
problem of the number of estimates increasing at the same rate as the number of
observations, however the increase in N is split between n and m.
I told you that, with just 4 observations, standard logistic regression is
bad.  So would be the conditional logistic regression with just one clinic.
But unlike the standard logistic estimator, if you hold the size of clinics
constant and increase the number of them, results get better and better.
Give me a dataset with 20 clinics, and in most cases, I'm in asymptopia.
Results are trustworthy and, given more data, they just get better and better.
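
The comparison described above can be tried directly.  What follows is a
minimal simulation sketch, not part of the original post: the 200 clinics,
the N(0,1) clinic intercepts, the true coefficient of 1 on sex, and the seed
are all assumptions chosen for illustration.  It uses logit rather than
logistic so that both estimators report coefficients on the log-odds scale.
Clinics whose 4 outcomes are all 0 or all 1 are dropped first, since they
carry no information about the sex coefficient in either estimator.

```stata
* Hypothetical data: every number below is an illustrative assumption.
* Older Stata versions may first need:  set matsize 400
clear
set seed 12345
set obs 200                          // n = 200 clinics
generate clinic = _n
generate double alpha = rnormal()    // clinic-specific intercept
expand 4                             // m = 4 patients per clinic
generate byte sex     = runiform() < 0.5
generate byte outcome = runiform() < invlogit(alpha + 1*sex)   // true coefficient = 1

* Drop clinics with no variation in the outcome (all 0 or all 1).
bysort clinic: egen nsucc = total(outcome)
drop if nsucc == 0 | nsucc == 4

* Unconditional logit with one dummy per clinic.
quietly xi: logit outcome sex i.clinic
scalar b_dummy = _b[sex]

* Conditional logit grouping on clinic.
quietly clogit outcome sex, group(clinic)
scalar b_cond = _b[sex]

display "dummy-variable logit coefficient on sex: " b_dummy
display "conditional logit coefficient on sex:    " b_cond
```

Raising the number in set obs while leaving expand 4 alone corresponds to the
first thought experiment (n grows, m fixed).  Under the argument above, the
clogit estimate should settle toward 1 as clinics are added, while the
dummy-variable estimate should not.  Any single draw is noisy, so it is worth
repeating over several seeds before reading much into one run.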

-- Bill
wgould@stata.com

P.S.  Let me add a footnote to the argument above.  The footnote is 
      unimportant for the argument made, but is important in linear 
      regression problems.

      The gist of the problem in the standard logistic regression estimator 
      is that the number of estimated parameters increases at the same 
      rate as the number of observations.  The same could be said of
      the linear regression estimator and yet there is no problem because 
      of it.  Why?  Because in the linear-regression estimator, the problem
      of estimating the clinic intercepts can be separated from the problem
      of estimating the sex coefficient.  It just turns out that way because
      of the linear nature of the linear-regression estimator.  The same is
      not true of logistic.

      The logic, "if the number of estimates increases at the same rate as
      number of observations, there will be problems" is generally true,
      the exception being cases where there is a particular kind of
      separability, which happens only in the linear case.
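
The separability in the linear case can be checked numerically.  This is a
minimal sketch with made-up data: the continuous outcome y, the 200 clinics,
the true slope of 1, and the seed are assumptions for illustration only.
Regressing y on sex with a full set of clinic dummies should give the same
sex coefficient as first subtracting clinic means from y and sex and then
regressing with no clinic terms at all.

```stata
* Hypothetical data: every number below is an illustrative assumption.
* Older Stata versions may first need:  set matsize 400
clear
set seed 2007
set obs 200                          // n = 200 clinics
generate clinic = _n
generate double alpha = rnormal()    // clinic-specific intercept
expand 4                             // m = 4 patients per clinic
generate byte sex = runiform() < 0.5
generate double y = alpha + 1*sex + rnormal()   // true slope = 1

* (a) Estimate every clinic intercept explicitly.
quietly xi: regress y sex i.clinic
scalar b_dummies = _b[sex]

* (b) Remove the intercepts by demeaning within clinic, then regress.
bysort clinic: egen double ybar = mean(y)
bysort clinic: egen double xbar = mean(sex)
generate double yd = y - ybar
generate double xd = sex - xbar
quietly regress yd xd, noconstant
scalar b_within = _b[xd]

display "with clinic dummies: " b_dummies
display "after demeaning:     " b_within
```

Only the coefficient is expected to match; the standard errors from (b) do
not account for the clinic means having been estimated.  No comparable
demeaning step removes the clinic intercepts from the unconditional logistic
likelihood.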
