Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: conditional logistic


From   wgould@stata.com (William Gould, StataCorp LP)
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: conditional logistic
Date   Fri, 26 Oct 2007 11:00:35 -0500

Concerning my posting on logistic regression with dummies vs. conditional 
logistic regression yesterday, Ricardo Ovaldia <ovaldia@yahoo.com> 
first quotes me, 

WG> The difference is that the logistic regression
WG> estimates are inconsistent
WG> and bad.

and then writes, 

RO> I must have missed something. From Bill's argument I
RO> see how this is true if you have few observation
RO> within clinic (for example 4), but I am not sure if
RO> the same holds if you have 4 clinics each with 200+
RO> observations which is the most ususal scenario.

First, the theoretical statement:

    Standard logistic regression on four clinics each with 200+ observations
    will generate good results.  One would usually expect to get even better
    results as the number of observations increases.  In this case, however,
    if the number of observations increases by increasing the number of
    clinics while holding the number of observations within clinics constant,
    I cannot prove that results will get better.

As a practical matter, Ricardo might be right.  I just did a simulation where
I assumed

     Case 1:
               lnodds(outcome_j) = 0 + 1*sex_j

     Case 2:

               lnodds(outcome_ij) = a_i + 1*sex_ij
               200 observations in each clinic i.

I did these simulations for overall sample sizes of 200, 400, 
800, 1600, 3200, 6400, 12800, 25600, and 25600.  Here are the results 
I got:

                           Case 1            Case 2
                     _b[sex] _se[sex]  _b[sex]  _se[sex]
        -------------------------------------------------
           800        1.1090    .15     1.0474    .17
          1600         .8713    .10     1.0494    .11
          3200         .9475    .07     1.0474    .08
          6400         .9898    .05     1.0803    .06
         12800         .9886    .04     1.0849    .04
         25600        1.0027    .03     1.0315    .03
        -------------------------------------------------

Theoretically, I know that Case 1 converges.  Theoretically, I cannot prove
that Case 2 will converge.  In this example, however, note that the standard
errors on the sex coefficient are both heading down and are, in fact, roughly
comparable.  In this case, the constant clinic size model seems to be
converging.

If it is convering, I have a suspicion as to why:  The model I simulated I
could have fitted using a linear probablity model and that model would have
fit the data about as well.  All the RHS variables are dummies and even from
linear regression, no prediction would have been below 0 or above 1.  I also
know that the linear regression model does converge in both cases 1 and 2.
Thus, it would not surprise me that the logistic regression estimator works
fine in this case.

On the other hand, my simulation is inadequate because I have not verified
that the coverage is right in case 2:  I report the standard error for one
regression in each row, and that reported standard error seems to be convering.
I should do a Monte Carlo on each row and check that the reported standard 
error has the correct coverage.

I could do all that.  I should do all that.  But I must ask:  Asuming 
one is not auditing individual clinics, why use -logit- or -logistic- in this
case when -clogit- is just as easy to use, is less computationally intensive,
and theoretically known to be right?  If one is auditing clinics, then 
we need to change the focus and start looking at how well -logit- and 
-logistic- estimate the individual clinic effects.

-- Bill
wgould@stata.com

P.S.  For anyone interested in pursuing this, here are the two programs 
      I used to produce the table above.

Do-file sim1.do makes columns 1 and 2 ov the table above:
----------------------------------------------------- sim1.do ----------------
clear all

program onecl
        args N
        quietly {
                drop _all
                set obs `N'
                gen sex = cond(_n<=_N/2, 0, 1)
                gen lnodds = 1*sex
                gen p      = exp(lnodds)/(1+exp(lnodds))
                gen int outcome = (uniform()<=p)
                logit outcome sex
        }
        di `N' " " _b[sex] " " _se[sex]
end

set seed 1939998
onecl 200
onecl 400
onecl 800
onecl 1600
onecl 3200
onecl 6400
onecl 12800
onecl 25600
onecl 51200
----------------------------------------------------- sim1.do ----------------

Do-file sim2.do makes columns 3 and 4:
----------------------------------------------------- sim2.do ----------------
clear all

program twocl
        args N
        local m = 200
        local n = `N'/`m'
        quietly {
                drop _all
                set obs `n'
                gen clinic = _n
                gen u      = 2.5*(uniform()-.5)
                expand `m'
                sort clinic
                gen sex = cond(mod(_n,2)==0,0,1)
                gen lnodds = 1*sex + u
                gen p      = exp(lnodds)/(1+exp(lnodds))
                gen int outcome = (uniform()<=p)
                xi: logit outcome sex i.clinic
        }
        di `N' " " _b[sex] " " _se[sex]
end

set seed 1939998
twocl 800
twocl 1600
twocl 3200
twocl 6400
twocl 12800
twocl 25600
twocl 51200
twocl 102400
----------------------------------------------------- sim2.do ----------------
<end>
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index