Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: weights in xtlogit


From   Stas Kolenikov <skolenik@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: weights in xtlogit
Date   Sun, 15 Nov 2009 22:27:42 -0600

You would also want to give the population parameters as starting
values to -xtlogit-. That way, you get rid of the first two-three
iterations, and that matters a lot in long Monte Carlo studies. You
could even try to augment those population values using a fast command
such as -logit- (if you know how to convert -logit- results to
-xtlogit- results/starting values), so that you can quickly figure out
which way the parameter estimates deviate from the population values
in this particular sample.

What Austin suggests is of course making sense as good approximations
that are well usable in practice and give you a good idea about
whether you have significant results in your particular data set,
simulated or real. (Austin gave me once his -clogit- example that
required one month to compute the first iteration even on my quad
machine, so he knows what he is talking about with computational
complexity of discrete data models :) .) But if you want to study the
properties of this particular estimation procedure then tweaking the
starting values, number of integration points and convergence criteria
is pretty much all you can realistically do.

On 11/15/09, Austin Nichols <austinnichols@gmail.com> wrote:
> If you have 1000 obs per cluster, you might prefer to include fixed
>  effects (50 or so dummies) in regular logit--the bias from using
>  dummies is quite small when the number of obs per cluster is large.
>  Also, if you have 50-60 clusters (20-30 per T/C group rather than
>  20-30 in both groups) you can use the cluster-robust VCE.  The bias
>  (in SEs) is quite small with more than 50 balanced clusters.  This
>  should be fairly easy to simulate, since -logit- is fast.  A better
>  simulation would compare to -xtlogit- of course, but 10000 iterations
>  of something that takes a day to converge is not feasible... though if
>  you supply the true parameter vector with option from(b), it might
>  converge a bit faster.
>
>  On Sat, Nov 14, 2009 at 10:58 PM, Airey, David C
>
> <david.airey@vanderbilt.edu> wrote:
>  > .
>  >
>  > Thanks Austin.
>  >
>  > I was trying to find ways of speeding up power simulations of xtlogit models where cluster size is large.
>  >
>  > If I have 20-30 clusters in control and treatment groups, where each cluster has upwards of 1000 or more 1/0 observations, then xtlogit can be slow.
>  >
>  > So variables might be clusterid (animal = cluster), yesno (1 or 0), and treatment (1 or 0).
>  >
>  > In 2008 after a similar query, Joseph Coveney posted a version of an xtlogit simulation in a situation where cluster number was small (12) and the cluster size was also not large (50). A slightly modified version of that code is below.
>  >
>  > Unfortunately or fortunately, now we have deep (aka next gen) sequencing data which is a technology that gives very larger cluster sizes in our application.
>  >
>  > One of our colleagues is looking at modeling this data with an overdispersed count model instead of a mixed logit. That might be a better solution.
>  >
>  > // notes follow "//" or "*"
>  > // modified from Coveney, 2008
>  > clear all
>  > set more off // allow results to scroll
>  > set seed `=date("2009-11-09", "YMD")' // for repeatable simulations
>  > *
>  > capture program drop edits // drop program from memory if error
>  > program define edits, rclass // program to be called in simulation
>  > version 11 // version 11 of Stata
>  > syntax [, delta(real 0) rho(real 0)] // input delta and rho
>  > drop _all
>  > tempname p_t // define a temporary variable for later
>  > clear
>  > set obs 12 // 12 mice...could be 50 now
>  > generate int pid = _n // integer pid numbered from 1 to 12
>  > generate byte trt = mod(_n, 2) // alternate 0 and 1
>  > // Generate a probability
>  > // using the cluster correlation rho and the standard
>  > // deviation of the logistic distribution, along with
>  > // the treatment effect and some normal error per animal.
>  > generate double prob = invlogit(trt * `delta' + ///
>  >        sqrt(`rho' / (1 - `rho') * _pi^2 / 3) * rnormal())
>  > assert !missing(prob) // if rho is misspecified
>  > generate int nseq = 100 // number of sequences per animal
>  > // now the number of sequences per animal could be 1000 or more
>  > gen bnl = rbinomial(nseq, prob) // binomial for (100, prob)
>  > *
>  > // ttest on percent summary statistic per animal
>  > gen percent = bnl/nseq*100
>  > ttest percent, by(trt)
>  > scalar define `p_t' = r(p)
>  > *
>  > // reshape the data from one score per animal
>  > // to one binary observation per row
>  > // so that a mixed logit model can be used...
>  > rename bnl count1
>  > generate int count0 = nseq - count1
>  > reshape long count, i(pid) j(positive) // can weights in xtlogit be used?
>  > drop if !count
>  > expand count
>  > xtlogit positive trt, i(pid) intpoints(30)
>  > // return actual rho and delta and
>  > // xtlogit and ttest pvalues for summarization
>  > return scalar rho = e(rho)
>  > return scalar p_r = chi2tail(1, e(chi2))
>  > return scalar delta = _b[trt]
>  > return scalar p_t = `p_t'
>  > end // program end
>  > *
>  > // call program above for treatment effects 0, 0.5, and 1
>  > // rho values of 0, 0.05, 0.10, and 0.15
>  > forvalues delta_int = 0(5)10 {
>  >        local delta = `delta_int'/10 // delta = 0.0, 0.5, and 1.0
>  >        forvalues rho_int = 0(5)15 {
>  >                local rho = `rho_int'/100 // rho = 0.0, 0.05, 0.10, 0.10
>  >                display in smcl as text _newline(1) ///
>  >                        "delta = " %03.2f `delta' " rho = " %03.2f `rho'
>  >                simulate delta=r(delta) rho=r(rho) p_r=r(p_r) ///
>  >                        p_t=r(p_t), reps(300): ///
>  >                        edits , rho(`rho') delta(`delta')
>  >                summarize delta rho
>  >                foreach var of varlist p_* {
>  >                        generate byte pos_`var' = `var' < 0.05
>  >                }
>  >                summarize pos_*
>  >        }
>  > }
>  > exit
>  >
>  >
>  > -Dave
>  >
>  >
>  >> Does
>  >> "two groups of clustered observations, but lots of observations per cluster"
>  >> mean id takes on two distinct values?  And you have many obs in each
>  >> category?  Just use logit with a dummy variable for id in that case.
>  >>
>  >> On Sat, Nov 14, 2009 at 2:16 PM, Airey, David C
>  >> <david.airey@vanderbilt.edu> wrote:
>  >> > .
>  >> >
>  >> > In logit models can one get faster results using weights if there are few covariate patterns? I thought I remembered reading that somewhere. If so, is it possible to do the same with xtlogit? So if you have the simplest of model, where you have two groups of clustered observations, but lots of observations per cluster, making estimation slow, can weights be used to speed up the estimation?
>  >> >
>  >> > xtlogit yn trt, i(id)
>  >
>  > *
>  > *   For searches and help try:
>  > *   http://www.stata.com/help.cgi?search
>  > *   http://www.stata.com/support/statalist/faq
>  > *   http://www.ats.ucla.edu/stat/stata/
>  >
>
>  *
>  *   For searches and help try:
>  *   http://www.stata.com/help.cgi?search
>  *   http://www.stata.com/support/statalist/faq
>  *   http://www.ats.ucla.edu/stat/stata/
>


-- 
Stas Kolenikov, also found at http://stas.kolenikov.name
Small print: I use this email account for mailing lists only.

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index