Kelly,
800K observations, 30 variables... I would predict that a single
likelihood evaluation takes about a day (you can check that by adding
the -dots- option and watching how frequently the dots appear). If
you are near the maximum already, you would still need at least
30 evaluations to get the covariance matrix estimate. If you need to
converge to the maximum, you would need, say, 5 iterations
with at least 10 likelihood evaluations per iteration. All in all, I am
giving your model a year to converge the way it is looking right now.
How to speed this up?
1. Run -xtlogit, re- to get the starting values (I imagine it will take
a couple of days to converge, too). Note that the scaling of
the error term variance might differ between the two commands,
although frankly I don't remember the details. Obviously, -xtlogit-
fixes the variance of the random slope at zero, so you could try
a small positive number as the starting value there.
2. Rescale all the variables so that the information matrix is well
conditioned. -gllamm- outputs the condition number; I would try the
-noest- option to see what it is. It should be no more than 1000 for
your Newton-Raphson steps to be reasonably efficient and to converge
quickly.
3. I've seen a discussion of how -gllamm- scales somewhere; don't
ask me for a reference, it may be the manual or one of Sophia's
presentations. I think the computational time is (a) quadratic in the
number of factors (as you need to estimate their covariance matrix);
(b) proportional to (# integration points)^(# factors), and it might be
the case that your 4 points are too few to give accurate enough
approximations for your huge data set; (c) about linear in sample
size; (d) increasing in the number of explanatory variables, although
I cannot pinpoint whether the increase is linear, quadratic, or
something else. Basically, you need to invert larger matrices with
more variables, and it gets ugly very quickly.
4. Take a sample of, say, 20 units from each of your hospitals (I am
assuming this is what HRR stands for), and account for that with a
sampling weight of (#obs in HRR)/20 (or maybe (#obs in HRR)/(average
#obs in HRR), for arguably better small-sample bias properties). With
2000 observations, it won't exactly fly, but it should converge in less
than a day, especially on a machine like yours. Repeat it a few times,
and combine the estimation results, I would say in a fashion similar
to what you would do with multiply imputed data sets.
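
For point 1, a minimal sketch of pulling starting values out of -xtlogit-. The covariate list is shortened for illustration, and the 0.1 and 0 used for the random-slope terms are arbitrary. The order I assume for the -gllamm- random part (intercept SD, slope SD, covariance term) is a guess on my part, so check it against the column names of e(b) from a small -gllamm- run before relying on -copy-. It also assumes cons, hrr_cons and hrr_black are defined as in the original post.

```stata
* Shortened covariate list for illustration; substitute the full list of 30
local xvars age female race_black elix_chf elix_pvd elix_copd yr9596 yr9798

* Random-intercept logit (no random slope on race_black).
* i() is the older syntax; in newer Stata you may prefer -xtset hrrcode- first.
xtlogit died_admitp30 `xvars', i(hrrcode) re

* e(b) holds the fixed effects followed by lnsig2u; drop that last column
matrix b_xt = e(b)
local k = colsof(b_xt) - 1
matrix b_fix = b_xt[1, 1..`k']

* Random-intercept standard deviation as reported by -xtlogit-
local sd_u = e(sigma_u)

* ASSUMED order of the gllamm random part: intercept SD, slope SD, covariance term
matrix a = (b_fix, `sd_u', 0.1, 0)
matrix list a

* -copy- makes gllamm take the values by position, ignoring column names
gllamm died_admitp30 `xvars', link(logit) fam(binom) i(hrrcode) ///
    nrf(2) eqs(hrr_cons hrr_black) nip(4) adapt from(a) copy
```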
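
For point 2, a sketch of the rescaling, assuming only age and income need it since everything else is 0/1. The new names z_age and z_income are made up for illustration. zip_median~e is the abbreviated name as it appears in the original post, and I let -unab- expand it rather than guess the full name.

```stata
* Standardize age to mean 0, SD 1
egen z_age = std(age)

* Expand the abbreviated income name from the original post, then standardize
unab incvar : zip_median~e
egen z_income = std(`incvar')

* Check the result: both should have mean near 0 and SD near 1
summarize z_age z_income
```

Use z_age and z_income in place of the raw variables in the -gllamm- call, then look at the condition number again.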
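
For point 3(b), a quick bit of arithmetic showing how the number of quadrature points per cluster grows if you raise nip() with two random effects. This only illustrates the (# integration points)^(# factors) rule as I remember it, not a timing for your model; the values 8 and 12 are arbitrary.

```stata
* Quadrature points per cluster = nip^(number of random effects)
local nrf 2
foreach nip of numlist 4 8 12 {
    display "nip(`nip'): " `nip'^`nrf' " points, " ///
        (`nip'/4)^`nrf' " times the cost of nip(4)"
}
```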
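
For point 4, a rough sketch of the subsample-and-combine idea. The number of repetitions (5), the seed, and the shortened covariate list are arbitrary choices for illustration, and combining with the multiple-imputation formulas is my suggestion above rather than an established procedure for subsamples. It assumes cons, hrr_cons and hrr_black are defined as in the original post, and that -gllamm- picks up pw1 and pw2 as the level-1 and level-2 weights from pweight(pw).

```stata
local xvars age female race_black elix_chf elix_pvd elix_copd
local M 5
set seed 20070430

* Full-data size of each HRR, needed for the weights
bysort hrrcode: generate n_hrr = _N

forvalues m = 1/`M' {
    preserve
    sample 20, count by(hrrcode)          // keep 20 patients per HRR
    bysort hrrcode: generate n_samp = _N  // actual number kept
    generate pw1 = n_hrr / n_samp         // level-1 sampling weight
    generate pw2 = 1                      // every HRR is retained
    gllamm died_admitp30 `xvars', link(logit) fam(binom) i(hrrcode) ///
        nrf(2) eqs(hrr_cons hrr_black) nip(4) adapt pweight(pw)
    matrix b`m' = e(b)
    matrix V`m' = e(V)
    restore
}

* Combine as with multiply imputed data: average the estimates, then
* total variance = within + (1 + 1/M) * between
matrix bbar = b1
matrix W = V1
forvalues m = 2/`M' {
    matrix bbar = bbar + b`m'
    matrix W = W + V`m'
}
matrix bbar = bbar / `M'
matrix W = W / `M'
matrix B = J(colsof(bbar), colsof(bbar), 0)
forvalues m = 1/`M' {
    matrix d = b`m' - bbar
    matrix B = B + d' * d
}
matrix B = B / (`M' - 1)
matrix T = W + (1 + 1/`M') * B
matrix list bbar
matrix list T
```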
Regarding other software options, -gllamm- is slower than any of its
competitors, so for a simple model like yours it might be reasonable
to try some alternatives, at least to get starting values if
that software's estimation engine uses cruder approximations.
On 4/30/07, Richardson, Kelly K. <Kelly.Richardson@va.gov> wrote:
Hi Stata Listers,
I am trying to run a GLLAMM model with a random intercept for HRR and a
random coefficient for black race. I have a sample size of 798,565 with
98 HRRs. There are 30 independent variables. Age and income are
continuously measured but all other variables are dichotomous including
the dependent variable. I thought about collapsing the data but I didn't
have any cases in which multiple patients had the same values on the
independent variables. I used the following code:
gen cons=1
eq hrr_cons: cons
eq hrr_black: race_black
gllamm died_admitp30 age female race_black zip_median~e elix_chf ///
    elix_coagu~t elix_hyper~n elix_pvd elix_fluid~s elix_diabe~s elix_liver ///
    elix_renal~l elix_copd elix_valve elix_obesity hannan97_cvd spec_prima~i ///
    spec_cc_cr~h spec_ami_s~l spec_prevc~g spec_ami_i~t adtype_ele~e ///
    adtype_urg~t adsour_tra~n adsour_eme~m yr9596 yr9798 yr9900 yr0102 ///
    yr0304, link(logit) fam(binom) i(hrrcode) nrf(2) eqs(hrr_cons hrr_black) ///
    nip(4) adapt
The program ran for several days and finally the power went out in my
building and, well, I got nothin'. I'm wondering, before I start this
process again, am I running too large of a model? Is my sample size too
big? Is there something wrong with my code? I know the documentation
says these models take awhile to converge but should it be several days?
If my model is too large, what are the limits for GLLAMM models? Should
I do this model in HLM instead? Is it my machine?
I am working with:
Microsoft Windows XP Professional, Version 2002, Service Pack 2
Dell Precision PWS670
Intel(R) Xeon(TM) CPU 3.80 GHz
2.77 GHz, 3.25 GB of RAM
Any advice you can give would be greatly appreciated.
Thanks,
Kelly
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Kelly K. Richardson, PhD
--
Stas Kolenikov
http://stas.kolenikov.name