RE: st: request for help - multi-level modelling with a big dataset using xtlogit
> "Alves, Bernadette" wrote:
> > I'm a student looking for help with my MSc dissertation looking at
> > factors associated with delivery by caesarean section. It's an
> > analysis of a database of about half a million records of women who
> > gave birth in hospital. I am using logistic regression and, because
> > my data are naturally grouped, I'm using a multi-level approach to
> > take account of the correlation between women in the same hospital.
> > I am therefore using xtlogit (rather than logit). I find that I
> > cannot run xtlogit with my entire 500,000 records - Stata comes back
> > with an error saying that it needs to be able to set matsize to
> > approximately 18,000. Unfortunately the matsize limit for Stata 7.0
> > is 800.
> >
> > I then took a 4% sample (approximately 20,000 records), which is the
> > largest that Stata can cope with at a matsize of 800. But, and here's
> > the weird thing that I need help with... The parameter estimates are
> > very dependent on the sample I take. Sometimes I get a p-value of
> > 0.05, for other samples I get a p-value of 0.7. Here's an example of
> > what I do to test whether xdelmid is a predictor of emergency
> > caesarean section.
> >
> > sample 4 /* this gives me the 4% sample */
> > xi: xtlogit emerg i.gestat i.age i.xdelmid, pa corr(exch) robust i(provid)
> > testparm _Ixdel* /* this does a Wald test on xdelmid */
> >
> > Taking 10 different 4% samples, I find my estimates differ
> > considerably and my p-values range from 0.04 to 0.71.
> >
> > Why can't Stata cope with the full dataset and why are the parameter
> > estimates so sensitive to the sample taken?
> >
> > I would be extremely grateful if someone could help me with this.
I know little about xtlogit and its memory requirements, so I can't
speak to that. But even with Stata SE, you would need a *huge* amount
of memory in your computer to run anything with a matsize of 11,000.
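I would draw your subsamples as complete locations (all records from
each sampled hospital) rather than as individual records. Sampling
individual records thins out every hospital's cluster, and that might
be causing the vast variation in significance across your samples.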
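
A minimal sketch of what sampling complete locations could look like,
assuming provid identifies the hospital as in your xtlogit command. The
helper variables first and u and the seed value are made up for
illustration, and I have not run this against your data. It draws one
random number per hospital, copies it to every record in that hospital,
and keeps the hospitals whose draw falls below 0.04:

```stata
* Sketch only: keep a random ~4% of hospitals together with ALL of their records.
* provid is the hospital identifier from the original command;
* first and u are hypothetical helper variables, and the seed is arbitrary.
set seed 12345
bysort provid: gen byte first = (_n == 1)   // flag one record per hospital
gen double u = uniform() if first           // one random draw per hospital
bysort provid (u): replace u = u[1]         // spread that draw to the whole hospital
keep if u < 0.04                            // keep hospitals whose draw is below 4%
drop first u

* Then the same model and Wald test as in the original post
xi: xtlogit emerg i.gestat i.age i.xdelmid, pa corr(exch) robust i(provid)
testparm _Ixdel*
```

Because hospitals differ in size, the number of records kept will vary
from draw to draw, so the cutoff may need adjusting to stay within what
your matsize allows. Whether this removes the instability is something
you would have to check by repeating it over several seeds.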