
From: Roger Harbord <rogerharbord@bigfoot.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: statsby slowness
Date: Tue, 14 Aug 2007 10:14:44 +0100

I've had similar experiences. Based on a bit of experimentation, the time taken by -statsby- appears to be quadratic in the number of groups of the by() variable, at least once you get beyond a few hundred groups or so. For an example, see the code below.

However, I'm not sure that giving up on -statsby- helps: I've tried programming a simulation from scratch, and the problem appeared to be inherent to -in- when you're selecting one of a large number of groups. I ended up splitting the data into chunks of 500 groups each and running each chunk separately. (By the way, I had good reasons for generating the whole 10000 simulations in one go rather than taking the approach of -simulate-, which generates and estimates each simulation in turn and is clearly often a better approach.)

I believe the speed improvements in Michael Blasnik's -statsbyfast- were incorporated into the official -statsby- some time ago (see -help whatsnew8_1-), so -statsbyfast- is of historical interest only.

An illustration:

---------------------------------
sysuse auto, clear
expand 1000
bysort make : gen int i = _n
sort i make
set rmsg on
statsby t=(_b[weight] / _se[weight]), by(i) clear nodots : regress mpg weight
set rmsg off
---------------------------------

Change -expand 1000- to -expand 2000- and -statsby- takes not twice as long but four times as long. Changing -regress mpg weight- to -logit foreign weight- indicates that the issue isn't unique to -regress- (or to commands ultimately based on -regress-, such as -spearman- in David Airey's example below).

David Airey wrote:
> .
>
> I have found Blasnik's statsbyfast improvement, but for some reason it
> is broken in Stata 10.
>
>>
>> At what point does one give up using statsby? With just three
>> variables in my data set,
>>
>> ssrownum, iso_VSV, expression
>>
>> the following command does OK with 1000 by groups (< 20 cases in a
>> group), but is not useable with 20,000 by groups.
>>
>> statsby n=r(N) spearman=r(rho) p=r(p), by(ssrownum): spearman iso_VSV
>> expression
>>
>> Why?
>>
>> I posted something similar a long time ago compared speeds of ttest
>> with if versus in and versus regress, but I'm not happy at the moment.
>>
>>
> --
> David C. Airey, Ph.D.
> Research Assistant Professor
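
The chunking workaround mentioned above (running -statsby- on blocks of 500 groups and combining the results) might look roughly like the sketch below. It is not the code actually used for the simulations described in this message. It assumes the data from the illustration are in memory and that the group identifier i takes consecutive integer values starting at 1; the tempfile names, the loop structure, and the use of -append- are assumptions made for illustration.

```stata
* Sketch only: run -statsby- on blocks of 500 groups and append the results.
* Assumes i is an integer group identifier running 1, 2, ..., max(i).
tempfile full results
save `full'

summarize i, meanonly
local ngroups = r(max)
local chunk 500                          // groups per block (assumed size)

forvalues lo = 1(`chunk')`ngroups' {
    local hi = min(`lo' + `chunk' - 1, `ngroups')

    * load only the groups belonging to this block
    use if inrange(i, `lo', `hi') using `full', clear

    statsby t=(_b[weight] / _se[weight]), by(i) clear nodots : regress mpg weight

    * stack this block's results onto those accumulated so far
    if `lo' > 1 {
        append using `results'
    }
    save `results', replace
}
sort i
```

If the running time really is roughly quadratic in the number of groups, each block of 500 costs about the same regardless of how many blocks there are, so the total time should grow roughly linearly with the number of groups. Wrapping the loop in -set rmsg on- / -set rmsg off-, as in the illustration, is one way to check whether that holds on a given dataset.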

