Re: st: statsby slowness

From: Roger Harbord
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: statsby slowness
Date: Tue, 14 Aug 2007 10:14:44 +0100

```
I've had similar experiences - based on a bit of experimentation, the
time taken by -statsby- appears to be quadratic in the number of groups
of the by() variable, at least once you get beyond a few hundred or so
groups. For an example, see the code below.

However, I'm not sure that giving up on -statsby- helps -- I've tried
programming a simulation from scratch and the problem appeared to be
inherent to -in- when you're selecting one of a large number of groups.
I ended up splitting the data into chunks of 500 groups each and running
each chunk separately (by the way I had good reasons for generating the
whole 10000 simulations in one go rather than the approach taken by
-simulate- of generating and estimating each simulation in turn, which
is clearly often a better approach).

I believe the speed improvements in Michael Blasnik's -statsbyfast- were
incorporated into the official -statsby- some time ago (see -help
whatsnew8_1- ), so -statsbyfast- is of historical interest only.

An illustration:
---------------------------------
sysuse auto, clear
expand 1000
bysort make : gen int i = _n
sort i make

set rmsg on
statsby t=(_b[weight] / _se[weight]), by(i) clear nodots : regress mpg weight
set rmsg off
---------------------------------
Change -expand 1000- to -expand 2000- and -statsby- takes not twice as
long but four times as long.

Changing -regress mpg weight- to -logit foreign weight- indicates the
issue isn't unique to -regress- (or commands ultimately based on
-regress-, such as -spearman- in David Airey's example below).

David Airey wrote:
> I have found Blasnik's statsbyfast improvement, but for some reason it
> is broken in Stata 10.
>
>>
>> At what point does one give up using statsby? With just three
>> variables in my data set,
>>
>> ssrownum, iso_VSV, expression
>>
>> the following command does OK with 1000 by groups (< 20 cases in a
>> group), but is not useable with 20,000 by groups.
>>
>> statsby n=r(N) spearman=r(rho) p=r(p), by(ssrownum): spearman iso_VSV expression
>>
>> Why?
>>
>> I posted something similar a long time ago, comparing speeds of ttest
>> with if versus in and versus regress, but I'm not happy at the moment.
>>
>>
> --
> David C. Airey, Ph.D.
> Research Assistant Professor

```
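The chunking workaround mentioned in the reply (splitting the data into chunks of 500 groups and running each chunk separately) might look roughly like the sketch below. It reuses the auto-data illustration from the post; the local macro names (chunksize, nchunks, lo, hi) and the tempfile names (full, results) are hypothetical, and this is one possible arrangement under those assumptions, not necessarily the code that was actually used.

```stata
* Sketch only: run -statsby- on chunks of 500 groups and append the results.
* Macro and tempfile names here are illustrative assumptions.
sysuse auto, clear
expand 1000
bysort make : gen int i = _n
sort i make

local chunksize 500
summarize i, meanonly
local nchunks = ceil(r(max) / `chunksize')

tempfile full results
save `full'

forvalues c = 1/`nchunks' {
    * group ids covered by this chunk
    local lo = (`c' - 1) * `chunksize' + 1
    local hi = `c' * `chunksize'

    * load only this chunk's groups, then collect one t statistic per group
    use if inrange(i, `lo', `hi') using `full', clear
    statsby t=(_b[weight] / _se[weight]), by(i) clear nodots : regress mpg weight

    * accumulate results across chunks
    if `c' > 1 append using `results'
    save `results', replace
}
sort i
```

If the running time really is roughly quadratic in the number of groups, as the timings in the post suggest, then each chunk costs only a small fraction of the full run, so the total should grow closer to linearly with the number of chunks. Wrapping the loop in -set rmsg on- / -set rmsg off- would allow a direct comparison with the single-call version.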