Statalist


Re: st: statsby slowness


From   Roger Harbord <rogerharbord@bigfoot.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: statsby slowness
Date   Tue, 14 Aug 2007 10:14:44 +0100

I've had similar experiences - based on a bit of experimentation, the 
time taken by -statsby- appears to be quadratic in the number of groups 
of the by() variable, at least once you get beyond a few hundred or so 
groups. For an example, see the code below.

However, I'm not sure that giving up on -statsby- helps. I've tried 
programming a simulation from scratch, and the problem appeared to be 
inherent to -in- when you're selecting one of a large number of groups. 
I ended up splitting the data into chunks of 500 groups each and running 
each chunk separately. (By the way, I had good reasons for generating all 
10000 simulations in one go rather than taking the approach of 
-simulate-, which generates and estimates each simulation in turn and 
is clearly often a better approach.)
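
The chunking workaround described above might look roughly like the 
following. This is only a sketch, not the code actually used: it assumes 
the same test data as the illustration further down (group variable i), 
and the local names chunksize, nchunks, lo and hi and the tempfiles full 
and results are made up for the example. It is meant to be run from a 
do-file.

```stata
* Sketch: run -statsby- on chunks of 500 groups and stack the results.
* All local and tempfile names here are illustrative assumptions.
sysuse auto, clear
expand 1000
bysort make : gen int i = _n
sort i make

local chunksize 500
summarize i, meanonly
local nchunks = ceil(r(max) / `chunksize')

tempfile full results
save `full'

forvalues c = 1/`nchunks' {
    local lo = (`c' - 1) * `chunksize' + 1
    local hi = `c' * `chunksize'
    * load only the groups belonging to this chunk
    use if inrange(i, `lo', `hi') using `full', clear
    statsby t=(_b[weight] / _se[weight]), by(i) clear nodots : regress mpg weight
    * stack this chunk's results on top of the earlier ones
    if `c' > 1 append using `results'
    save `results', replace
}
sort i
```

If the time per call really is quadratic in the number of groups, k 
chunks should cost roughly 1/k of the time of a single call, at the 
price of some extra file input and output.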

I believe the speed improvements in Michael Blasnik's -statsbyfast- were 
incorporated into the official -statsby- some time ago (see -help 
whatsnew8_1- ), so -statsbyfast- is of historical interest only.

An illustration :
---------------------------------
sysuse auto, clear
expand 1000
bysort make : gen int i = _n
sort i make

set rmsg on
statsby t=(_b[weight] / _se[weight]), by(i) clear nodots : regress mpg weight
set rmsg off
---------------------------------
Change -expand 1000- to -expand 2000- and -statsby- takes not twice as 
long but four times as long.
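
One way to check the scaling without reading the -rmsg- output by eye is 
to time both sizes with -timer-. The sketch below assumes the same test 
data as the illustration; the locals t1000 and t2000 are names invented 
for the example, and the timings will of course depend on the machine 
and the Stata version.

```stata
* Sketch: time -statsby- at two group counts and compare.
foreach n in 1000 2000 {
    sysuse auto, clear
    expand `n'
    bysort make : gen int i = _n
    sort i make
    timer clear 1
    timer on 1
    quietly statsby t=(_b[weight] / _se[weight]), by(i) clear nodots : regress mpg weight
    timer off 1
    * -timer list- leaves the elapsed seconds in r(t1)
    quietly timer list 1
    local t`n' = r(t1)
}
display "1000 groups: " %8.1f `t1000' " s"
display "2000 groups: " %8.1f `t2000' " s"
display "ratio:       " %8.2f `t2000' / `t1000'
```

A ratio near 2 would indicate linear scaling; a ratio near 4 is what 
quadratic scaling predicts.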

Changing -regress mpg weight- to -logit foreign weight- indicates the 
issue isn't unique to -regress- (or commands ultimately based on 
-regress-, such as -spearman- in David Airey's example below).



David Airey wrote:
> I have found Blasnik's statsbyfast improvement, but for some reason it 
> is broken in Stata 10.
>
>>
>> At what point does one give up using statsby? With just three 
>> variables in my data set,
>>
>> ssrownum, iso_VSV, expression
>>
>> the following command does OK with 1000 by groups (< 20 cases in a 
>> group), but is not useable with 20,000 by groups.
>>
>> statsby n=r(N) spearman=r(rho) p=r(p), by(ssrownum): spearman iso_VSV expression
>>
>> Why?
>>
>> I posted something similar a long time ago, comparing speeds of ttest 
>> with if versus in and versus regress, but I'm not happy at the moment.
>>
>>
> -- 
> David C. Airey, Ph.D.
> Research Assistant Professor


