
st: RE: RE: re: statsby slowness

From   "Nick Cox" <>
To   <>
Subject   st: RE: RE: re: statsby slowness
Date   Mon, 20 Aug 2007 10:01:55 +0100

The fact that -if- is always slower than an equivalent -in- 
I call Blasnik's Law, not because Michael discovered it, but 
because it needs a good name and he has done more than any 
other user to make people aware of it. 


Compare

keep in 1/100                        (1) 

and

keep if _n <= 100                    (2) 

and imagine Stata implementing each of these. You 
can tell at a glance that they mean the same thing, 
but you're a human and you are good at working out 
meanings. 

With (1), Stata can work out very fast to -keep-
the first 100 obs and -drop- everything else. 

With (2), Stata is obliged by its own rules to test
every observation number _n against <= 100, and 
to ask itself lots of questions like 

	_n is 2345. Is that <= 100? No. 
	So, don't -keep- this obs. 


	_n is 123456789. Is that <= 100? No. 
	So, don't -keep- this obs. 
	_n is 123456790. Is that <= 100? No. 
	So, don't -keep- this obs. 

	and so on, 

because it has no intelligence to see the implication 
that once you are past 100, further testing is pointless. 

Hence the rule: Use -in- rather than -if- when they 
are equivalent. Remember that with -if- Stata tests 
_every_ observation to check whether the condition is 
true, utterly regardless of whether it is "obvious"
that it need not do that. Stata doesn't do "obvious". 
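
If you want to see the difference on your own machine, here is a 
rough timing sketch. The dataset size, the variable x, and the 
timer numbers are my own choices for illustration, and -count- 
stands in for -keep- so that the data survive both runs. Timings 
will vary by machine, so no particular figures are claimed. 

```stata
* Illustrative setup: one million observations of a made-up variable
clear
set obs 1000000
generate double x = runiform()

timer clear

* (1) -in-: Stata goes directly to observations 1 to 100
timer on 1
count in 1/100
timer off 1

* (2) -if-: Stata evaluates _n <= 100 for every observation
timer on 2
count if _n <= 100
timer off 2

* Show elapsed time for timers 1 and 2
timer list
```

Both commands should report the same count; only the elapsed 
times should differ. 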


Nick Cox
> Interesting. You may get a bit more speed if 
> you replace this 
> egen rank_1 = rank(expression), by(ssrownum)
> egen rank_2 = rank(iso_VSV), by(ssrownum)
> egen corr = corr(rank_1 rank_2), by(ssrownum)
> by this: 
> sort ssrownum
> by ssrownum : egen rank_1 = rank(expression)
> by ssrownum : egen rank_2 = rank(iso_VSV)
> by ssrownum : egen corr = corr(rank_1 rank_2)
> The two code segments are equivalent in what 
> you end with, but not in when they -sort-. 
> Similarly 
> keep if _n >= `start' & _n <= `stop'
> should be faster as
> keep in `start'/`stop'
> and I would always use the built-in -sqrt()- 
> when it applies, rather than powering to 0.5. 
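
The two equivalences claimed in the quoted message can be checked 
on a small example. This is only a sketch: auto.dta and its 
variables foreign and price stand in for the original poster's 
data, the values of start and stop are arbitrary, and the 
variable names rank_a and rank_b and the tempfile name are mine. 
It is meant to be run as a do-file. 

```stata
* Sketch only: auto.dta, foreign, and price stand in for the poster's data
sysuse auto, clear

* (a) by() option versus one explicit -sort- plus the by: prefix
egen rank_a = rank(price), by(foreign)
sort foreign
by foreign: egen rank_b = rank(price)
assert rank_a == rank_b

* (b) -keep if- on _n versus -keep in- for the same range
local start 11
local stop 20
preserve
keep if _n >= `start' & _n <= `stop'
tempfile via_if
save `via_if'
restore
keep in `start'/`stop'

* -cf- exits with an error if any variable differs between the two results
cf _all using `via_if'
```

If the -assert- and the -cf- both pass silently, the two forms 
end with the same results, which is the sense in which they are 
interchangeable apart from speed. 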
