
From: "Nick Cox" <n.j.cox@durham.ac.uk>
To: <statalist@hsphsun2.harvard.edu>
Subject: st: RE: RE: re: statsby slowness
Date: Mon, 20 Aug 2007 10:01:55 +0100

The fact that -if- is always slower than an equivalent -in- I call Blasnik's Law, not because Michael discovered it, but because it needs a good name and he has done more than any other user to make people aware of it.

Compare

    keep in 1/100               (1)

and

    keep if _n <= 100           (2)

and imagine Stata implementing each of these. You should be able to tell at a glance that they mean the same thing, but you're a human and you are good at working out meanings.

With (1), Stata can work out very fast to -keep- the first 100 obs and -drop- everything else.

With (2), Stata is obliged by its own rules to test every observation number _n against <= 100, and to ask itself lots of questions like

    _n is 2345. Is that <= 100? No. So, don't -keep- this obs.
    ....
    _n is 123456789. Is that <= 100? No. So, don't -keep- this obs.
    _n is 123456790. Is that <= 100? No. So, don't -keep- this obs.

and so on, because it has no intelligence to see the implication that once you are past 100, further testing is futile.

Hence the rule: Use -in- rather than -if- when they are equivalent.

Remember that with -if- Stata tests _every_ observation to check whether the condition is true, utterly regardless of whether it is "obvious" that it need not do that. Stata doesn't do "obvious".

Nick
n.j.cox@durham.ac.uk

Nick Cox

> Interesting. You may get a bit more speed if
> you replace this
>
> egen rank_1 = rank(expression), by(ssrownum)
> egen rank_2 = rank(iso_VSV), by(ssrownum)
> egen corr = corr(rank_1 rank_2), by(ssrownum)
>
> by this:
>
> sort ssrownum
> by ssrownum : egen rank_1 = rank(expression)
> by ssrownum : egen rank_2 = rank(iso_VSV)
> by ssrownum : egen corr = corr(rank_1 rank_2)
>
> The two code segments are equivalent in what
> you end with, but not in when they -sort-.
>
> Similarly
>
> keep if _n >= `start' & _n <= `stop'
>
> should be faster as
>
> keep in `start'/`stop'
>
> and I would always use the built-in -sqrt()-
> when it applies, rather than powering to 0.5.
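
For anyone who wants to see the -in- versus -if- difference on their own machine, here is a minimal timing sketch. It is not from the original thread: the dataset size and the variable name x are made up for illustration, and on older Stata versions you may need to -set memory- first to hold that many observations. It uses -timer- to time only the two -keep- commands, with -preserve- and -restore- kept outside the timed region.

```stata
* Hypothetical timing sketch: -keep in- versus the equivalent -keep if-.
* The number of observations and the variable x are illustrative assumptions.
clear
set obs 5000000
generate long x = _n

timer clear

* (1) -in-: Stata can go straight to observations 1 through 100
preserve
timer on 1
keep in 1/100
timer off 1
restore

* (2) -if-: Stata evaluates _n <= 100 for every observation
preserve
timer on 2
keep if _n <= 100
timer off 2
restore

* Report elapsed time for timers 1 and 2
timer list
```

Both commands leave the same 100 observations, so any difference reported by -timer list- reflects only how Stata gets there. How big the gap is will depend on the size of the dataset and on the machine.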
