[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

From |
"Sergiy Radyakin" <serjradyakin@gmail.com> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: speed question: -collapse- vs -egen- |

Date |
Fri, 25 Apr 2008 21:12:55 -0400 |

Hello All! Jeph has asked about an efficient way of creating a dataset with means of one variable over the categories of another variable. He suggested two possible solutions and Stas added a third one. Below I report performance of each of these methods and compare it with the fourth: a plugin. I use an expanded version of auto.dta and tabulate mean {price} by different levels of {rep78}. 1. All methods resulted in the following table of results* meanprice rep78 4564.5 1 5967.625 2 6429.233 3 6071.5 4 5913 5 2. The timing is as follows (Stata SE, Windows Server 2003, 32-bit) 1: 33.80 / 1 = 33.7960 2: 31.22 / 1 = 31.2190 3: 21.33 / 1 = 21.3280 4: 5.58 / 1 = 5.5780 3. Since the plugin was intended for similar but not exactly the same purposes, it does some extra work (simultaneously computing frequencies, etc), which means that this is not the ultimate record. 4. The plugin must be "plugged-in" before use. To achieve this, I first call the plugin without timing, so that Stata loads the DLL and becomes aware of it. This process takes about 3 seconds on this particular machine, because each DLL loaded into memory is scanned by an antivirus on-the-fly. This time-loss is a one-time loss, and if Jeph calls this program routinely, he should not be concerned about this fixed cost. Even with this overhead, the plugin still easily beats all of the competition above. 5. The benchmark program is listed below. -ftabstat- is an ado-wrapper for the plugin. So if anyone wants to reproduce these results, must first obtain this plugin from me (please write "ftabstat" in the email subject line). 6. This particular plugin has a limitation of the matrix size for groups (which is 11000 in most versions of Stata). It also does not properly handle the missing values in categorical variable (that's why I discard observations with missing {rep78}), but this can all be done without a penalty in terms of execution time. Another limitation is that the plugin is platform-specific, and this one is for Win32 only (it will run with Stata 32-bit on Windows 64-bit, though). 7. If Jeph is concerned about speed - plugins are the way to go. 8. I would be glad to see a solution in Mata posted here to see how efficient would that be. 9. the name "ftabstat" comes from "fast tabstat" as originally the plugin was "tabstat on steroids". But then a couple of other features were added. tabstat price, by(rep78) save takes ~45 seconds to just create a matrix, so it does not even qualify. Have a good weekend, Sergiy Radyakin // Benchmark program speed.do ------------------------------------------------------ set more off set mem 500m log using speed.txt, text replace // get data sysuse auto, clear keep rep78 price keep if !missing(rep78) expand 100000 timer clear timer on 1 ftabstat price if _n==1 // kludge to load the plugin timer off 1 timer list 1 timer clear 1 preserve // start Jeph Herrin 1 timer on 1 bys rep78: egen meanprice=mean(price) bys rep78: keep if _n==1 keep rep78 meanprice timer off 1 list rep78 meanprice, clean noobs abb(10) restore preserve // start Jeph Herrin 2 timer on 2 collapse (mean) meanprice=price,by(rep78) timer off 2 list rep78 meanprice, clean noobs abb(10) restore preserve // start Stas Kolenikov timer on 3 gen byte one=1 if !missing(price) bys rep78: gen meanprice = sum(price)/sum(one) by rep78: keep if _n==_N keep rep78 meanprice timer off 3 list rep78 meanprice, clean noobs abb(10) restore preserve // start Sergiy Radyakin timer on 4 ftabstat price, by(rep78) drop _all matrix A=e(b)' svmat A,names(col) rename y1 meanprice matrix A=e(Row)' svmat A, names(col) rename r1 rep78 timer off 4 list rep78 meanprice, clean noobs abb(10) restore timer list log close // end of benchmark program speed.do ------------------------------------------------------ *PS: The solution suggested by Stas will yield incorrect results, since the sum of {one} will also count observations for which the variable of interest is missing. This is definitely a typo and I have corrected it before doing any comparisons by changing the following line: gen byte one = 1 to gen byte one = 1 if !missing(price) On 4/25/08, Stas Kolenikov <skolenik@gmail.com> wrote: > NJC can offer a precise answer, but my take would be > > gen byte one = 1 > bys group: gen varmean = sum(mean)/sum(one) > by group: keep if _n==_N > keep whatever > > Topics like those should've been covered somewhere in Nick's column in > Stata Journal, or in Stata tips. -egen- is slow as it does a lot of > checks and parsing and stuff -- for big processing jobs, single-liners > like above are always notably faster. -collapse- should be at least a > tad faster than -egen-, but again I would expect it to lose to the > above code. > > On Fri, Apr 25, 2008 at 2:37 PM, Jeph Herrin <junk@spandrel.net> wrote: > > > > I'm optimizing some code that needs to run often > > for a simulation, and am wondering if I should > > expect any difference in processing time between > > > > bys group: egen varmean=mean(myvar) > > bys group: keep if _n==1 > > keep group varmean > > > > and > > > > collapse (mean) varmean=myvar, by(group) > > > > and if so, which would be faster? > > > > I know I could run some tests myself, but figured > > that others had either already done so or at least > > would have some insight. > > > > > > > -- > Stas Kolenikov, also found at http://stas.kolenikov.name > > Small print: Please do not reply to my Gmail address as I don't check > it regularly. > * > * For searches and help try: > * http://www.stata.com/support/faqs/res/findit.html > * http://www.stata.com/support/statalist/faq > * http://www.ats.ucla.edu/stat/stata/ > * * For searches and help try: * http://www.stata.com/support/faqs/res/findit.html * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:**Re: st: speed question: -collapse- vs -egen-***From:*"Michael Blasnik" <michael.blasnik@verizon.net>

**References**:**st: speed question: -collapse- vs -egen-***From:*Jeph Herrin <junk@spandrel.net>

**Re: st: speed question: -collapse- vs -egen-***From:*"Stas Kolenikov" <skolenik@gmail.com>

- Prev by Date:
**Re: st: xtreg fe cluster and F-test** - Next by Date:
**st: RES: RE: using pscore and pstest** - Previous by thread:
**Re: st: speed question: -collapse- vs -egen-** - Next by thread:
**Re: st: speed question: -collapse- vs -egen-** - Index(es):

© Copyright 1996–2014 StataCorp LP | Terms of use | Privacy | Contact us | What's new | Site index |