Eric Booth <ebooth@ppri.tamu.edu>

statalist@hsphsun2.harvard.edu

Subject: Re: st: Is -collapse- the Stata's fastest routine to summarize data sets?

Date: Sat, 10 Jul 2010 01:53:36 +0000

<>

Hi Tony:

I have rmsg on permanently, so I think in that sense it is simpler to use. I also like how, in addition to showing the time it takes to run each command and an entire do-file, rmsg also displays the actual timestamp for when a command completes, which can be useful when running something overnight.

That being said, rmsg is often a better supplement to the other timing commands than a substitute for them. In the case of my example in the previous post, rmsg is a simple way to get the time of the -collapse- and -tabout- commands, since I was interested in those commands only. But if I were interested in the time to run a group/section of commands, I would be stuck adding up the rmsg times (or subtracting the timestamps). This could be a pain if I had to scroll through Results window output or a log file to find these timestamps. In that case, it is useful to use -timer- and then add -timer list- to the end of the do-file to get a report on how long each sub-section of interest took to run. In addition, you could write in some quick comparisons of the time to run sections of code using the stored values (e.g. di `r(t1)'/`r(t2)' ).

Finally, if you were interested in how long the components of programs take to run, you can use -profiler- to get a more detailed look. For instance, turning on -profiler- before running -collapse- and -tabout- would give you output like:

. profiler report
collapse
     7    0.201  collapse
     7    0.001  GetOpStat
     7    0.002  GetVarlist
    14    0.000  Setnf
     7    0.002  bynottar
    14    0.347  _sum
          0.553  Total
tabout
     7    0.052  tabout
     6    0.050  sum_oneway
   264    0.064  do_statres
     6    0.033  sum_write
     6    0.001  clearglobs
          0.200  Total
Overall total count =    359
Overall total time  =  0.753 (sec)
r; t=0.00 20:42:39

~ Eric
__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
ebooth@ppri.tamu.edu
Office: +979.845.6754
Fax: +979.845.0249
http://ppri.tamu.edu

On Jul 9, 2010, at 9:44 AM, Lachenbruch, Peter wrote:

> A quick question related to this: I note that many use the timer function to get timings. I have sometimes used rmsg (set rmsg on) which gives the timing after each command. Would this be simpler?
> Tony
>
> ________________________________________
> From: owner-statalist@hsphsun2.harvard.edu [owner-statalist@hsphsun2.harvard.edu] On Behalf Of Eric Booth [ebooth@ppri.tamu.edu]
> Sent: Thursday, July 08, 2010 5:36 PM
> To: <statalist@hsphsun2.harvard.edu>
> Subject: Re: st: Is -collapse- the Stata's fastest routine to summarize data sets?
>
> <>
>
> Tiago:
>
> When summarizing a large dataset, I've found the program that runs the fastest for me is -tabout- (from SSC).
> I don't know enough about what's going on in the tabout ado-file to know why it's faster, and it may not be faster for all types of summary tables, but when I changed from -collapse-/-contract- to -tabout- in my do-file there was a huge time savings when working with a dataset of about 60 million obs.
>
> For an illustration, here's a speed comparison for creating the same summary table with these 2 packages:
>
> ******************!
> clear all
> ** | change -set mem- and -expand- below to fit your system | **
> set mem 12g
> sysuse auto
> cap which tabout
> if _rc ssc install tabout
>
> **create a large dataset**
> expand 950000
> desc, sh
> recode rep78 (.=9)
>
> **test collapse vs. tabout**
>
> // 1. collapse
> ds make rep78, not
> local vars `r(varlist)'
> **
> timer clear 1
> timer on 1
> collapse (sum) `vars' , by(rep78)
> timer off 1
> save master
>
> // 2. tabout
> local vars: subinstr local vars " " " sum ", all
> di "`vars'"
> **
> timer clear 2
> timer on 2
> tabout rep78 using test.xls, replace sum c(sum `vars')
> timer off 2
>
> **make sure these are creating the same summary tables**
> cf _all using master.dta, verbose all
> **
> timer list
> ******************!
>
> /*
> timer list
>    1:    240.41 /        1 =     240.4130
>    2:      0.43 /        1 =       0.4340
> */
>
> 4 minutes for -collapse- versus less than a second for -tabout- summary table (using Stata 11.1 MP on Mac OS X).
> Good luck.
>
> ~ Eric
> __
> Eric A. Booth
> Public Policy Research Institute
> Texas A&M University
> ebooth@ppri.tamu.edu
> Office: +979.845.6754
>
> On Jul 8, 2010, at 9:02 AM, Tiago V. Pereira wrote:
>
>> Dear Statalister,
>>
>> I am eager to know any faster alternatives to -collapse-, because I have
>> to summarize relatively large data sets for a simulation study. -profiler-
>> is telling me that most of the computation burden comes from -collapse-.
>> Do you know (have) any faster alternative? Perhaps a plug-in?
>>
>> Thanks!
>>
>> Tiago

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
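
For anyone wanting to try the three timing approaches discussed in the reply at the top of this message (rmsg, -timer- with a ratio of the stored values, and -profiler-), a minimal sketch of a do-file might look like the one below. It is not code from the thread: the variables summed, the use of -preserve-/-restore-, and the timer numbers 1 and 2 are illustrative assumptions, and on a dataset as small as auto the timings may round to zero (in which case the ratio displays as missing).

```stata
* Sketch only: three ways to time sections of a do-file.
* Dataset, variables, and timer numbers are illustrative assumptions.
clear all
sysuse auto

* (a) rmsg: prints elapsed time and a timestamp after every command
set rmsg on
summarize price
set rmsg off

* (b) timer: time whole sections, then compare the stored values
timer clear

timer on 1
preserve                                  // keep the full data for section 2
collapse (sum) price mpg weight, by(rep78)
restore
timer off 1

timer on 2
summarize price mpg weight
timer off 2

timer list                                // also leaves r(t1), r(t2) behind
display "section 1 / section 2 = " r(t1)/r(t2)

* (c) profiler: time spent inside the subroutines of ado programs
profiler clear
profiler on
preserve
collapse (sum) price mpg weight, by(rep78)
restore
profiler off
profiler report
```

The ratio line reads r(t1) and r(t2) directly after -timer list-, so it needs to run before any other command that resets r(). See -help timer- and -help profiler- for the exact behavior in your version of Stata.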

