Re: st: Is -collapse- the Stata's fastest routine to summarize data sets?


From   Eric Booth <ebooth@ppri.tamu.edu>
To   "<statalist@hsphsun2.harvard.edu>" <statalist@hsphsun2.harvard.edu>
Subject   Re: st: Is -collapse- the Stata's fastest routine to summarize data sets?
Date   Sat, 10 Jul 2010 01:53:36 +0000

Hi Tony:

I have rmsg on permanently, so in that sense I think it is simpler to use.  I also like that, in addition to showing the time each command (and an entire do-file) takes to run, rmsg displays the actual timestamp for when a command completes, which can be useful when running something overnight.
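
A minimal sketch of what that looks like, assuming nothing beyond the auto dataset that ships with Stata (the commands being timed are placeholders):

******************!
** turn on return messages for this session only **
set rmsg on
sysuse auto, clear
summarize price
** or keep it on across sessions **
set rmsg on, permanently
******************!

After each command Stata prints a line of the form  r; t=<seconds> <time of day>  , like the last line of the -profiler- output further down.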

That being said, rmsg is often better as a supplement to the other timing commands than as a substitute for them.  In the example in my previous post, rmsg would be a simple way to get the time for the -collapse- and -tabout- commands, since I was interested in those commands only.  But if I were interested in the time to run a group/section of commands, I would be stuck adding up the rmsg times (or subtracting the timestamps), which could be a pain if I had to scroll through Results window output or a log file to find them.  In that case, it is more useful to wrap each section in -timer on #- / -timer off #- and add -timer list- to the end of the do-file to get a report on how long each section of interest took to run.  Since -timer list- also saves the times in r(), you could write in some quick comparisons of the time to run sections of code using the stored values  (e.g.    di  `r(t1)'/`r(t2)'  ).
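
Something along these lines would do it. This is only a sketch: the two sections are arbitrary stand-ins run on the auto dataset, and the timer numbers 1 and 2 are my choice.

******************!
sysuse auto, clear
timer clear
** section 1 **
timer on 1
summarize price mpg weight
tabulate rep78 foreign
timer off 1
** section 2 **
timer on 2
regress price mpg weight
timer off 2
** report, then compare using the values -timer list- leaves in r() **
timer list
di "section 1 / section 2 = " r(t1)/r(t2)
******************!

On a dataset this small both timers will likely read close to zero, so the ratio may display as missing; with sections that take real time it gives the relative cost directly.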

Finally, if you are interested in how long the components of a program take to run, you can use -profiler- to get a more detailed look.  For instance, turning -profiler- on before running -collapse- and -tabout- would give you output like:

.   profiler report
collapse
     7    0.201  collapse
     7    0.001  GetOpStat
     7    0.002  GetVarlist
    14    0.000  Setnf
     7    0.002  bynottar
    14    0.347  _sum
          0.553  Total
tabout
     7    0.052  tabout
     6    0.050  sum_oneway
   264    0.064  do_statres
     6    0.033  sum_write
     6    0.001  clearglobs
          0.200  Total
Overall total count =    359
Overall total time  =      0.753 (sec)
r; t=0.00 20:42:39
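
To get a report like that one, the sequence is roughly as follows (a sketch on the auto dataset with a single -collapse-, not the exact run above):

******************!
sysuse auto, clear
** reset any earlier profiling results, then start recording **
profiler clear
profiler on
collapse (sum) price mpg, by(rep78)
** stop recording and list calls and time per (sub)program **
profiler off
profiler report
******************!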

~ Eric

__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
ebooth@ppri.tamu.edu
Office: +979.845.6754
Fax: +979.845.0249
http://ppri.tamu.edu




On Jul 9, 2010, at 9:44 AM, Lachenbruch, Peter wrote:

> A quick question related to this: I note that many use the timer function to get timings.  I have sometimes used rmsg (set rmsg on) which gives the timing after each command.  Would this be simpler?
> Tony
> 
> ________________________________________
> From: owner-statalist@hsphsun2.harvard.edu [owner-statalist@hsphsun2.harvard.edu] On Behalf Of Eric Booth [ebooth@ppri.tamu.edu]
> Sent: Thursday, July 08, 2010 5:36 PM
> To: <statalist@hsphsun2.harvard.edu>
> Subject: Re: st: Is -collapse- the Stata's fastest routine to summarize data sets?
> 
> Tiago:
> 
> When summarizing a large dataset, I've found the program that runs the fastest for me is -tabout- (from SSC).
> I don't know enough about what's going on in the tabout ado-file to know why it's faster, and it may not be faster for all types of summary tables, but when I changed from -collapse-/-contract- to -tabout- in my do-file there was a huge time savings when working with a dataset of about 60 million obs.
> 
> For an illustration, here's a speed comparison for creating the same summary table with these 2 packages:
> 
> ******************!
> clear all
> ** |  change -set mem- and -expand- below to fit your system  | **
> set mem 12g
> sysuse auto
>        cap which tabout
>        if _rc ssc install tabout
> 
> **create a large dataset**
> expand 950000
> desc, sh
>        recode rep78 (.=9)
> 
> **test collapse vs. tabout**
> 
> //  1. collapse
> ds make rep78, not
> local vars `r(varlist)'
> **
> timer clear 1
> timer on 1
> collapse (sum) `vars'  , by(rep78)
> timer off 1
> save master, replace
> 
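> //  2.  tabout  (note: -collapse- above replaced the data in memory, so this step runs on the collapsed data unless step 1 is wrapped in -preserve-/-restore-)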
> local vars: subinstr local vars " " " sum ", all
> di "`vars'"
> **
> timer clear 2
> timer on 2
> tabout rep78 using test.xls, replace sum c(sum `vars')
> timer off 2
> 
> **make sure these are creating the same summary tables**
> cf _all using master.dta, verbose all
> **
> timer list
> ******************!
> 
> /*
>  timer list
>   1:    240.41 /        1 =     240.4130
>   2:      0.43 /        1 =       0.4340
> */
> 
> About 4 minutes for -collapse- versus less than a second for the -tabout- summary table (using Stata 11.1 MP on Mac OS X).
> Good luck.
> 
> ~ Eric
> __
> Eric A. Booth
> Public Policy Research Institute
> Texas A&M University
> ebooth@ppri.tamu.edu
> Office: +979.845.6754
> 
> 
> 
> On Jul 8, 2010, at 9:02 AM, Tiago V. Pereira wrote:
> 
>> Dear Statalister,
>> 
>> I am eager to know any faster alternatives to -collapse-, because I have
>> to summarize relatively large data sets for a simulation study. -profiler-
>> is telling me that most of the computation burden comes from -collapse-.
>> Do you know (have) any faster alternative? Perhaps a plug-in?
>> 
>> Thanks!
>> 
>> Tiago
>> 