
Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.



RE: st: Is -collapse- the Stata's fastest routine to summarize data sets?


From   "Lachenbruch, Peter" <[email protected]>
To   "[email protected]" <[email protected]>
Subject   RE: st: Is -collapse- the Stata's fastest routine to summarize data sets?
Date   Fri, 9 Jul 2010 07:44:53 -0700

A quick question related to this: I note that many use the timer function to get timings.  I have sometimes used rmsg (set rmsg on) which gives the timing after each command.  Would this be simpler?
Tony
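
For comparison, here is a minimal sketch of the two approaches. The auto dataset and the small -collapse- call are only illustrations and are not taken from the thread. -set rmsg on- prints the elapsed time after every command, while -timer- times only the span you bracket and leaves the result in r().

```stata
* Approach 1: rmsg reports the elapsed time after each command
sysuse auto, clear
set rmsg on
collapse (mean) price mpg, by(rep78)
set rmsg off

* Approach 2: timer measures only the bracketed span
sysuse auto, clear
timer clear 1
timer on 1
collapse (mean) price mpg, by(rep78)
timer off 1
timer list 1            // stores elapsed seconds in r(t1)
display r(t1)
```

rmsg is probably simpler for interactive work; -timer- may be more convenient inside a do-file, since several timers can accumulate over repeated runs and be listed together at the end.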

________________________________________
From: [email protected] [[email protected]] On Behalf Of Eric Booth [[email protected]]
Sent: Thursday, July 08, 2010 5:36 PM
To: <[email protected]>
Subject: Re: st: Is -collapse- the Stata's fastest routine to summarize data sets?

<>

Tiago:

When summarizing a large dataset, I've found the program that runs the fastest for me is -tabout- (from SSC).
I don't know enough about what's going on in the tabout ado-file to know why it's faster, and it may not be faster for all types of summary tables, but when I changed from -collapse-/-contract- to -tabout- in my do-file there was a huge time savings when working with a dataset of about 60 million obs.

For an illustration, here's a speed comparison for creating the same summary table with these two commands:

******************!
clear all
** |  change -set mem- and -expand- below to fit your system  | **
set mem 12g
sysuse auto
** install -tabout- from SSC if it is not already available
cap which tabout
if _rc ssc install tabout

**create a large dataset**
expand 950000
desc, sh
recode rep78 (.=9)

**test collapse vs. tabout**

//  1. collapse
ds make rep78, not
local vars `r(varlist)'
**
timer clear 1
timer on 1
collapse (sum) `vars'  , by(rep78)
timer off 1
** note: -collapse- replaces the data in memory with the summary rows
save master, replace

//  2.  tabout
local vars: subinstr local vars " " " sum ", all
di "`vars'"
**
timer clear 2
timer on 2
tabout rep78 using test.xls, replace sum c(sum `vars')
timer off 2

**make sure these are creating the same summary tables**
cf _all using master.dta, verbose all
**
timer list
******************!

/*
  timer list
   1:    240.41 /        1 =     240.4130
   2:      0.43 /        1 =       0.4340
*/

About 4 minutes for -collapse- versus less than a second for the -tabout- summary table (using Stata 11.1 MP on Mac OS X).
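
One thing to keep in mind when reading these numbers: -collapse- replaces the data in memory, so in the do-file above -tabout- runs on the already-collapsed rows rather than on the expanded dataset. A hedged sketch of a like-for-like comparison follows, wrapping -collapse- in -preserve-/-restore- so that both commands see the same data. The -expand 1000- factor and the macro name tvars are assumptions chosen so the sketch runs quickly; the -tabout- call is copied unchanged from the do-file above, and no timings are claimed for it.

```stata
* Sketch only: time -collapse- and -tabout- on the same full dataset
clear all
sysuse auto
capture which tabout
if _rc ssc install tabout
expand 1000                 // assumption: small factor for a quick run
recode rep78 (.=9)

ds make rep78, not
local vars `r(varlist)'

* 1. collapse, with preserve/restore kept outside the timed span
timer clear
preserve
timer on 1
collapse (sum) `vars', by(rep78)
timer off 1
restore                     // bring back the full expanded data

* 2. tabout, now run on the same full data
local tvars: subinstr local vars " " " sum ", all
timer on 2
tabout rep78 using test.xls, replace sum c(sum `tvars')
timer off 2

timer list
```

Keeping -preserve- and -restore- outside the timed span means that only the summarizing commands themselves are measured, not the cost of writing the data to disk and reading it back.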
Good luck.

~ Eric
__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
[email protected]
Office: +979.845.6754



On Jul 8, 2010, at 9:02 AM, Tiago V. Pereira wrote:

> Dear Statalister,
>
> I am eager to know any faster alternatives to -collapse-, because I have
> to summarize relatively large data sets for a simulation study. -profiler-
> is telling me that most of the computation burden comes from -collapse-.
> Do you know (have) any faster alternative? Perhaps a plug-in?
>
> Thanks!
>
> Tiago
>



*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

