
st: Re: speed question: collapse vs egen

From   Kit Baum
Subject   st: Re: speed question: collapse vs egen
Date   Sat, 26 Apr 2008 07:54:20 -0400

First of all let me say that I think the notion of machine- and operating-system-specific plugins for Stata is largely obsolete. StataCorp itself has moved heavily from development in C to development in Mata, and the avowed aim is to have virtually all of Stata "written in Stata": that is, in ado-code or Mata. Yes, it is slower than pure C, but it is also much easier to code, maintain and support. We have over the years had a couple of routines with plugins in the SSC Archive and those developers who tried to make them available for multiple platforms were going nuts. StataCorp has one of every machine that they support in-house, so they can afford to develop and distribute working C "DLLs" for every combination. Most of us do not, and code that only works on a particular platform is not IMHO very useful.

I'm sure that Bill Gould can spot places in this code that could be made more efficient. But as should be evident, Mata can do this job quite respectably, improving on pure Stata code. If I cheat and take advantage of the fact that the by-variable rep78 takes on the values 1, 2, 3, 4, 5, it runs about 20% faster than this. But here are my timings for Sergiy's program, where for the fourth method I have replaced his routine (which would not run on my machine anyway) with my Mata call:

. timer list
   1:     18.08 /        1 =      18.0780
   2:     16.96 /        1 =      16.9580
   3:     14.17 /        1 =      14.1700
   4:      7.70 /        1 =       7.6980

with results

                 1             2
  1 |            1        4564.5  |
  2 |            2      5967.625  |
  3 |            3   6429.233333  |
  4 |            4        6071.5  |
  5 |            5          5913  |

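The Mata code and the Stata code calling it are:

mata:
void mucalc2(string scalar bv,
             string scalar vv,
             string scalar touse)
{
  // mu accumulates one row per by-group: (level, mean)
  mu = J(0, 2, .)
  // view onto the by-variable and the analysis variable, marked sample only
  st_view(X=., ., (bv, vv), touse)
  // levels of the by-variable, read from the caller's local macro rr
  a = strtoreal(tokens(st_local("rr")))
  for(i=1; i <= cols(a); i++) {
    mu = mu \ mean(select(X, X[.,1] :== a[i]))
  }
  // display the matrix of results
  mu
}
end

  timer on 4
  mark touse
// handles missings in both price and rep78
// also not limited by Stata's matrix limits
  markout touse price rep78
  qui levelsof rep78, local(rr)
  mata: mucalc2("rep78", "price", "touse")
  timer off 4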

This code may not be as fast as Sergiy's plugin (and both his code and this Mata code can doubtless be improved) but it is a hell of a lot more portable, as it will run on any machine with Stata 9.x or better. I think that development along these lines is much more in keeping with the spirit of the Stata user community.

For Mata mavens, note that my first draft made use of panelsetup() with rep78 as the panel variable. It worked, but turned in timings almost identical to those of the Stata-based methods 1, 2 and 3.
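
The panelsetup() draft itself is not shown in this post. As a rough illustration only, a version along those lines might look like the sketch below. The function name mucalc_panel is hypothetical, and the sketch assumes the data have already been sorted by the by-variable and that touse has been created with mark and markout as above.

```stata
mata:
// Hypothetical sketch, not the original first draft.
// Assumes the data are sorted by the by-variable bv.
void mucalc_panel(string scalar bv,
                  string scalar vv,
                  string scalar touse)
{
  real matrix X, info, mu
  real scalar i

  // view onto (by-variable, analysis variable) for the marked sample
  st_view(X, ., (bv, vv), touse)
  // one row per panel: first and last observation of each by-group
  info = panelsetup(X, 1)
  mu = J(rows(info), 2, .)
  for (i=1; i <= rows(info); i++) {
    // column means of the block: (group level, group mean)
    mu[i, .] = mean(panelsubmatrix(X, i, info))
  }
  mu
}
end

sort rep78
mata: mucalc_panel("rep78", "price", "touse")
```

The sort is needed because panelsetup() identifies panels from consecutive runs of equal values in the chosen column.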


Kit Baum, Boston College Economics and DIW Berlin
An Introduction to Modern Econometrics Using Stata:

On Apr 26, 2008, at 02:33, Sergiy wrote:

Jeph has asked about an efficient way of creating a dataset with means
of one variable over the categories of another variable. He suggested
two possible solutions and Stas added a third one.

Below I report performance of each of these methods and compare it
with the fourth: a plugin.

I use an expanded version of auto.dta and tabulate mean {price} by
different levels of {rep78}.
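
The code for the benchmark is not included in this message. The following is only a minimal sketch of how a comparison of this kind could be set up, using the two approaches named in the thread title (collapse and egen). The expansion factor, the timer numbering and the variable name first are assumptions made for illustration.

```stata
sysuse auto, clear
expand 10000                  // hypothetical expansion factor, not from the post
drop if missing(rep78)        // keep only the five observed levels of rep78
timer clear

* collapse-based approach
preserve
timer on 1
collapse (mean) meanprice = price, by(rep78)
timer off 1
list
restore

* egen-based approach: compute group means, then keep one row per group
preserve
timer on 2
egen double meanprice = mean(price), by(rep78)
egen byte first = tag(rep78)
keep if first
keep meanprice rep78
timer off 2
list
restore

timer list
```

Both approaches should produce one row per level of rep78, so the two listings can be compared directly before reading the timings.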

1. All methods resulted in the following table of results:

    meanprice   rep78
       4564.5       1
     5967.625       2
     6429.233       3
       6071.5       4
         5913       5

2. The timing is as follows (Stata SE, Windows Server 2003, 32-bit)

   1:     33.80 /        1 =      33.7960
   2:     31.22 /        1 =      31.2190
   3:     21.33 /        1 =      21.3280
   4:      5.58 /        1 =       5.5780

3. Since the plugin was intended for similar but not exactly the same
purposes, it does some extra work (simultaneously computing
frequencies, etc), which means that this is not the ultimate record.