


st: Is there a way to use Mata to speed up within-group extrema search in Stata?

From   Billy Schwartz
Subject   st: Is there a way to use Mata to speed up within-group extrema search in Stata?
Date   Wed, 27 Jul 2011 13:09:20 -0400

I'm wondering if there is a way to make finding max and min with -by-
fast by using Mata. I tend to work with large datasets -- around 10 GB
in size -- big enough that many of the technicalities I wasn't
supposed to worry about when I first started on Stata, like variable
datatypes, how frequently I read/write to disk, etc., really begin to
matter. And I have noticed more and more that much of my time on Stata
is spent waiting for Stata to finish sorting, usually so
that I can find a minimum or maximum value. Stata has a really fast
-sum()- function for use with -by:- but not an equivalent -max()-
function, so you have to sort and select. Sorting algorithms, though
fast, are not as fast as extrema-finding algorithms: a comparison sort
takes O(n log n) time, while finding an extremum takes one O(n) pass.

For example, suppose I have panel data of bills by account and date,
and each bill has a description code for each line item on the bill
and an amount for each line item. Further, the dataset is sorted by
account date:

account   date   desc   amount
1         1      1      5.95
1         1      3      2.94
1         2      1      5.95
1         2      2      9.45
1         2      3      3.00
2         3      7      6.22

If I want to identify bills that contain an item with description value
2, the fastest, lowest-memory-overhead way I know to do it is:

. generate byte desc2 = desc == 2
. bysort account date (desc2): replace desc2 = desc2[_N]

If there were a max function that worked like the sum function (I'm
not talking about the one Stata currently has, which doesn't work like
this), I could avoid the sort, since as I said my data is already
sorted by account date, and write merely:

. by account date: generate byte desc2 = max(desc == 2)
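
One possible way to get the effect of such a by-group max without a sort is a running maximum, sketched below on the example data. It relies on two assumptions: the data are already sorted by account date (the -sort- line here only sets that up for the toy data), and -replace- processes observations in order, so desc2[_n-1] already holds the updated value. Whether this beats the sort on a 10 GB dataset would need timing.

```stata
* Toy copy of the example data
clear
input account date desc amount
1 1 1 5.95
1 1 3 2.94
1 2 1 5.95
1 2 2 9.45
1 2 3 3.00
2 3 7 6.22
end
sort account date   // the real dataset is assumed to be sorted already

* Indicator for the line item of interest
generate byte desc2 = desc == 2

* Running maximum within each bill; max() ignores the missing
* value that desc2[_n-1] returns for the first line of a bill
by account date: replace desc2 = max(desc2, desc2[_n-1])

* The last running maximum is the bill-level maximum; copy it to every line
by account date: replace desc2 = desc2[_N]

list, sepby(account date)
```

On the example data this should mark the three lines of the bill for account 1, date 2 with desc2 = 1 and every other line with 0. It makes two passes over the data but adds no variable beyond desc2 itself.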

Mata already has a fast (built-in) function to find max and min in a
vector, which I could use on an st_view() of my dataset. But how do I
get that to work with the by: I perform in Stata?
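
A minimal sketch of one way the two might be combined, assuming Mata's panel functions (panelsetup() and panelsubview()) are used to stand in for by:. The function name panelmax() and the helper variable bill are hypothetical names introduced here for illustration. panelsetup() takes a single ID column, so the sketch first builds one group number per account-date pair, which costs an extra variable in memory.

```stata
* Toy copy of the example data
clear
input account date desc amount
1 1 1 5.95
1 1 3 2.94
1 2 1 5.95
1 2 2 9.45
1 2 3 3.00
2 3 7 6.22
end
sort account date   // the real dataset is assumed to be sorted already

* One group number per bill: 1, 2, 3, ...
by account date: generate long bill = (_n == 1)
replace bill = sum(bill)

generate byte desc2 = desc == 2

mata:
// Hypothetical helper: overwrite xvar with its maximum within each panel of idvar
void panelmax(string scalar idvar, string scalar xvar)
{
    real matrix id, x, info, SV
    real scalar i

    st_view(id, ., idvar)          // views: no copy of the data is made
    st_view(x,  ., xvar)
    info = panelsetup(id, 1)       // one row per panel: first and last observation

    for (i = 1; i <= rows(info); i++) {
        panelsubview(SV, x, i, info)           // view onto the rows of panel i
        SV[., .] = J(rows(SV), 1, max(SV))     // write the panel maximum back to Stata
    }
}

panelmax("bill", "desc2")
end

drop bill
list, sepby(account date)
```

The loop visits each panel once and calls max() on that panel's rows only, so no sort is performed inside Mata; the cost moves to the interpreted loop, which may or may not be faster than sorting when there are many small panels.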

William Schwartz
