Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Is there a way to use Mata to speed up within-group extrema search in Stata?


From   Robert Picard <[email protected]>
To   [email protected]
Subject   Re: st: Is there a way to use Mata to speed up within-group extrema search in Stata?
Date   Wed, 27 Jul 2011 16:34:23 -0400

You don't need to sort again to find the min or max within account
date groups. Here's an example of how to find the maximum:

* --------------------- begin example ---------------------
version 11
clear

* make up some data
set obs 1000000
set seed 12345
gen account = int(10000 * runiform())
gen date = int(100 * runiform())
gen desc = int(10 * runiform())
gen random = runiform()

* assume that the data is sorted by account date
sort account date random

set rmsg on

* identify account date groups with desc == 2
generate desc2 = desc == 2

* no additional sorting needed
clonevar mx = desc2
by account date: replace mx = mx[_n-1] if mx[_n-1] > mx & _n > 1
by account date: replace mx = mx[_N]

* using sort is slower
bysort account date (desc2): replace desc2 = desc2[_N]

assert desc2 == mx

* --------------------- end example -----------------------


On Wed, Jul 27, 2011 at 1:09 PM, Billy Schwartz <[email protected]> wrote:
> I'm wondering if there is a way to make finding max and min with -by-
> fast by using Mata. I tend to work with large datasets -- around 10gb
> per size -- big enough that many of the technicalities I wasn't
> supposed to worry about when I first started on Stata like variable
> datatypes, how frequently I read/write to disk, etc, really begin to
> matter. And I have noticed more and more that what I do with much of
> my time on Stata is waiting for Stata to finish sorting, usually so
> that I can find a minimum or maximum value. Stata has a really fast
> -sum()- function for use with -by:- but not an equivalent -max()-
> function, so you have to sort and select. Sorting algorithms, though
> fast, are not as fast as extrema-finding algorithms.
>
> For example, suppose I have panel data of bills by account and date,
> and each bill has a description code for each line item on the bill
> and an amount for each line item. Further, the dataset is sorted by
> account date
>
> account    date    desc    amount
> -----------------------------------------------
> 1              1         1         5.95
> 1              1         3         2.94
> 1              2         1         5.95
> 1              2         2         9.45
> 1              2         3         3.00
> 2              3         7         6.22
> [etc]
>
> If I want to identify bills that contain item with description value
> 2, the fastest, lowest-memory-overhead way I know to do it is
>
> . generate byte desc2 = desc == 2
> . bysort account date (desc2): replace desc2 = desc2[_N]
>
> If there were a max function that worked like the sum function (I'm
> not talking about the one Stata currently has, which doesn't work like
> this), I could avoid the sort, since as I said my data is already
> sorted by account date, and write merely:
>
> . by account date: generate bye desc2 = max(desc == 2)
>
> Mata already has a fast (built-in) function to find max and min in a
> vector, which I could use on an st_view() of my dataset. But how do I
> get that to work with the by: I perform in Stata?
>
> --
> William Schwartz
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index