Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: Is there a way to use Mata to speed up within-group extrema search in Stata?


From   Billy Schwartz <[email protected]>
To   [email protected]
Subject   Re: st: Is there a way to use Mata to speed up within-group extrema search in Stata?
Date   Thu, 28 Jul 2011 10:41:15 -0400

Thanks, Mrs. Cox and Picard. It's interesting that the egen = max()
and collapse commands use the sorting method (at least as of version
11.2).

I should add that in private e-mails Mr. Picard and I discussed the
following two more general methods for finding the maximum of a
variable that might have missing values. I haven't finished testing
them so I can't promise they work as I expect them to, but it captures
the same speed savings it gives the right result.

//introduce some missing values into desc
replace desc = . if desc == int(10 * runiform())

//first method (mine)
clonevar first = desc
by account date: replace first = first[_n-1] if first[_n-1] < . &
(first[_n-1] > first | first >= .)
by account date: replace first = first[_N]

//second (Mr. Picard)
clonevar dmax = desc
by account date: replace dmax = dmax[_n-1] if mi(dmax) & !mi(dmax[_n-1])
by account date: replace dmax = dmax[_n-1] if dmax[_n-1]>dmax & !mi(dmax[_n-1])
by account date: replace dmax = dmax[_N]

He also suggests using dmax<. rather than mi(dmax) to avoid function overhead.


On Thu, Jul 28, 2011 at 5:47 AM, Nick Cox <[email protected]> wrote:
>
> For the record, this is a little slower. Calling up -max(,)- has some
> overhead over doing it directly as Robert does. But it is still much
> faster than the method with re-sort-ing.
>
> * third method
> clonevar mx2 = desc2
> by account date : replace mx2 = max(mx2[_n-1], desc2)
> by account date: replace mx2 = mx2[_N]
>
> Nick
>
> On Wed, Jul 27, 2011 at 9:34 PM, Robert Picard <[email protected]> wrote:
> > You don't need to sort again to find the min or max within account
> > date groups. Here's an example of how to find the maximum:
> >
> > * --------------------- begin example ---------------------
> > version 11
> > clear
> >
> > * make up some data
> > set obs 1000000
> > set seed 12345
> > gen account = int(10000 * runiform())
> > gen date = int(100 * runiform())
> > gen desc = int(10 * runiform())
> > gen random = runiform()
> >
> > * assume that the data is sorted by account date
> > sort account date random
> >
> > set rmsg on
> >
> > * identify account date groups with desc == 2
> > generate desc2 = desc == 2
> >
> > * no additional sorting needed
> > clonevar mx = desc2
> > by account date: replace mx = mx[_n-1] if mx[_n-1] > mx & _n > 1
> > by account date: replace mx = mx[_N]
> >
> > * using sort is slower
> > bysort account date (desc2): replace desc2 = desc2[_N]
> >
> > assert desc2 == mx
> >
> > * --------------------- end example -----------------------
> >
> >
> > On Wed, Jul 27, 2011 at 1:09 PM, Billy Schwartz <[email protected]> wrote:
> >> I'm wondering if there is a way to make finding max and min with -by-
> >> fast by using Mata. I tend to work with large datasets -- around 10gb
> >> per size -- big enough that many of the technicalities I wasn't
> >> supposed to worry about when I first started on Stata like variable
> >> datatypes, how frequently I read/write to disk, etc, really begin to
> >> matter. And I have noticed more and more that what I do with much of
> >> my time on Stata is waiting for Stata to finish sorting, usually so
> >> that I can find a minimum or maximum value. Stata has a really fast
> >> -sum()- function for use with -by:- but not an equivalent -max()-
> >> function, so you have to sort and select. Sorting algorithms, though
> >> fast, are not as fast as extrema-finding algorithms.
> >>
> >> For example, suppose I have panel data of bills by account and date,
> >> and each bill has a description code for each line item on the bill
> >> and an amount for each line item. Further, the dataset is sorted by
> >> account date
> >>
> >> account    date    desc    amount
> >> -----------------------------------------------
> >> 1              1         1         5.95
> >> 1              1         3         2.94
> >> 1              2         1         5.95
> >> 1              2         2         9.45
> >> 1              2         3         3.00
> >> 2              3         7         6.22
> >> [etc]
> >>
> >> If I want to identify bills that contain item with description value
> >> 2, the fastest, lowest-memory-overhead way I know to do it is
> >>
> >> . generate byte desc2 = desc == 2
> >> . bysort account date (desc2): replace desc2 = desc2[_N]
> >>
> >> If there were a max function that worked like the sum function (I'm
> >> not talking about the one Stata currently has, which doesn't work like
> >> this), I could avoid the sort, since as I said my data is already
> >> sorted by account date, and write merely:
> >>
> >> . by account date: generate bye desc2 = max(desc == 2)
> >>
> >> Mata already has a fast (built-in) function to find max and min in a
> >> vector, which I could use on an st_view() of my dataset. But how do I
> >> get that to work with the by: I perform in Stata?
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index