Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down at the end of May, and its replacement, **statalist.org** is already up and running.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

From |
Billy Schwartz <wkschwartz@gmail.com> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: Is there a way to use Mata to speed up within-group extrema search in Stata? |

Date |
Thu, 28 Jul 2011 10:41:15 -0400 |

Thanks, Mrs. Cox and Picard. It's interesting that the egen = max() and collapse commands use the sorting method (at least as of version 11.2). I should add that in private e-mails Mr. Picard and I discussed the following two more general methods for finding the maximum of a variable that might have missing values. I haven't finished testing them so I can't promise they work as I expect them to, but it captures the same speed savings it gives the right result. //introduce some missing values into desc replace desc = . if desc == int(10 * runiform()) //first method (mine) clonevar first = desc by account date: replace first = first[_n-1] if first[_n-1] < . & (first[_n-1] > first | first >= .) by account date: replace first = first[_N] //second (Mr. Picard) clonevar dmax = desc by account date: replace dmax = dmax[_n-1] if mi(dmax) & !mi(dmax[_n-1]) by account date: replace dmax = dmax[_n-1] if dmax[_n-1]>dmax & !mi(dmax[_n-1]) by account date: replace dmax = dmax[_N] He also suggests using dmax<. rather than mi(dmax) to avoid function overhead. On Thu, Jul 28, 2011 at 5:47 AM, Nick Cox <njcoxstata@gmail.com> wrote: > > For the record, this is a little slower. Calling up -max(,)- has some > overhead over doing it directly as Robert does. But it is still much > faster than the method with re-sort-ing. > > * third method > clonevar mx2 = desc2 > by account date : replace mx2 = max(mx2[_n-1], desc2) > by account date: replace mx2 = mx2[_N] > > Nick > > On Wed, Jul 27, 2011 at 9:34 PM, Robert Picard <picard@netbox.com> wrote: > > You don't need to sort again to find the min or max within account > > date groups. Here's an example of how to find the maximum: > > > > * --------------------- begin example --------------------- > > version 11 > > clear > > > > * make up some data > > set obs 1000000 > > set seed 12345 > > gen account = int(10000 * runiform()) > > gen date = int(100 * runiform()) > > gen desc = int(10 * runiform()) > > gen random = runiform() > > > > * assume that the data is sorted by account date > > sort account date random > > > > set rmsg on > > > > * identify account date groups with desc == 2 > > generate desc2 = desc == 2 > > > > * no additional sorting needed > > clonevar mx = desc2 > > by account date: replace mx = mx[_n-1] if mx[_n-1] > mx & _n > 1 > > by account date: replace mx = mx[_N] > > > > * using sort is slower > > bysort account date (desc2): replace desc2 = desc2[_N] > > > > assert desc2 == mx > > > > * --------------------- end example ----------------------- > > > > > > On Wed, Jul 27, 2011 at 1:09 PM, Billy Schwartz <wkschwartz@gmail.com> wrote: > >> I'm wondering if there is a way to make finding max and min with -by- > >> fast by using Mata. I tend to work with large datasets -- around 10gb > >> per size -- big enough that many of the technicalities I wasn't > >> supposed to worry about when I first started on Stata like variable > >> datatypes, how frequently I read/write to disk, etc, really begin to > >> matter. And I have noticed more and more that what I do with much of > >> my time on Stata is waiting for Stata to finish sorting, usually so > >> that I can find a minimum or maximum value. Stata has a really fast > >> -sum()- function for use with -by:- but not an equivalent -max()- > >> function, so you have to sort and select. Sorting algorithms, though > >> fast, are not as fast as extrema-finding algorithms. > >> > >> For example, suppose I have panel data of bills by account and date, > >> and each bill has a description code for each line item on the bill > >> and an amount for each line item. Further, the dataset is sorted by > >> account date > >> > >> account date desc amount > >> ----------------------------------------------- > >> 1 1 1 5.95 > >> 1 1 3 2.94 > >> 1 2 1 5.95 > >> 1 2 2 9.45 > >> 1 2 3 3.00 > >> 2 3 7 6.22 > >> [etc] > >> > >> If I want to identify bills that contain item with description value > >> 2, the fastest, lowest-memory-overhead way I know to do it is > >> > >> . generate byte desc2 = desc == 2 > >> . bysort account date (desc2): replace desc2 = desc2[_N] > >> > >> If there were a max function that worked like the sum function (I'm > >> not talking about the one Stata currently has, which doesn't work like > >> this), I could avoid the sort, since as I said my data is already > >> sorted by account date, and write merely: > >> > >> . by account date: generate bye desc2 = max(desc == 2) > >> > >> Mata already has a fast (built-in) function to find max and min in a > >> vector, which I could use on an st_view() of my dataset. But how do I > >> get that to work with the by: I perform in Stata? > > * > * For searches and help try: > * http://www.stata.com/help.cgi?search > * http://www.stata.com/support/statalist/faq > * http://www.ats.ucla.edu/stat/stata/ * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

**References**:**st: Is there a way to use Mata to speed up within-group extrema search in Stata?***From:*Billy Schwartz <wkschwartz@gmail.com>

**Re: st: Is there a way to use Mata to speed up within-group extrema search in Stata?***From:*Robert Picard <picard@netbox.com>

**Re: st: Is there a way to use Mata to speed up within-group extrema search in Stata?***From:*Nick Cox <njcoxstata@gmail.com>

- Prev by Date:
**Re: AW: st: Cox Competing Risks Model (Cox CRM)** - Next by Date:
**RE: st: RE: Use 2 variables to gen 10 new variables** - Previous by thread:
**Re: st: Is there a way to use Mata to speed up within-group extrema search in Stata?** - Next by thread:
**st: Use 2 variables to gen 10 new variables** - Index(es):