Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Joe Canner <jcanner1@jhmi.edu> |
To | "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu> |
Subject | RE: st: A faster way to gsort |
Date | Wed, 12 Mar 2014 13:06:49 +0000 |
I find it very odd that Stata would provide a state-of-the-art efficient method for ascending sorts but would not be able to provide an equally efficient method for descending sorts. (I am not doubting your benchmark, Maarten; I found a 3-fold difference in one of my large datasets.) Generating a new variable containing -x and sorting on that works fine for numeric variables, but not for string variables. Is there an equivalent method for strings? Would reverse() do the trick? Can anyone explain why -gsort- does not use an efficient method? If nothing else, -gsort- could use Maarten's method to achieve a significant performance improvement. Joe Canner -----Original Message----- From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Maarten Buis Sent: Wednesday, March 12, 2014 4:07 AM To: statalist@hsphsun2.harvard.edu Subject: Re: st: A faster way to gsort On Tue, Mar 11, 2014 at 4:49 PM, Andrew Maurer wrote: > I've been finding that sorting is taking up a very significant portion of computation time in code that I'm writing (eg for 6hrs run time, 5.5hrs is spent sorting). I am looking for ways to do better than Stata's sort and gsort commands. > > I just read about mata's use of permutation vectors (eg -help mata permutation-) and found that in at least the case below, they can be used for reverse sorting much faster than stata's -gsort-. gsort was 10x slower The fact that sorting takes time is a well known problem in programming in general, and Stata is no exception. A lot of effort has been spent in devoloping smart algorithms that sort faster. Knowing StataCorp, they are up to date on that literature and I suspect you will have a hard time implementing something that is faster than -sort-. The -gsort- command is a different story: I see -gsort- as a "convience command", which will sacrifice speed for ease of use, while I see -sort- as a "mean lean sorting machine". You can speed up your reverse sorting by first creating a new variable containing -x and then sorting on that new variable, as you can see in the example below. *------------------ begin example ------------------ clear all program revsort // caller program for mata revsort() tempvar id gen long `id' = _n mata : revsort("`id'") end mata void revsort(string scalar id) // "id" should be dataset's current sort index // revsort() just reverses the dataset's order { st_view(V=.,.,.) idpos = st_varindex(id) V[.,.] = V[revorder(V[.,idpos]),.] } end ************* End Define Programs ******* ************* Run Benchmark ************* /* benchmarking gsort vs "revsort" (mata permutation method using revorder()) */ tempfile temp set obs `=1e5' gen x = runiform() save `temp' forval i = 1/20 { use `temp', clear timer on 1 gsort -x timer off 1 } forval i = 1/20 { use `temp', clear timer on 2 sort x revsort timer off 2 } forval i = 1/20 { use `temp', clear timer on 3 gen minusx = -x sort minus x timer off 3 } timer list ************* End Benchmark ************* *------------------- end example ------------------- * (For more on examples I sent to the Statalist see: * http://www.maartenbuis.nl/example_faq ) --------------------------------- Maarten L. Buis WZB Reichpietschufer 50 10785 Berlin Germany http://www.maartenbuis.nl --------------------------------- * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/ * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/