Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: st: A faster way to gsort


From   Joe Canner <[email protected]>
To   "[email protected]" <[email protected]>
Subject   RE: st: A faster way to gsort
Date   Wed, 12 Mar 2014 15:45:53 +0000

Good point.  Yes, -gsort- is not very long and calls -sort- several times, but there are some gen/replace statements involved in reversing the order that take a long time in large data sets.

Stata seems to be more concerned about performance in large data sets these days, so they might be receptive to improving -gsort-.  Something to bring up at the next Stata Conference, if not before...

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Nick Cox
Sent: Wednesday, March 12, 2014 11:38 AM
To: [email protected]
Subject: Re: st: A faster way to gsort

On the face of it

gen seq = -_n

is sufficient for reversing order, but watch out: in big datasets that
identifier needs to be -long-.

-gsort- is an ancient command written as an .ado. That said, most of
its work is done by -sort-, which is compiled C code, or so I believe.
(I've never seen it, but I have every reason to suppose it to exist.)
-sort- has been revisited several times over the lifetime of -gsort-.

I think if people -- especially those with really big datasets -- can
make a case that they are being held up by slow descending sorts, then
StataCorp might be tempted to rewrite -gsort-, but negate first is the
answer for now.
Nick
[email protected]


On 12 March 2014 15:25, Joe Canner <[email protected]> wrote:
> I think I was wrong about -gsort-.  It looks like it sorts everything in ascending order first and then basically reverses the order of the resulting data set.  With only a single variable, one can do this without -gsort-:
>
> sort KEY
> gen seq=(_N-_n)+1
> sort seq
>
> In my benchmark, this takes less than half of the time compared to -gsort-.  This method also solves the missing value collating problem mentioned previously and will also work with strings.  However, it gets somewhat more complicated with multiple sorting variables.  Since -gsort- has to account for these situations, there is a lot of overhead involved.
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Joe Canner
> Sent: Wednesday, March 12, 2014 10:01 AM
> To: [email protected]
> Subject: RE: st: A faster way to gsort
>
> In case anybody is still interested...
>
> I was able to answer several of my own questions on this issue:
>
> 1. Using -reverse()- on a string variable does *not* make it possible to sort the original variable in descending order.
>
> 2. It appears that -gsort- does, in fact, do something like what Maarten suggested.  However, it also has to do some  housekeeping after calling -sort-.  I don't follow all of what is going on, but it appears to be partly associated with the need to make sure that missing values are at the top of the list after a descending sort.  This is a useful thing to keep in mind if one is using Maarten's suggested workaround.  Using -sort- on -x will still leave missing values at the bottom of the list, which is contrary to Stata's collating sequence.
>
> All that aside, I remain puzzled as to why -sort- cannot handle descending sorts (or, to put it another way, why there is not a super-efficient way to do descending sorts).  Are the "smart" algorithms really that sensitive to the difference between ascending and descending?
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Joe Canner
> Sent: Wednesday, March 12, 2014 9:07 AM
> To: [email protected]
> Subject: RE: st: A faster way to gsort
>
> I find it very odd that Stata would provide a state-of-the-art efficient method for ascending sorts but would not be able to provide an equally efficient method for descending sorts.  (I am not doubting your benchmark, Maarten; I found a 3-fold difference in one of my large datasets.)
>
> Generating a new variable containing -x and sorting on that works fine for numeric variables, but not for string variables.  Is there an equivalent method for strings?  Would reverse() do the trick?
>
> Can anyone explain why -gsort- does not use an efficient method?  If nothing else, -gsort- could use Maarten's method to achieve a significant performance improvement.
>
> Joe Canner
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Maarten Buis
> Sent: Wednesday, March 12, 2014 4:07 AM
> To: [email protected]
> Subject: Re: st: A faster way to gsort
>
> On Tue, Mar 11, 2014 at 4:49 PM, Andrew Maurer wrote:
>> I've been finding that sorting is taking up a very significant portion of computation time in code that I'm writing (eg for 6hrs run time, 5.5hrs is spent sorting). I am looking for ways to do better than Stata's sort and gsort commands.
>>
>> I just read about mata's use of permutation vectors (eg -help mata permutation-) and found that in at least the case below, they can be used for reverse sorting much faster than stata's -gsort-. gsort was 10x slower
>
> The fact that sorting takes time is a well known problem in
> programming in general, and Stata is no exception. A lot of effort has
> been spent in devoloping smart algorithms that sort faster. Knowing
> StataCorp, they are up to date on that literature and I suspect you
> will have a hard time implementing something that is faster than
> -sort-. The -gsort- command is a different story: I see -gsort- as a
> "convience command", which will sacrifice speed for ease of use, while
> I see -sort- as a "mean lean sorting machine". You can speed up your
> reverse sorting by first creating a new variable containing -x and
> then sorting on that new variable, as you can see in the example
> below.
>
> *------------------ begin example ------------------
> clear all
>
> program revsort
>         // caller program for mata revsort()
>         tempvar id
>         gen long `id' = _n
>         mata : revsort("`id'")
> end
>
> mata
>         void revsort(string scalar id)
>         // "id" should be dataset's current sort index
>         // revsort() just reverses the dataset's order
>         {
>                 st_view(V=.,.,.)
>
>                 idpos = st_varindex(id)
>                 V[.,.] = V[revorder(V[.,idpos]),.]
>         }
> end
> ************* End Define Programs *******
>
> ************* Run Benchmark *************
> /*
> benchmarking gsort vs "revsort"
> (mata permutation method using revorder())
> */
> tempfile temp
> set obs `=1e5'
> gen x = runiform()
> save `temp'
>
>
> forval i = 1/20 {
>     use `temp', clear
>
>     timer on 1
>     gsort -x
>     timer off 1
> }
>
>
> forval i = 1/20 {
>     use `temp', clear
>
>     timer on 2
>     sort x
>     revsort
>     timer off 2
> }
>
> forval i = 1/20 {
>     use `temp', clear
>
>     timer on 3
>     gen minusx = -x
>     sort minus x
>     timer off 3
> }
>
> timer list
> ************* End Benchmark *************
> *------------------- end example -------------------
> * (For more on examples I sent to the Statalist see:
> * http://www.maartenbuis.nl/example_faq )
>
>
> ---------------------------------
> Maarten L. Buis
> WZB
> Reichpietschufer 50
> 10785 Berlin
> Germany
>
> http://www.maartenbuis.nl
> ---------------------------------
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index