Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: st: A faster way to gsort


From   Joe Canner <[email protected]>
To   "[email protected]" <[email protected]>
Subject   RE: st: A faster way to gsort
Date   Wed, 12 Mar 2014 14:01:03 +0000

In case anybody is still interested...

I was able to answer several of my own questions on this issue:

1. Using -reverse()- on a string variable does *not* make it possible to sort the original variable in descending order.

2. It appears that -gsort- does, in fact, do something like what Maarten suggested.  However, it also has to do some  housekeeping after calling -sort-.  I don't follow all of what is going on, but it appears to be partly associated with the need to make sure that missing values are at the top of the list after a descending sort.  This is a useful thing to keep in mind if one is using Maarten's suggested workaround.  Using -sort- on -x will still leave missing values at the bottom of the list, which is contrary to Stata's collating sequence.

All that aside, I remain puzzled as to why -sort- cannot handle descending sorts (or, to put it another way, why there is not a super-efficient way to do descending sorts).  Are the "smart" algorithms really that sensitive to the difference between ascending and descending?

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Joe Canner
Sent: Wednesday, March 12, 2014 9:07 AM
To: [email protected]
Subject: RE: st: A faster way to gsort

I find it very odd that Stata would provide a state-of-the-art efficient method for ascending sorts but would not be able to provide an equally efficient method for descending sorts.  (I am not doubting your benchmark, Maarten; I found a 3-fold difference in one of my large datasets.)

Generating a new variable containing -x and sorting on that works fine for numeric variables, but not for string variables.  Is there an equivalent method for strings?  Would reverse() do the trick?

Can anyone explain why -gsort- does not use an efficient method?  If nothing else, -gsort- could use Maarten's method to achieve a significant performance improvement.

Joe Canner

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Maarten Buis
Sent: Wednesday, March 12, 2014 4:07 AM
To: [email protected]
Subject: Re: st: A faster way to gsort

On Tue, Mar 11, 2014 at 4:49 PM, Andrew Maurer wrote:
> I've been finding that sorting is taking up a very significant portion of computation time in code that I'm writing (eg for 6hrs run time, 5.5hrs is spent sorting). I am looking for ways to do better than Stata's sort and gsort commands.
>
> I just read about mata's use of permutation vectors (eg -help mata permutation-) and found that in at least the case below, they can be used for reverse sorting much faster than stata's -gsort-. gsort was 10x slower

The fact that sorting takes time is a well known problem in
programming in general, and Stata is no exception. A lot of effort has
been spent in devoloping smart algorithms that sort faster. Knowing
StataCorp, they are up to date on that literature and I suspect you
will have a hard time implementing something that is faster than
-sort-. The -gsort- command is a different story: I see -gsort- as a
"convience command", which will sacrifice speed for ease of use, while
I see -sort- as a "mean lean sorting machine". You can speed up your
reverse sorting by first creating a new variable containing -x and
then sorting on that new variable, as you can see in the example
below.

*------------------ begin example ------------------
clear all

program revsort
        // caller program for mata revsort()
        tempvar id
        gen long `id' = _n
        mata : revsort("`id'")
end

mata
        void revsort(string scalar id)
        // "id" should be dataset's current sort index
        // revsort() just reverses the dataset's order
        {
                st_view(V=.,.,.)

                idpos = st_varindex(id)
                V[.,.] = V[revorder(V[.,idpos]),.]
        }
end
************* End Define Programs *******

************* Run Benchmark *************
/*
benchmarking gsort vs "revsort"
(mata permutation method using revorder())
*/
tempfile temp
set obs `=1e5'
gen x = runiform()
save `temp'


forval i = 1/20 {
    use `temp', clear

    timer on 1
    gsort -x
    timer off 1
}


forval i = 1/20 {
    use `temp', clear

    timer on 2
    sort x
    revsort
    timer off 2
}

forval i = 1/20 {
    use `temp', clear

    timer on 3
    gen minusx = -x
    sort minus x
    timer off 3
}

timer list
************* End Benchmark *************
*------------------- end example -------------------
* (For more on examples I sent to the Statalist see:
* http://www.maartenbuis.nl/example_faq )


---------------------------------
Maarten L. Buis
WZB
Reichpietschufer 50
10785 Berlin
Germany

http://www.maartenbuis.nl
---------------------------------
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index