Stata: Data Analysis and Statistical Software

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: st: A faster way to gsort

From	Joe Canner <[email protected]>
To	"[email protected]" <[email protected]>
Subject	RE: st: A faster way to gsort
Date	Wed, 12 Mar 2014 13:06:49 +0000

I find it very odd that Stata would provide a state-of-the-art efficient method for ascending sorts but would not be able to provide an equally efficient method for descending sorts.  (I am not doubting your benchmark, Maarten; I found a 3-fold difference in one of my large datasets.)

Generating a new variable containing -x and sorting on that works fine for numeric variables, but not for string variables.  Is there an equivalent method for strings?  Would reverse() do the trick?

Can anyone explain why -gsort- does not use an efficient method?  If nothing else, -gsort- could use Maarten's method to achieve a significant performance improvement.

Joe Canner

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Maarten Buis
Sent: Wednesday, March 12, 2014 4:07 AM
To: [email protected]
Subject: Re: st: A faster way to gsort

On Tue, Mar 11, 2014 at 4:49 PM, Andrew Maurer wrote:
> I've been finding that sorting is taking up a very significant portion of computation time in code that I'm writing (eg for 6hrs run time, 5.5hrs is spent sorting). I am looking for ways to do better than Stata's sort and gsort commands.
>
> I just read about mata's use of permutation vectors (eg -help mata permutation-) and found that in at least the case below, they can be used for reverse sorting much faster than stata's -gsort-. gsort was 10x slower

The fact that sorting takes time is a well known problem in
programming in general, and Stata is no exception. A lot of effort has
been spent in devoloping smart algorithms that sort faster. Knowing
StataCorp, they are up to date on that literature and I suspect you
will have a hard time implementing something that is faster than
-sort-. The -gsort- command is a different story: I see -gsort- as a
"convience command", which will sacrifice speed for ease of use, while
I see -sort- as a "mean lean sorting machine". You can speed up your
reverse sorting by first creating a new variable containing -x and
then sorting on that new variable, as you can see in the example
below.

*------------------ begin example ------------------
clear all

program revsort
        // caller program for mata revsort()
        tempvar id
        gen long `id' = _n
        mata : revsort("`id'")
end

mata
        void revsort(string scalar id)
        // "id" should be dataset's current sort index
        // revsort() just reverses the dataset's order
        {
                st_view(V=.,.,.)

                idpos = st_varindex(id)
                V[.,.] = V[revorder(V[.,idpos]),.]
        }
end
************* End Define Programs *******

************* Run Benchmark *************
/*
benchmarking gsort vs "revsort"
(mata permutation method using revorder())
*/
tempfile temp
set obs `=1e5'
gen x = runiform()
save `temp'


forval i = 1/20 {
    use `temp', clear

    timer on 1
    gsort -x
    timer off 1
}


forval i = 1/20 {
    use `temp', clear

    timer on 2
    sort x
    revsort
    timer off 2
}

forval i = 1/20 {
    use `temp', clear

    timer on 3
    gen minusx = -x
    sort minus x
    timer off 3
}

timer list
************* End Benchmark *************
*------------------- end example -------------------
* (For more on examples I sent to the Statalist see:
* http://www.maartenbuis.nl/example_faq )


---------------------------------
Maarten L. Buis
WZB
Reichpietschufer 50
10785 Berlin
Germany

http://www.maartenbuis.nl
---------------------------------
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

Follow-Ups:
- RE: st: A faster way to gsort
  - From: Joe Canner <[email protected]>

References:
- st: A faster way to gsort
  - From: Andrew Maurer <[email protected]>
- Re: st: A faster way to gsort
  - From: Maarten Buis <[email protected]>

Prev by Date: Re: st: first stage results using ivregress with vce(cluster)
Next by Date: Re: st: Why is the legend in xtline limited to 2 to 15?
Previous by thread: Re: st: A faster way to gsort
Next by thread: RE: st: A faster way to gsort
Index(es):
- Date
- Thread