Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

large data sets (was st: A faster way to gsort)


From   Jeph Herrin <[email protected]>
To   [email protected]
Subject   large data sets (was st: A faster way to gsort)
Date   Wed, 12 Mar 2014 15:35:17 -0400

Joe claims that Stata seems to be more concerned about performance in large data sets these days, and I would like to comment (for those at StataCorp who are paying attention, and anyone else who might concur or advise) that for me this is at the very top of my wish list for Stata 14.

I am currently working on a series of projects where I have started to use SAS almost exclusively for most of the data management, because of the following Stata limitations:

1. Stata doesn't do well with large datasets. My datasets are on the order of 10gb; I have 24gb of RAM, but if I load a 6gb data set and do -merge- or -bysort- it quickly hits the 24gb limit and slows to a crawl. When the option is several hours vs 15 minutes in SAS, I have a strong incentive to use SAS.

2. Stata does not submit SQL well. I can submit the same query to the same database via SAS, SQL client software, and Stata, and while it works in the first two, about one in ten times Stata will simply hang and do nothing. Yesterday I waited 3 hours for Stata to return a short table (82 records) before pasting the same query into SAS and retrieving it in several minutes.

3. Stata does not do native SQL at all. That is, I would like to be able to use Stata files as tables in combination with queries to external SQL servers. SAS supports this, and this allows me to build an analytic dataset incrementally, something which is critical when using 'big data' - if I had to create the dataset at all at once, it would be on the order of 300gb.


cheers,
Jeph



On 3/12/2014 11:45 AM, Joe Canner wrote:
Good point.  Yes, -gsort- is not very long and calls -sort- several times, but there are some gen/replace statements involved in reversing the order that take a long time in large data sets.

Stata seems to be more concerned about performance in large data sets these days, so they might be receptive to improving -gsort-.  Something to bring up at the next Stata Conference, if not before...

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Nick Cox
Sent: Wednesday, March 12, 2014 11:38 AM
To: [email protected]
Subject: Re: st: A faster way to gsort

On the face of it

gen seq = -_n

is sufficient for reversing order, but watch out: in big datasets that
identifier needs to be -long-.

-gsort- is an ancient command written as an .ado. That said, most of
its work is done by -sort-, which is compiled C code, or so I believe.
(I've never seen it, but I have every reason to suppose it to exist.)
-sort- has been revisited several times over the lifetime of -gsort-.

I think if people -- especially those with really big datasets -- can
make a case that they are being held up by slow descending sorts, then
StataCorp might be tempted to rewrite -gsort-, but negate first is the
answer for now.
Nick
[email protected]


On 12 March 2014 15:25, Joe Canner <[email protected]> wrote:
I think I was wrong about -gsort-.  It looks like it sorts everything in ascending order first and then basically reverses the order of the resulting data set.  With only a single variable, one can do this without -gsort-:

sort KEY
gen seq=(_N-_n)+1
sort seq

In my benchmark, this takes less than half of the time compared to -gsort-.  This method also solves the missing value collating problem mentioned previously and will also work with strings.  However, it gets somewhat more complicated with multiple sorting variables.  Since -gsort- has to account for these situations, there is a lot of overhead involved.

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Joe Canner
Sent: Wednesday, March 12, 2014 10:01 AM
To: [email protected]
Subject: RE: st: A faster way to gsort

In case anybody is still interested...

I was able to answer several of my own questions on this issue:

1. Using -reverse()- on a string variable does *not* make it possible to sort the original variable in descending order.

2. It appears that -gsort- does, in fact, do something like what Maarten suggested.  However, it also has to do some  housekeeping after calling -sort-.  I don't follow all of what is going on, but it appears to be partly associated with the need to make sure that missing values are at the top of the list after a descending sort.  This is a useful thing to keep in mind if one is using Maarten's suggested workaround.  Using -sort- on -x will still leave missing values at the bottom of the list, which is contrary to Stata's collating sequence.

All that aside, I remain puzzled as to why -sort- cannot handle descending sorts (or, to put it another way, why there is not a super-efficient way to do descending sorts).  Are the "smart" algorithms really that sensitive to the difference between ascending and descending?

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Joe Canner
Sent: Wednesday, March 12, 2014 9:07 AM
To: [email protected]
Subject: RE: st: A faster way to gsort

I find it very odd that Stata would provide a state-of-the-art efficient method for ascending sorts but would not be able to provide an equally efficient method for descending sorts.  (I am not doubting your benchmark, Maarten; I found a 3-fold difference in one of my large datasets.)

Generating a new variable containing -x and sorting on that works fine for numeric variables, but not for string variables.  Is there an equivalent method for strings?  Would reverse() do the trick?

Can anyone explain why -gsort- does not use an efficient method?  If nothing else, -gsort- could use Maarten's method to achieve a significant performance improvement.

Joe Canner

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Maarten Buis
Sent: Wednesday, March 12, 2014 4:07 AM
To: [email protected]
Subject: Re: st: A faster way to gsort

On Tue, Mar 11, 2014 at 4:49 PM, Andrew Maurer wrote:
I've been finding that sorting is taking up a very significant portion of computation time in code that I'm writing (eg for 6hrs run time, 5.5hrs is spent sorting). I am looking for ways to do better than Stata's sort and gsort commands.

I just read about mata's use of permutation vectors (eg -help mata permutation-) and found that in at least the case below, they can be used for reverse sorting much faster than stata's -gsort-. gsort was 10x slower

The fact that sorting takes time is a well known problem in
programming in general, and Stata is no exception. A lot of effort has
been spent in devoloping smart algorithms that sort faster. Knowing
StataCorp, they are up to date on that literature and I suspect you
will have a hard time implementing something that is faster than
-sort-. The -gsort- command is a different story: I see -gsort- as a
"convience command", which will sacrifice speed for ease of use, while
I see -sort- as a "mean lean sorting machine". You can speed up your
reverse sorting by first creating a new variable containing -x and
then sorting on that new variable, as you can see in the example
below.

*------------------ begin example ------------------
clear all

program revsort
         // caller program for mata revsort()
         tempvar id
         gen long `id' = _n
         mata : revsort("`id'")
end

mata
         void revsort(string scalar id)
         // "id" should be dataset's current sort index
         // revsort() just reverses the dataset's order
         {
                 st_view(V=.,.,.)

                 idpos = st_varindex(id)
                 V[.,.] = V[revorder(V[.,idpos]),.]
         }
end
************* End Define Programs *******

************* Run Benchmark *************
/*
benchmarking gsort vs "revsort"
(mata permutation method using revorder())
*/
tempfile temp
set obs `=1e5'
gen x = runiform()
save `temp'


forval i = 1/20 {
     use `temp', clear

     timer on 1
     gsort -x
     timer off 1
}


forval i = 1/20 {
     use `temp', clear

     timer on 2
     sort x
     revsort
     timer off 2
}

forval i = 1/20 {
     use `temp', clear

     timer on 3
     gen minusx = -x
     sort minus x
     timer off 3
}

timer list
************* End Benchmark *************
*------------------- end example -------------------
* (For more on examples I sent to the Statalist see:
* http://www.maartenbuis.nl/example_faq )


---------------------------------
Maarten L. Buis
WZB
Reichpietschufer 50
10785 Berlin
Germany

http://www.maartenbuis.nl
---------------------------------
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index