Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Jeph Herrin <info@flyingbuttress.net> |
To | statalist@hsphsun2.harvard.edu |
Subject | large data sets (was st: A faster way to gsort) |
Date | Wed, 12 Mar 2014 15:35:17 -0400 |
I am currently working on a series of projects where I have started to use SAS almost exclusively for most of the data management, because of the following Stata limitations:
1. Stata doesn't do well with large datasets. My datasets are on the order of 10gb; I have 24gb of RAM, but if I load a 6gb data set and do -merge- or -bysort- it quickly hits the 24gb limit and slows to a crawl. When the option is several hours vs 15 minutes in SAS, I have a strong incentive to use SAS.
2. Stata does not submit SQL well. I can submit the same query to the same database via SAS, SQL client software, and Stata, and while it works in the first two, about one in ten times Stata will simply hang and do nothing. Yesterday I waited 3 hours for Stata to return a short table (82 records) before pasting the same query into SAS and retrieving it in several minutes.
3. Stata does not do native SQL at all. That is, I would like to be able to use Stata files as tables in combination with queries to external SQL servers. SAS supports this, and this allows me to build an analytic dataset incrementally, something which is critical when using 'big data' - if I had to create the dataset at all at once, it would be on the order of 300gb.
cheers, Jeph On 3/12/2014 11:45 AM, Joe Canner wrote:
Good point. Yes, -gsort- is not very long and calls -sort- several times, but there are some gen/replace statements involved in reversing the order that take a long time in large data sets. Stata seems to be more concerned about performance in large data sets these days, so they might be receptive to improving -gsort-. Something to bring up at the next Stata Conference, if not before... -----Original Message----- From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Nick Cox Sent: Wednesday, March 12, 2014 11:38 AM To: statalist@hsphsun2.harvard.edu Subject: Re: st: A faster way to gsort On the face of it gen seq = -_n is sufficient for reversing order, but watch out: in big datasets that identifier needs to be -long-. -gsort- is an ancient command written as an .ado. That said, most of its work is done by -sort-, which is compiled C code, or so I believe. (I've never seen it, but I have every reason to suppose it to exist.) -sort- has been revisited several times over the lifetime of -gsort-. I think if people -- especially those with really big datasets -- can make a case that they are being held up by slow descending sorts, then StataCorp might be tempted to rewrite -gsort-, but negate first is the answer for now. Nick njcoxstata@gmail.com On 12 March 2014 15:25, Joe Canner <jcanner1@jhmi.edu> wrote:I think I was wrong about -gsort-. It looks like it sorts everything in ascending order first and then basically reverses the order of the resulting data set. With only a single variable, one can do this without -gsort-: sort KEY gen seq=(_N-_n)+1 sort seq In my benchmark, this takes less than half of the time compared to -gsort-. This method also solves the missing value collating problem mentioned previously and will also work with strings. However, it gets somewhat more complicated with multiple sorting variables. Since -gsort- has to account for these situations, there is a lot of overhead involved. -----Original Message----- From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Joe Canner Sent: Wednesday, March 12, 2014 10:01 AM To: statalist@hsphsun2.harvard.edu Subject: RE: st: A faster way to gsort In case anybody is still interested... I was able to answer several of my own questions on this issue: 1. Using -reverse()- on a string variable does *not* make it possible to sort the original variable in descending order. 2. It appears that -gsort- does, in fact, do something like what Maarten suggested. However, it also has to do some housekeeping after calling -sort-. I don't follow all of what is going on, but it appears to be partly associated with the need to make sure that missing values are at the top of the list after a descending sort. This is a useful thing to keep in mind if one is using Maarten's suggested workaround. Using -sort- on -x will still leave missing values at the bottom of the list, which is contrary to Stata's collating sequence. All that aside, I remain puzzled as to why -sort- cannot handle descending sorts (or, to put it another way, why there is not a super-efficient way to do descending sorts). Are the "smart" algorithms really that sensitive to the difference between ascending and descending? -----Original Message----- From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Joe Canner Sent: Wednesday, March 12, 2014 9:07 AM To: statalist@hsphsun2.harvard.edu Subject: RE: st: A faster way to gsort I find it very odd that Stata would provide a state-of-the-art efficient method for ascending sorts but would not be able to provide an equally efficient method for descending sorts. (I am not doubting your benchmark, Maarten; I found a 3-fold difference in one of my large datasets.) Generating a new variable containing -x and sorting on that works fine for numeric variables, but not for string variables. Is there an equivalent method for strings? Would reverse() do the trick? Can anyone explain why -gsort- does not use an efficient method? If nothing else, -gsort- could use Maarten's method to achieve a significant performance improvement. Joe Canner -----Original Message----- From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Maarten Buis Sent: Wednesday, March 12, 2014 4:07 AM To: statalist@hsphsun2.harvard.edu Subject: Re: st: A faster way to gsort On Tue, Mar 11, 2014 at 4:49 PM, Andrew Maurer wrote:I've been finding that sorting is taking up a very significant portion of computation time in code that I'm writing (eg for 6hrs run time, 5.5hrs is spent sorting). I am looking for ways to do better than Stata's sort and gsort commands. I just read about mata's use of permutation vectors (eg -help mata permutation-) and found that in at least the case below, they can be used for reverse sorting much faster than stata's -gsort-. gsort was 10x slowerThe fact that sorting takes time is a well known problem in programming in general, and Stata is no exception. A lot of effort has been spent in devoloping smart algorithms that sort faster. Knowing StataCorp, they are up to date on that literature and I suspect you will have a hard time implementing something that is faster than -sort-. The -gsort- command is a different story: I see -gsort- as a "convience command", which will sacrifice speed for ease of use, while I see -sort- as a "mean lean sorting machine". You can speed up your reverse sorting by first creating a new variable containing -x and then sorting on that new variable, as you can see in the example below. *------------------ begin example ------------------ clear all program revsort // caller program for mata revsort() tempvar id gen long `id' = _n mata : revsort("`id'") end mata void revsort(string scalar id) // "id" should be dataset's current sort index // revsort() just reverses the dataset's order { st_view(V=.,.,.) idpos = st_varindex(id) V[.,.] = V[revorder(V[.,idpos]),.] } end ************* End Define Programs ******* ************* Run Benchmark ************* /* benchmarking gsort vs "revsort" (mata permutation method using revorder()) */ tempfile temp set obs `=1e5' gen x = runiform() save `temp' forval i = 1/20 { use `temp', clear timer on 1 gsort -x timer off 1 } forval i = 1/20 { use `temp', clear timer on 2 sort x revsort timer off 2 } forval i = 1/20 { use `temp', clear timer on 3 gen minusx = -x sort minus x timer off 3 } timer list ************* End Benchmark ************* *------------------- end example ------------------- * (For more on examples I sent to the Statalist see: * http://www.maartenbuis.nl/example_faq ) --------------------------------- Maarten L. Buis WZB Reichpietschufer 50 10785 Berlin Germany http://www.maartenbuis.nl --------------------------------- * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/ * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/ * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/ * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/* * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/ * * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/
* * For searches and help try: * http://www.stata.com/help.cgi?search * http://www.stata.com/support/faqs/resources/statalist-faq/ * http://www.ats.ucla.edu/stat/stata/