Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: large data sets (was st: A faster way to gsort)


From   Joe Canner <[email protected]>
To   "[email protected]" <[email protected]>
Subject   RE: large data sets (was st: A faster way to gsort)
Date   Thu, 13 Mar 2014 15:45:28 +0000

Andrew,

I tried to reproduce your situation with a big data set that I use, and I found a 25% reduction in time when the data set had been previously sorted using some of the same variables, as opposed to starting with a random sort.  

However, I can easily imagine that this depends very much on how your data are structured.  In your example (sorting by country and region and then later by region), if the number of regions per country is small, then less sorting needs to be done, and vice versa.  In the extreme, if there is one region per country, then your data set needs almost no sorting, while if each observation has a different region, you are not much better off than starting from a random order.

Perhaps you are asking (as was discussed in your thread from last Thursday) why Stata can't use -sortedby- information to guide a sort.  I don't know enough about sorting algorithms to know how this might work (if at all), but I'm quite sure that any normal sort algorithm will take less time if the data set is already (partially) sorted. It makes logical sense based on how these algorithms usually work and  I verified it above trying to reproduce your situation, as well as verifying it last week by turning off the -sortedby- information (by changing some values but not changing the sort order) before sorting.

Unfortunately, this doesn't necessarily help you much, but perhaps you can experiment with how you order your tasks to get better efficiency.  Or, perhaps you can, for example, redesign tasks so that a different sort order will accomplish the same thing (e.g., do you really need mean age by country, region, and year; or will region, country, and year work?).

Regards,
Joe

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Andrew Maurer
Sent: Thursday, March 13, 2014 11:09 AM
To: [email protected]
Subject: RE: large data sets (was st: A faster way to gsort)

Thanks for all the response to this thread.

Maarten,
I'm sure that Stata's sort algorithm is very strong in terms of sorting random data. In my case, in 5 hours spent sorting, the time is broken down into approximately 5 1-hour sorts.  Although I don't know what's going on behind the scenes, it seems to me like Stata is not making use of previous sort order when sorting.

For example, my data might be panel data, including: country, region, person_id, year, age, and birthyear (birthyear = year - age). I want to get some summary information of the data, say:
1) the average age of a person by country, region, and year
2) the number of distinct people observed in each country
3) the number of people who transitioned between countries each year, by country and year
etc

eg, I need to change between 3 sort orders:
1) sort country region year
2) sort region person_id
3) sort person_id year (to first tag the observations where person_id == person_id[_n-1] & country != country[_n-1]), then sort country year (to do the collapse)

I would think that there should be a way to use previous sort order to make the current sort more efficient. In practice, using timers in my code, it appears that the time it takes to do a sort is the same whether the data was sorted in a similar fashion already or if it was randomly sorted.  

Thanks,

Andrew Maurer

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Jeph Herrin
Sent: Thursday, March 13, 2014 8:29 AM
To: [email protected]
Subject: Re: large data sets (was st: A faster way to gsort)


On 3/12/2014 11:54 PM, Joseph Coveney wrote:
>
>
> As for #1, wouldn't additional RAM be cheaper than a SAS license?  And if
> you're maxed-out on memory slots, wouldn't even a more powerful workstation be
> cheaper than a SAS license?
>

My institutional SAS 9.4 license runs me $49, so no.

More pointedly, in this situation, I must work remotely (because the 
database is on the order of several TB, and for data security reasons), 
so I don't have a lot of control over the environment.


> I don't quite follow #3.  Aren't Stata's data management operations
> incremental?  I find a series of Stata's data management commands much easier to
> walk through than a single SQL statement stretching for pages.
>

I wasn't very clear here. But when working with a >1TB database, it's 
not practical to do everything in either SAS or Stata. But to *avoid* 
writing pages of SQL one wants to submit a query that (say) pulls down a 
list of identifiers, then submit a second query that uses that list of 
identifiers to pull down related records. To do this second step in 
Stata, one would need to be able to write SQL that referenced a Stata 
file. The alternative to this incremental approach would be to write 
unreadable SQL queries.

Obviously, we all have different wants and expections from Stata. For 
me, this is the first 'big data' application I've had for Stata, and it 
hasn't done well; I have other 'big data' proposals coming up, and 
unfortunately I'm going to have to hedge my endorsement of Stata for 
this kind of work.

cheers,
J



> As for Stata's doing SQL natively, there is a comment to a post on the Stata
> Blog similarly calling for Stata to adopt SQL standard syntax.  I know that
> Jeff's comment goes beyond that, almost as if to have an ODBC driver or
> OLE DB provider for Stata dataset files.
>
> I like SQL and use it daily, but I wouldn't want StataCorp to expend its finite
> development resources in that direction.  I say this for a number of reasons
> (for a couple of examples:  the three-valued logic of NULLs and other
> peculiarities of SQL; considerations of when ad hoc SQL queries should be
> permitted and where upstream data management operations should be manifest for
> reasons of efficiency, security and regulatory compliance).
>
> So, if there's a wish-list poll somewhere for Stata 14, put me down as against
> SQL in favor of, say, -strunicode-, -menl-, -mcmc- or something along those
> lines.
>
> Joseph Coveney
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/
>
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/



*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index