Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

RE: large data sets (was st: A faster way to gsort)


From   Andrew Maurer <[email protected]>
To   "[email protected]" <[email protected]>
Subject   RE: large data sets (was st: A faster way to gsort)
Date   Thu, 13 Mar 2014 15:09:28 +0000

Thanks for all the response to this thread.

Maarten,
I'm sure that Stata's sort algorithm is very strong in terms of sorting random data. In my case, in 5 hours spent sorting, the time is broken down into approximately 5 1-hour sorts.  Although I don't know what's going on behind the scenes, it seems to me like Stata is not making use of previous sort order when sorting.

For example, my data might be panel data, including: country, region, person_id, year, age, and birthyear (birthyear = year - age). I want to get some summary information of the data, say:
1) the average age of a person by country, region, and year
2) the number of distinct people observed in each country
3) the number of people who transitioned between countries each year, by country and year
etc

eg, I need to change between 3 sort orders:
1) sort country region year
2) sort region person_id
3) sort person_id year (to first tag the observations where person_id == person_id[_n-1] & country != country[_n-1]), then sort country year (to do the collapse)

I would think that there should be a way to use previous sort order to make the current sort more efficient. In practice, using timers in my code, it appears that the time it takes to do a sort is the same whether the data was sorted in a similar fashion already or if it was randomly sorted.  

Thanks,

Andrew Maurer

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Jeph Herrin
Sent: Thursday, March 13, 2014 8:29 AM
To: [email protected]
Subject: Re: large data sets (was st: A faster way to gsort)


On 3/12/2014 11:54 PM, Joseph Coveney wrote:
>
>
> As for #1, wouldn't additional RAM be cheaper than a SAS license?  And if
> you're maxed-out on memory slots, wouldn't even a more powerful workstation be
> cheaper than a SAS license?
>

My institutional SAS 9.4 license runs me $49, so no.

More pointedly, in this situation, I must work remotely (because the 
database is on the order of several TB, and for data security reasons), 
so I don't have a lot of control over the environment.


> I don't quite follow #3.  Aren't Stata's data management operations
> incremental?  I find a series of Stata's data management commands much easier to
> walk through than a single SQL statement stretching for pages.
>

I wasn't very clear here. But when working with a >1TB database, it's 
not practical to do everything in either SAS or Stata. But to *avoid* 
writing pages of SQL one wants to submit a query that (say) pulls down a 
list of identifiers, then submit a second query that uses that list of 
identifiers to pull down related records. To do this second step in 
Stata, one would need to be able to write SQL that referenced a Stata 
file. The alternative to this incremental approach would be to write 
unreadable SQL queries.

Obviously, we all have different wants and expections from Stata. For 
me, this is the first 'big data' application I've had for Stata, and it 
hasn't done well; I have other 'big data' proposals coming up, and 
unfortunately I'm going to have to hedge my endorsement of Stata for 
this kind of work.

cheers,
J



> As for Stata's doing SQL natively, there is a comment to a post on the Stata
> Blog similarly calling for Stata to adopt SQL standard syntax.  I know that
> Jeff's comment goes beyond that, almost as if to have an ODBC driver or
> OLE DB provider for Stata dataset files.
>
> I like SQL and use it daily, but I wouldn't want StataCorp to expend its finite
> development resources in that direction.  I say this for a number of reasons
> (for a couple of examples:  the three-valued logic of NULLs and other
> peculiarities of SQL; considerations of when ad hoc SQL queries should be
> permitted and where upstream data management operations should be manifest for
> reasons of efficiency, security and regulatory compliance).
>
> So, if there's a wish-list poll somewhere for Stata 14, put me down as against
> SQL in favor of, say, -strunicode-, -menl-, -mcmc- or something along those
> lines.
>
> Joseph Coveney
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/faqs/resources/statalist-faq/
> *   http://www.ats.ucla.edu/stat/stata/
>
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/



*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/faqs/resources/statalist-faq/
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index