
Re: large data sets (was st: A faster way to gsort)


From   "Joseph Coveney" <[email protected]>
To   <[email protected]>
Subject   Re: large data sets (was st: A faster way to gsort)
Date   Fri, 14 Mar 2014 11:44:09 +0900

Jeph Herrin wrote:

>
> As for #1, wouldn't additional RAM be cheaper than a SAS license?  And if
> you're maxed-out on memory slots, wouldn't even a more powerful workstation be
> cheaper than a SAS license?
>

My institutional SAS 9.4 license runs me $49, so no.

More pointedly, in this situation I must work remotely (both because
the database is on the order of several TB and for data security
reasons), so I don't have much control over the environment.

> I don't quite follow #3.  Aren't Stata's data management operations
> incremental?  I find a series of Stata's data management commands much
> easier to walk through than a single SQL statement stretching for pages.
>

I wasn't very clear here. When working with a >1TB database, it's not 
practical to do everything in either SAS or Stata. But to *avoid* 
writing pages of SQL, one wants to submit a query that (say) pulls down 
a list of identifiers, then submit a second query that uses that list 
of identifiers to pull down the related records. To do this second step 
in Stata, one would need to be able to write SQL that references a 
Stata file. The alternative to this incremental approach is to write 
unreadable SQL queries.
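
For concreteness, a minimal sketch of that two-step approach from 
Stata's side, assuming an ODBC source named clinDB and hypothetical 
table and variable names (a numeric patient_id), with the identifier 
list spliced into the text of the second query via a macro:

    * query 1: pull down the list of identifiers
    odbc load, exec("SELECT DISTINCT patient_id FROM admissions") ///
        dsn("clinDB") clear

    * query 2: splice the identifiers into the text of a second query
    levelsof patient_id, local(ids) separate(,)
    odbc load, exec("SELECT * FROM labs WHERE patient_id IN (`ids')") ///
        dsn("clinDB") clear

With millions of identifiers this breaks down--macros and SQL IN lists 
both have length limits--which is exactly the gap: there is no way to 
hand the server a Stata file of keys to join against.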

Obviously, we all have different wants and expectations of Stata. For 
me, this is the first 'big data' application I've had for Stata, and it 
hasn't done well; I have other 'big data' proposals coming up, and 
unfortunately I'm going to have to hedge my endorsement of Stata for 
this kind of work.

--------------------------------------------------------------------------------

It sounds like you're pulling modest-to-large result sets out of the database,
saving them as SAS dataset files, and then going back and sort-merging them via
PROC SQL against multigigabyte result sets likewise pulled out of the database
en passant--a situation that even SAS aficionados recommend avoiding in favor
of pass-through queries.

I can sympathize about not having any say over the equipment you rely upon to do
your job, but I still recommend what I mentioned yesterday:  push the queries
back into the RDBMS where SQL can shine in its natural environment.  Use indexed
views and temporary tables as warranted, take advantage of common table
expressions etc. to aid in query construction, pay attention to indexing and
hinting, and wrap things in stored procedures that Stata can call to get its
analysis datasets ready-made, or something reasonably close to it.  You can
always allow parameters in your stored procedures if you need to provide for
additional flexibility in what the returned analysis datasets contain.  Your DBA
is already allowing you ad hoc queries against the database, so security
concerns are apparently satisfied.
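
As a rough sketch of that last step--SQL Server syntax assumed, with an 
ODBC DSN named clinDB and table, column, and procedure names all 
hypothetical--the procedure is created once on the server, and Stata 
then pulls the ready-made result set:

    * one time, server side: wrap the query logic in a parameterized
    * stored procedure
    local create                                                 ///
        CREATE PROCEDURE dbo.get_cohort @yr int AS               ///
        SELECT l.* FROM labs AS l                                ///
        WHERE l.patient_id IN                                    ///
        (SELECT patient_id FROM admissions WHERE admit_year = @yr)
    odbc exec(`"`create'"'), dsn("clinDB")

    * thereafter, each analysis dataset is a single call
    odbc load, exec("EXEC dbo.get_cohort @yr = 2013") dsn("clinDB") clear

The join of the identifier list against the related records happens 
entirely server side; Stata only ever sees the finished analysis 
dataset.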

And then reward yourself for saving $49 by cutting out the middle man (as well
as for saving umpteen cumulative hours of query-processing time).

Joseph Coveney


