Statalist The Stata Listserver



Re: st: hardware + OS for large datasets


From   Joseph Coveney <jcoveney@bigplanet.com>
To   Statalist <statalist@hsphsun2.harvard.edu>
Subject   Re: st: hardware + OS for large datasets
Date   Sun, 22 Oct 2006 13:24:13 +0900

Jeph Herrin wrote:

[snip]
I am about to start two projects that involve very
large datasets, and I need to make some decisions about
whether the largest chunk I can handle in Stata will be
adequate. The budget for the hardware is very generous
and as a former Unix administrator I'd welcome a chance to
have a Linux box here again (though I suppose win64 is
also an option), so I'm thinking a new Linux box with
gobs of RAM.

However, other than the theoretical limit of the 64-bit
address space, I wonder what it is like in practice to
load and save (say) 20GB datasets using Stata/MP (or SE).
Does the Stata memory model (such a huge boon for smaller
datasets) have practical limitations? How about 64GB datasets?
I'm concerned about spending a fortune on RAM and then finding
it's not practical to work with.

This is particularly an issue because the investment will be
funded by a group that maintains the database in S#S, and
they would rather just buy me a S#S license; if I go wrong,
it won't be easy to go back for another kick. So I'd very
much like to hear about any experiences, good or bad, of
those working with very large datasets, and what their
insight into OS and number of processors (or cores) might
be.

--------------------------------------------------------------------------------

You'll need to pardon my naïve attempt on this, but no one else seems to
have taken you up.  I can't answer your questions other than to note that at
least one of the methods that Stata uses for data management (viz., merge)
relies upon sorting the datasets, and sorting scales superlinearly (at
best on the order of n log n) with the number of observations, regardless
of whether the dataset can be loaded at once into memory.  (Stata does use
some kind of indexing system for the dataset in memory--you can see the
"overhead (pointers)" item when you examine memory--but I'm not sure how
much help it lends for sorting activities.)  Assuming that your colleagues'
SAS "tables" aren't particularly highly normalized, with sizes in the 20- to
64-gigabyte range, it seems as if you're looking at tens of millions of
observations to perhaps a couple of hundred million observations.  So, if
you're loading such datasets into memory in order to sort them in
preparation for merges, and if you plan on doing this on a frequent basis,
then you'd be better off rethinking your strategy.  SAS has a couple of
methods that it can use in merges, such as user formats and hash tables, but
these can have practical limitations, too.  You can use SQL in SAS for a
join on indexed SAS datasets, but apparently its query optimizer opts to do
a sort on occasion anyway, much to the user's consternation.
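To make the contrast concrete, here is a minimal sketch in plain Python (not Stata or SAS code; the function names are my own invention) of the two join strategies mentioned above: a sort-merge join, whose cost is dominated by sorting both inputs, versus a hash join, which skips the sorts entirely but must fit its lookup table in memory. The sketch assumes unique keys, i.e. a 1:1 merge.

```python
# Illustrative sketch only -- not how Stata's -merge- or SAS's hash
# objects are actually implemented. Assumes unique keys (a 1:1 merge).

def sort_merge_join(left, right):
    """Join two lists of (key, value) pairs by sorting both first.
    The O(n log n) sorts dominate the cost on large inputs."""
    left = sorted(left)
    right = sorted(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            out.append((left[i][0], left[i][1], right[j][1]))
            i += 1
            j += 1
    return out

def hash_join(left, right):
    """Join by building an in-memory lookup table on one input,
    roughly the idea behind SAS hash-object merges. One linear pass
    over each input, but the table must fit in memory."""
    index = {k: v for k, v in right}
    return [(k, v, index[k]) for k, v in left if k in index]
```

With datasets in the tens of millions of observations, the difference between re-sorting before every merge and a single linear pass is exactly the practical limitation at issue.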

On the other hand, if you need to run linearly through the entire file for
other purposes, then most of the time spent would be with physical file I/O.
SAS's ability to put indexes on its datasets might help somewhat in some of
these kinds of tasks (perhaps in subsetting, for example), but I'm not sure
to what extent you'll see practical timesavings, especially if the dataset
isn't static.
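As a rough illustration of what an index buys you for subsetting (again plain Python, with made-up names, not SAS's actual index machinery): one full pass builds a key-to-row-position map, after which pulling a subset touches only the matching rows instead of scanning the whole file. The caveat about non-static data shows up here too, since the index has to be rebuilt or maintained whenever rows change.

```python
# Hypothetical sketch of dataset indexing for subsetting; names are
# illustrative, not any package's API.

def build_index(rows, key_field):
    """One full pass over the data builds the index."""
    index = {}
    for pos, row in enumerate(rows):
        index.setdefault(row[key_field], []).append(pos)
    return index

def subset(rows, index, key):
    """After indexing, subsetting by key touches only matching rows."""
    return [rows[pos] for pos in index.get(key, [])]
```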

Calculating statistics with very large numbers of observations risks
encountering numerical problems.  From posts by Bill Gould on the subject
over the years, it's clear that StataCorp has thought about this.  SAS
Institute undoubtedly also has, so from this standpoint, it would probably
be a wash.  Depending upon what you're doing, you could also risk running
out of space trying to accommodate large sparse design matrixes, and I'm not
certain that SAS or Stata would be any different here, either.  In addition,
for statistical analyses beyond what can be done with accumulators, it would
seem that SAS, too, needs to read all of the observations into memory.
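For what "statistics that can be done with accumulators" means in practice, here is a small sketch (plain Python, not either package's internals) of Welford's one-pass algorithm for the mean and variance. It streams through the observations without ever holding them all in memory, and it sidesteps the catastrophic cancellation that the naive sum-of-squares formula suffers at very large n, which is the kind of numerical problem alluded to above.

```python
# Welford's one-pass mean/variance: an accumulator-style statistic.
# Illustrative only; not how Stata or SAS computes these internally.

def running_mean_variance(observations):
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in observations:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else float('nan')
    return n, mean, variance
```

Anything that can be phrased this way scales to arbitrarily large files; it is the estimators that genuinely need the whole design matrix at once that force everything into memory.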

I guess a lot of the decision depends upon what you plan to read an entire
64-gigabyte SAS dataset into memory for.  If it's for data management, then
perhaps it might not be practical to use Stata regardless of how much RAM
you have.  But on the other hand, it's not certain that you'd be better off
with some RDBMS wannabe, either.  If it's not primarily for data management,
then it might not make much difference whether you use Stata or SAS, other
than for the conventional considerations of repertoire and ease of use.

Joseph Coveney

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


