Re: st: Machine spec for 70GB data, Summary

From	Gindo Tampubolon <[email protected]>
To	"[email protected]" <[email protected]>
Subject	Re: st: Machine spec for 70GB data, Summary
Date	Mon, 24 Oct 2011 08:11:14 +0000

Dear all,

Thanks for all the informative and prompt replies, in particular to Yuval, Jeroen, Billy, Dan, Joerg, and Buzz. It seems worthwhile to explore other ways/platforms for doing this.
Gindo

----------------------------------------------------------------------
Jeroen wrote:
I read your question on using Stata to fit large cross-classified models on a 70GB dataset.
I am afraid the performance is very problematic. While I use Stata for most of my work,
fitting mixed models in Stata is somewhat problematic: it is too inefficient. Recently, a tool
has become available to fit mixed models (including models with crossed REs) in MLwiN from
within Stata (search for runmlwin), and the performance difference is staggering.
----------------------------------------------------------------------
Yuval wrote:
Are you sure the data file is 70GB? I'm using the Windows operating system
and I recently succeeded in running a file of 1.29 GB that includes over
4 million observations. Here are a few lines from the do-file. Just
make sure to use the "set memory" command:
----------------------------------------------------------------------
Billy wrote:
Contrary to prior responses to your request, the set memory command is unnecessary when using Stata 12. If your dataset is 70GB, you would need at least that much RAM in addition to the RAM necessary for your computer to run.
----------------------------------------------------------------------
Dan wrote:
Once you have the 64-bit versions of the operating system and Stata, Linux
vs. Windows won't make much difference, but you really need to establish how
much memory you will need. Machines that offer more than 24GB of memory
are much more expensive than smaller machines, so you can save quite a bit
if you can limit your maximum "set memory" to 18 GB or so.

If you are able to read a subset of the data into a machine you already
have, that can give you an idea of how much memory you will need for the
full dataset. You say "a few million observations" but unless "few" means
thousands you should be able to get by with far less than 70GB of memory.
You don't say how many variables, or how many are float or int. If you
have 250 ints, you can store nearly a million observations per GB. Stata
doesn't need much more memory than that which is used for the data.
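
The sizing arithmetic above can be sketched as follows. This is only a
rough estimate under stated assumptions: the byte widths are Stata's
standard numeric storage types, 1 GB is taken as 1024^3 bytes, string
variables and Stata's own overhead are ignored, and the function names are
made up for illustration. Note that 250 variables stored as 4-byte long or
float give about a million observations per GB, while 250 two-byte int
variables give roughly twice that.

```python
# Bytes per value for Stata's numeric storage types.
STATA_BYTES = {"byte": 1, "int": 2, "long": 4, "float": 4, "double": 8}

GIB = 1024 ** 3  # assumption: 1 GB counted as 1024^3 bytes


def bytes_per_observation(var_counts):
    """Width of one observation, given {storage_type: number_of_variables}."""
    return sum(STATA_BYTES[t] * n for t, n in var_counts.items())


def observations_per_gib(var_counts):
    """Whole observations that fit in one GB of data storage."""
    return GIB // bytes_per_observation(var_counts)


def data_gib(n_obs, var_counts):
    """Approximate data size in GB for n_obs observations."""
    return n_obs * bytes_per_observation(var_counts) / GIB


if __name__ == "__main__":
    print(observations_per_gib({"int": 250}))    # 2-byte ints
    print(observations_per_gib({"float": 250}))  # 4-byte floats
    print(round(data_gib(4_000_000, {"float": 250}), 2))
```

Running the same calculation on the variable types reported for a subset of
the data gives a first guess at the memory the full dataset would need.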

I have posted some suggestions for working with large datasets in Stata at
http://www.nber.org/sys-admin/large-stata-datasets.html

the main point of which is that if you separate the sample selection from
the analysis steps, it is possible to work with very large datasets in
reasonable core sizes (if the analysis is only on a subset, of course).
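
A minimal sketch of that idea, written in Python rather than Stata and
assuming (purely for illustration) that the raw data are available as a
delimited text export: the file is streamed in chunks, only the needed
columns are kept, and only the rows passing a selection rule are retained
for the later analysis step. The function names, column names, and the
selection rule are all hypothetical.

```python
import csv
import io
from itertools import islice


def read_in_chunks(lines, keep, chunk_size):
    """Yield lists of at most chunk_size rows, each reduced to the `keep` columns."""
    reader = csv.DictReader(lines)
    while True:
        chunk = [{k: row[k] for k in keep} for row in islice(reader, chunk_size)]
        if not chunk:
            return
        yield chunk


def select_sample(lines, keep, predicate, chunk_size=100_000):
    """Selection step: retain only the rows for which predicate(row) is true."""
    selected = []
    for chunk in read_in_chunks(lines, keep, chunk_size):
        selected.extend(row for row in chunk if predicate(row))
    return selected


if __name__ == "__main__":
    # Tiny in-memory stand-in for a large export (hypothetical columns).
    raw = io.StringIO("id,year,wage,notes\n1,2005,10,a\n2,2006,12,b\n3,2005,9,c\n")
    sample = select_sample(raw, ["id", "year"], lambda r: r["year"] == "2005",
                           chunk_size=2)
    print(sample)
```

Only one chunk plus the selected rows is held in memory at a time, so the
peak footprint depends on the size of the analysis sample rather than on
the size of the full file.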

There is some information on the Stata website:
http://www.stata.com/support/faqs/win/winmemory.html
http://www.stata.com/support/faqs/data/dataset.html

It is possible to get computers with up to 256 GB of memory for
reasonable prices (for some definitions of reasonable, such as
$US25,000) and that can be convenient. It probably isn't necessary,
though.
----------------------------------------------------------------------
Joerg wrote:
What are "a few millions"? If by that you mean like a handful then you
must have a ton of variables. If you do not need all of them for your
analyses, you can read the data in chunks, set up the variables you
need, and eventually put it together again. However, in my experience
it seems difficult to fit more complicated multilevel models in Stata
when sample size becomes large. I find this to be especially true in
the case of models with crossed random effects. So just beware, even
if you get all the data you want into memory, you may not be able to
run the model you propose.
----------------------------------------------------------------------
Buzz wrote:
I concur with Joerg Luedicke's Statalist response to your question. My
experience is similar to his in that large complicated multilevel models may
be extremely time consuming to fit.

See http://www.stata.com/statalist/archive/2010-09/msg00424.html
which indicates one problem, although cluster-robust SEs are available in
-xtmixed- for Stata 12.

Also, there will be little advantage to MP processing for -xtmixed-.
See page 33 at http://www.stata.com/statamp/statamp.pdf
----------------------------------------------------------------------
