Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down on April 23, and its replacement, **statalist.org**, is already up and running.


From: Gindo Tampubolon <Gindo.Tampubolon@manchester.ac.uk>

To: "statalist@hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>

Subject: Re: st: Machine spec for 70GB data, Summary

Date: Mon, 24 Oct 2011 08:11:14 +0000

Dear all,

Thanks for all the informative and prompt replies, in particular to Yuval, Jeroen, Billy, Dan, Joerg, and Buzz. It seems worthwhile to explore other ways/platforms for doing this stuff.

Gindo

----------------------------------------------------------------------
Jeroen wrote:

I read your question on using Stata to fit large cross-classified models on a 70GB dataset. I am afraid the performance is very problematic. While I use Stata for most of my work, fitting mixed models in Stata is somewhat problematic -- too inefficient. Recently, a tool has become available to fit mixed models (including models with crossed random effects) in MLwiN from within Stata (search for runmlwin) -- the performance difference is staggering.

----------------------------------------------------------------------
Yuval wrote:

Are you sure the data file is 70GB? I'm using the Windows operating system and I recently succeeded in running a file of 1.29GB that includes over 4 million observations. Here are a few rows from the do-file. Just make sure to use the "set memory" command:

----------------------------------------------------------------------
Billy wrote:

Contrary to prior responses to your request, the set memory command is unnecessary when using Stata 12. If your dataset is 70GB, you would need at least that much RAM in addition to the RAM necessary for your computer to run.

----------------------------------------------------------------------
Dan wrote:

Once you have the 64-bit versions of the operating system and Stata, Linux vs. Windows won't make much difference, but you really need to establish how much memory you will need. Machines that offer more than 24GB of memory are much more expensive than smaller machines, so you can save quite a bit if you can limit your maximum "set memory" to 18GB or so. If you are able to read a subset of the data into a machine you already have, that can give you an idea of how much memory you will need for the full dataset.
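[Editor's note: Dan's suggestion of reading a subset to project the full memory requirement amounts to simple linear scaling. A minimal sketch; the function name and the numbers are illustrative, not from the thread:]

```python
def projected_memory_gb(subset_bytes, subset_obs, total_obs):
    """Scale the memory used by a subset of observations up to the full dataset."""
    per_obs = subset_bytes / subset_obs          # average bytes per observation
    return per_obs * total_obs / 1024 ** 3       # projected size in GB

# Illustrative numbers: a 1M-observation subset occupying 1 GB in memory,
# projected to a 5M-observation full dataset.
print(round(projected_memory_gb(1024 ** 3, 1_000_000, 5_000_000), 1))  # 5.0
```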
You say "a few million observations" but unless "few" means thousands, you should be able to get by with far less than 70GB of memory. You don't say how many variables, or how many are float or int. If you have 250 ints, you can store nearly a million observations per GB. Stata doesn't need much more memory than that used for the data. I have posted some suggestions for working with large datasets in Stata at http://www.nber.org/sys-admin/large-stata-datasets.html, the main point of which is that if you separate the sample selection from the analysis steps, it is possible to work with very large datasets in reasonable core sizes (if the analysis is only on a subset, of course). There is some information on the Stata website:

http://www.stata.com/support/faqs/win/winmemory.html
http://www.stata.com/support/faqs/data/dataset.html

It is possible to get computers with up to 256GB of memory for reasonable prices (for some definitions of reasonable, such as US$25,000), and that can be convenient. It probably isn't necessary, though.

----------------------------------------------------------------------
Joerg wrote:

What are "a few million"? If by that you mean a handful, then you must have a ton of variables. If you do not need all of them for your analyses, you can read the data in chunks, set up the variables you need, and eventually put it together again. However, in my experience it seems difficult to fit more complicated multilevel models in Stata when the sample size becomes large. I find this to be especially true in the case of models with crossed random effects. So just beware: even if you get all the data you want into memory, you may not be able to run the model you propose.

----------------------------------------------------------------------
Buzz wrote:

I concur with Joerg Luedicke's Statalist response to your question. My experience is similar to his in that large, complicated multilevel models may be extremely time-consuming to fit.
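[Editor's note: Dan's observations-per-GB rule of thumb is easy to check with a back-of-the-envelope calculation. The byte widths below are Stata's documented storage types; the 250-variable mix is a hypothetical example. With 4-byte variables (long or float) the result is roughly a million observations per GB; Stata's 2-byte int would fit about twice as many.]

```python
# Byte widths of Stata's storage types.
STATA_WIDTHS = {"byte": 1, "int": 2, "long": 4, "float": 4, "double": 8}

def bytes_per_obs(var_types):
    """Width in bytes of one observation, given a list of storage-type names."""
    return sum(STATA_WIDTHS[t] for t in var_types)

def obs_per_gb(var_types):
    """Approximate number of observations that fit in 1 GB of data memory."""
    return (1024 ** 3) // bytes_per_obs(var_types)

# Hypothetical mix: 250 four-byte variables -> ~1.07 million observations/GB.
mix = ["float"] * 250
print(obs_per_gb(mix))
```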
See http://www.stata.com/statalist/archive/2010-09/msg00424.html, which indicates one problem, although cluster-robust SEs are available in -xtmixed- for Stata 12. Also, there will be little advantage from Stata/MP processing for -xtmixed-; see page 33 at http://www.stata.com/statamp/statamp.pdf

----------------------------------------------------------------------
*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:
- **Re: st: Machine spec for 70GB data, Summary**
  *From:* Yuval Arbel <yuval.arbel@gmail.com>
