Re: st: data set larger than RAM
I have had lots of difficulties similar to the one you are encountering.
In the old days, some packages allowed you simply to use hard-disk space
as RAM. This was very slow, but faster than not being able to estimate the
model at all. Now that RAM has generally grown, that approach has fallen
by the wayside. That seems a pity: of course RAM is more efficient in
terms of time, but it is impossible to know whether someone somewhere will
need more memory than is physically available. The hard-disk-as-RAM
option was a nice way out of that difficulty if there were no other.
An alternative that may work for you is to use the -contract- command to
make a dataset of frequencies. If you do that for groups separately, you
should be able to greatly reduce the size of the dataset, provided the
combinations of values on the regressors are not such that each
observation has its own unique combination. This may be a quicker way to
go, and perhaps less cumbersome.
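The algebra behind that -contract- trick can be sketched numerically (here
in Python/NumPy rather than Stata, purely as an illustration with made-up
toy data): collapse to the unique regressor combinations with their
frequencies, then run frequency-weighted least squares on the collapsed
data, which reproduces the full-sample OLS point estimates.

```python
import numpy as np

# Toy data: many observations, but only a few distinct regressor rows.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100_000, 3)).astype(float)   # binary regressors
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100_000)

# "Contract" step: unique rows, their frequencies, and the group mean of y.
Xu, inverse, counts = np.unique(X, axis=0, return_inverse=True,
                                return_counts=True)
ybar = np.bincount(inverse, weights=y) / counts

# Frequency-weighted least squares on the collapsed data.
W = counts[:, None]
beta_contracted = np.linalg.solve(Xu.T @ (W * Xu), Xu.T @ (counts * ybar))

# Full-sample OLS for comparison.
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
```

With three binary regressors there are at most 8 distinct rows instead of
100,000, yet `beta_contracted` matches `beta_full`, because X'X and X'y
are just frequency-weighted sums over the distinct combinations.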
As for your solution, it seems reasonable to me. So you are in luck;
sometimes one cannot estimate the model so easily without having access
to all the data, in RAM or on disk, at the same time.
On Wed, 2 Nov 2005, Thomas Cornelißen wrote:
> Dear Stata Users,
> suppose you had K=5000 regressors and N=10 million observations. This
> might be from a linked employer-employee dataset where you explicitly
> include all firm dummies (and algebraically sweep out person effects
> through the within transformation) in order to compute person and firm
> effects.
> As I understand it, Stata SE is capable of using up to 11,000 variables.
> But in this case the data matrix would be 50 GB, assuming each regressor
> can be stored as a 1-byte variable.
> What do you think of the following solutions, which will necessarily
> require using the data partly from the hard disk rather than from RAM:
> 1) Store the data in several files, two of which can be loaded into
> memory at a time. Compute the elements of X'X and X'y by successively
> loading all possible pairwise combinations of the files into memory and
> multiplying the data matrices. Store the results and use them to
> assemble X'y and X'X; the latter is then a 5000 x 5000 matrix that fits
> into memory and can be inverted by Stata. As I might with some luck have
> 16 GB of RAM available, this would mean about 7 sub-datasets of about
> 7 GB each (i.e. about 700 regressors and 10 million observations per
> sub-dataset).
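[The accumulation described in 1) can be sketched as follows; this is
Python/NumPy rather than Stata, and the splitting of an in-memory matrix
stands in for loading each column block from its own file:]

```python
import numpy as np

# Small stand-in for a matrix whose column blocks live in separate files.
rng = np.random.default_rng(1)
N, K, B = 50_000, 20, 4              # observations, regressors, blocks
X = rng.normal(size=(N, K))
y = rng.normal(size=N)
blocks = np.array_split(X, B, axis=1)     # the "sub-datasets"

# Accumulate X'X pairwise over blocks, and X'y block by block.
XtX = np.empty((K, K))
Xty = np.empty(K)
offsets = np.cumsum([0] + [b.shape[1] for b in blocks])
for i, Xi in enumerate(blocks):
    Xty[offsets[i]:offsets[i + 1]] = Xi.T @ y
    for j, Xj in enumerate(blocks):
        if j < i:
            continue                      # X'X is symmetric: upper triangle
        cross = Xi.T @ Xj
        XtX[offsets[i]:offsets[i + 1], offsets[j]:offsets[j + 1]] = cross
        XtX[offsets[j]:offsets[j + 1], offsets[i]:offsets[i + 1]] = cross.T

beta = np.linalg.solve(XtX, Xty)
```

Only two blocks are ever needed simultaneously, and by symmetry only
B(B+1)/2 of the B^2 block products must actually be computed.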
> 2) I read that SAS is able to handle datasets that do not fit into RAM.
> Does SAS do something like what I described under 1)? Is there a good
> discussion forum about SAS that might give advice on whether such a
> large dataset can be handled by the software? (I am sorry to ask this on
> a Stata list.)
> 3) I have the feeling that with 1) and 2) I exchange the space
> restriction for a time restriction. Either solution, 1) or 2), might be
> excessively time-consuming, so that it is simply illusory to compute
> least-squares estimates from such a huge dataset. Before computing X'X I
> need to time-demean the firm dummies (the within transformation), which
> might also be very time-consuming when there are 5000 regressors and
> 2 million persons observed over 5 time periods. Presumably the time
> restriction is real; that is why several authors propose alternative
> estimation methods for person and firm effects in linked
> employer-employee data. Do you have any suggestions about how to
> estimate the time needed?
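[A rough flop count gives a feel for that time restriction. The
sustained-throughput figure below is an assumption for illustration, not
a measurement of any particular machine:]

```python
# Back-of-envelope cost of brute-force OLS at the sizes described above.
N = 10_000_000        # observations
K = 5_000             # regressors
flops_per_sec = 1e9   # assumed sustained multiply-adds per second

xtx_flops = N * K * K / 2      # forming X'X, exploiting symmetry
solve_flops = K ** 3 / 3       # factorising the K x K cross-product
demean_flops = 2 * N * K       # within transformation: subtract group means

total_hours = (xtx_flops + solve_flops + demean_flops) / flops_per_sec / 3600
print(f"X'X dominates at ~{xtx_flops:.1e} multiply-adds; "
      f"total roughly {total_hours:.0f} hours at the assumed rate")
```

The within transformation and the 5000 x 5000 solve are cheap by
comparison; forming X'X is the bottleneck, at about 1.25e14 multiply-adds.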
> Has anybody had a similar problem or thought this through? I would be glad
> for any comments.
> Thomas Cornelissen
> Institute of Empirical Economic Research
> University of Hannover, Germany
> * For searches and help try:
> * http://www.stata.com/support/faqs/res/findit.html
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/