The Stata listserver

Re: st: data set larger than RAM


From   [email protected] (William Gould, Stata)
To   [email protected]
Subject   Re: st: data set larger than RAM
Date   Thu, 10 Nov 2005 12:28:56 -0600

Thomas Cornelissen <[email protected]> wrote, 

> This is a follow-up on my previous question "st: data set larger than RAM"
> on handling large datasets (on the order of 10 million observations).
>
> [...]
>
>
> I was advised to
> - use quad precision, for instance quadcross() available in Mata
> - normalize the variables to mean 0 and variance 1
> - use solver functionality instead of inverses
> - take much care and double-check results

Yes, in a nutshell.  In terms of double-checking results, I suggested 
drawing a random subsample of the data (say 50,000, or 100,000, or 200,000
obs., as convenient) and then using standard Stata commands.  The results
Thomas produces with the full dataset should be similar to those produced with
the subsample.

Hence, Thomas may decide not to do all of the above.  He might, for instance,
omit normalization.  If he finds similar results, then that would be 
adequate.
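
For concreteness, here is a minimal sketch of that check in standard
Stata; the dataset and variable names are hypothetical:

    . use fulldata, clear          // hypothetical 10-million-obs dataset
    . set seed 12345               // make the subsample reproducible
    . sample 100000, count         // keep a random 100,000 observations
    . regress y x1 x2 x3           // compare against the full-data results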

> (1) Can I improve on this strategy if I replace quadcross() by doing the
> cross-product sums manually, using the "mean-update rule" that Bill mentioned
> in the previous discussion?  Or does quadcross() itself employ the mean-update
> rule, or take care of the problem by proceeding from the smallest to the
> largest number when summing up?

With a million observations, use quadcross().  It would not be worth the
effort to implement the mean-update rule.  The advantage of the mean-update
rule is savings in memory and, perhaps, computer time.  There is a value of N
such that, for n > N, the mean-update rule will be more accurate even than
quad precision, but that N is far larger than a mere million observations.
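
For illustration, here is a minimal Mata sketch combining quadcross() with
a solver rather than an explicit inverse; the variable names are
hypothetical:

    mata:
        // views avoid copying the large dataset into Mata's memory
        st_view(y=., ., "y")
        st_view(X=., ., ("x1", "x2", "x3"))

        // quad-precision accumulation of the cross products;
        // the 1 flag appends a constant column to X
        XX = quadcross(X, 1, X, 1)
        Xy = quadcross(X, 1, y, 0)

        // solve the normal equations rather than inverting X'X
        b = cholsolve(XX, Xy)
        b
    end

cholsolve() returns missing values if X'X is not positive definite;
qrsolve() or svsolve() could be substituted in that case.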

> (2) Should I also normalize dummy variables and categorical variables to mean 
> 0 and variance 1 when computing X'X?  (I worry that this changes categorical 
> variables from integers to real numbers and therefore increases the memory 
> needed.)

That is not necessary.  Given that you are using quad precision for the matrix
sum, you may be able to skip normalization altogether, especially if your
variables have reasonable (approximately equal) means.  You want to avoid
having one variable with a mean of .0001 and another with a mean of 3,960,100
(which is 1990^2, and I have seen people include calendar-year squared in
regressions).  If all variables have means between 0 and 100, I would not
expect difficulty.
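
For example, rather than including calendar year squared directly, one can
center year first so that the means stay on a comparable scale; a
hypothetical sketch:

    . summarize year x1 x2               // inspect the means first
    . generate double yearc  = year - 1990
    . generate double yearc2 = yearc^2   // mean near 0, not near 3,960,100

Centering changes the interpretation of the coefficients but not the fit of
the regression.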

-- Bill
[email protected]
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


