This is a follow-up on my previous question "st: data set larger than RAM"
on handling large datasets (on the order of 10 million observations).
Bill Gould advised me to worry about numerical accuracy when working with
such large datasets.
If I understood it correctly: conventional regression tools may give
inaccurate results on such large datasets because of numerical precision
problems when summing so many numbers (the running sum grows large relative
to each new term, so low-order digits get lost).
I was advised to
- use quad precision, for instance quadcross() available in Mata (see the
  sketch after this list)
- normalize the variables to mean 0 and variance 1
- use solver functionality instead of computing inverses
- take great care and double-check the results
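To make sure I have understood the advice, this is roughly the skeleton I
have in mind in Mata (the variable names y, x1, x2, x3 are just placeholders
for my own data; the normalization step is left out here):

    mata:
        y  = st_data(., "y")
        X  = st_data(., "x1 x2 x3")
        XX = quadcross(X, 1, X, 1)    // (X,1)'(X,1) accumulated in quad precision
        Xy = quadcross(X, 1, y, 0)    // (X,1)'y accumulated in quad precision
        b  = cholsolve(XX, Xy)        // solve X'X b = X'y rather than invert X'X
        b                             // last element is the constant
    end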
I have two follow-up questions about this:
(1) Can I improve on this strategy by replacing quadcross() with
cross-product sums computed manually using the "mean update rule" that Bill
mentioned in the previous discussion? Or does quadcross() itself employ the
mean update rule, or take care of the problem by summing from the smallest
to the largest number?
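For clarity, this is what I understand by the mean update rule for a single
variable (my own sketch, not code from the earlier thread):

    mata:
        // running mean: m_k = m_{k-1} + (x_k - m_{k-1})/k, so the accumulator
        // stays on the scale of the data instead of growing like a raw sum
        real scalar mean_update(real colvector x)
        {
            real scalar m, k

            m = 0
            for (k = 1; k <= rows(x); k++) {
                m = m + (x[k] - m)/k
            }
            return(m)
        }
    end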
(2) Should I also normalize dummy variables and categorical variables to
mean 0 and variance 1 when computing X'X? (I worry that this turns variables
currently stored as integers into real numbers and therefore increases the
memory needed.)
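To make the storage-type worry concrete, this is the kind of thing I mean
(d1 is a placeholder name for a 0/1 dummy currently stored as byte):

    summarize d1
    replace d1 = (d1 - r(mean)) / r(sd)    // the byte dummy is promoted to float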
I would be grateful for any further comments or advice on these issues.
Thomas