
From: Thomas Cornelißen <cornelissen@mbox.iqw.uni-hannover.de>

To: <statalist@hsphsun2.harvard.edu>

Subject: Re: st: data set larger than RAM

Date: Thu, 10 Nov 2005 17:01:32 +0100

This is a follow-up to my previous question, "st: data set larger than RAM", about handling large datasets (on the order of 10 million observations).

Bill Gould advised me to worry about numerical accuracy when using such large datasets.

If I understood correctly, conventional regression tools may give inaccurate results on datasets this large, because of limited numerical precision when summing so many numbers.

I was advised to

- use quad precision, for instance quadcross() available in Mata

- normalize the variables to mean 0 and variance 1

- use solver functionality instead of inverses

- take great care and double-check results
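
To make the normalization and solver steps in this list concrete, here is a minimal sketch in plain Python (used only as a stand-in; it is not Mata code). It standardizes each variable to mean 0 and variance 1, forms the cross products, and solves the normal equations with a Cholesky factorization rather than an explicit inverse. The function names are hypothetical, and `math.fsum` (exactly rounded summation) is an assumption standing in for quad-precision accumulation such as quadcross(); it is not the same mechanism.

```python
import math

def mean_and_sd(column):
    """Return the mean and sample standard deviation of a column."""
    n = len(column)
    mean = math.fsum(column) / n
    var = math.fsum((v - mean) ** 2 for v in column) / (n - 1)
    return mean, math.sqrt(var)

def standardize(column):
    """Rescale a column to mean 0 and variance 1."""
    mean, sd = mean_and_sd(column)
    return [(v - mean) / sd for v in column]

def cholesky_solve(A, b):
    """Solve A z = b for symmetric positive definite A without inverting A."""
    k = len(A)
    L = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][m] * L[j][m] for m in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    # Forward substitution: L w = b
    w = [0.0] * k
    for i in range(k):
        w[i] = (b[i] - sum(L[i][m] * w[m] for m in range(i))) / L[i][i]
    # Back substitution: L' z = w
    z = [0.0] * k
    for i in reversed(range(k)):
        z[i] = (w[i] - sum(L[m][i] * z[m] for m in range(i + 1, k))) / L[i][i]
    return z

def standardized_ols(x_columns, y):
    """Standardized regression coefficients of y on the given columns."""
    zx = [standardize(c) for c in x_columns]
    zy = standardize(y)
    xtx = [[math.fsum(p * q for p, q in zip(a, b)) for b in zx] for a in zx]
    xty = [math.fsum(p * q for p, q in zip(a, zy)) for a in zx]
    return cholesky_solve(xtx, xty)
```

Because every variable is centered, no constant term is needed in the standardized system. A coefficient on the original scale can be recovered by multiplying the standardized coefficient by sd(y) / sd(x).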

I have two follow-up questions about this:

(1) Can I improve on this strategy if I replace quadcross() with cross-product sums computed manually using the "mean update rule" that Bill mentioned in the previous discussion? Or does quadcross() itself employ the mean update rule, or otherwise take care of the problem by summing from the smallest to the largest number?
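
For reference, the kind of manual accumulation meant in question (1) might look like the sketch below. It assumes that the "mean update rule" refers to the running-mean recurrence m_k = m_{k-1} + (v_k - m_{k-1}) / k applied to the products v_k = x_k * y_k; that reading is an assumption, not something confirmed in the earlier discussion. The code is plain Python with hypothetical function names, and it makes no claim about how quadcross() works internally.

```python
import math

def cross_product_naive(x, y):
    """Accumulate sum(x_i * y_i) in a single running total."""
    total = 0.0
    for xi, yi in zip(x, y):
        total += xi * yi
    return total

def cross_product_mean_update(x, y):
    """Accumulate the mean of x_i * y_i with the running-mean recurrence,
    then multiply by n to recover the sum."""
    mean = 0.0
    n = 0
    for xi, yi in zip(x, y):
        n += 1
        mean += (xi * yi - mean) / n
    return mean * n

def cross_product_reference(x, y):
    """Exactly rounded sum of the products, useful as a check."""
    return math.fsum(xi * yi for xi, yi in zip(x, y))
```

Comparing the first two functions against the reference on a sample of the real data is one way to see how much either approach drifts. Which one is closer depends on the data, so the comparison is worth running rather than assuming.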

(2) Should I also normalize dummy variables and categorical variables to mean 0 and variance 1 when computing X'X? (I worry that this changes categorical variables from integers to real numbers and therefore increases the memory needed.)
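
The memory concern in question (2) can be illustrated with a small sketch in plain Python, again only as a stand-in for Stata storage types. A dummy stored as a one-byte integer takes eight times the space once it is held as a double. One possible way around this, shown here as an assumption rather than a recommendation from the earlier discussion, is to keep the column in its compact form and standardize each value on the fly while accumulating the cross product. The function names are hypothetical.

```python
import math
from array import array

# A 0/1 dummy stored compactly (1 byte per value) and as doubles (8 bytes).
dummy = array("b", [0, 1, 1, 0, 1, 0, 0, 1])
as_double = array("d", list(dummy))

def mean_sd(values):
    """Return the mean and sample standard deviation of a column."""
    n = len(values)
    mean = math.fsum(values) / n
    sd = math.sqrt(math.fsum((v - mean) ** 2 for v in values) / (n - 1))
    return mean, sd

def standardized_cross(a, b):
    """Sum of z_a * z_b, standardizing each value as it is read, so the
    standardized columns are never stored."""
    mean_a, sd_a = mean_sd(a)
    mean_b, sd_b = mean_sd(b)
    return math.fsum(
        ((p - mean_a) / sd_a) * ((q - mean_b) / sd_b) for p, q in zip(a, b)
    )
```

This trades memory for extra passes over the data: one pass for the means and standard deviations, and another for the cross products.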

I would be glad for further comments or advice on these issues.

Thomas

-------------------------------------------------------------------------

Thomas Cornelissen

Institute of Empirical Economic Research

University of Hannover, Germany

cornelissen@mbox.iqw.uni-hannover.de

