Statalist



Re: st: reading in huge dataset using "odbc"


From   "Joseph Coveney" <jcoveney@bigplanet.com>
To   "Statalist" <statalist@hsphsun2.harvard.edu>
Subject   Re: st: reading in huge dataset using "odbc"
Date   Mon, 14 Jul 2008 23:48:30 +0900

Natalie Chan wrote:

I am trying to read in a dataset with more than 200 variables and 1.2
million observations. Stata produced the message "obs must be
between 62593 and 62593. As a possible solution, try increasing memory
using -set memory-". I increased the memory, but it still did not work.

I then decided to read in the dataset in pieces and then append the
sub-datasets. Reading in the first 62,000 observations was successful.
But when I tried to read in observations 62001-124000, there was no
error message, yet when I checked the data, no observations had been
read in; only the row of variable names was there.

The commands I used are:

odbc load in 1/62000, table ("firms")

odbc load in 62001/124000, table ("firms")

Could anyone here give some advice? Thanks a lot.

--------------------------------------------------------------------------------

Have you made sure that the Access database table contains all 1.2 million
observations that you think that it does?  You mentioned earlier that the
dataset is 2.1 gigabytes.  Even the most recent version of Access (2007)
limits its tables as well as its database files to two gigabytes (and that
includes overhead, not just data), and limits its recordsets in queries to
one gigabyte.
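
One quick way to check the row count, assuming an ODBC data source name
(here "mydsn", a placeholder) is already set up for the database, is to ask
the driver directly with -odbc exec()-, which runs an arbitrary SQL
statement:

```stata
odbc exec("SELECT COUNT(*) FROM firms"), dsn("mydsn")
```

If that count comes back well short of 1.2 million, the problem is in the
database, not in Stata.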

Others have already suggested reading in only the variables that you need
for your analysis.  Do you need all 1.2 million observations, or will a
random (or stratified random) sample suffice to get sufficiently precise
estimates for your model?
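
If a sample will do, one approach (a sketch only; the variable names and
the 10 percent figure are arbitrary placeholders) is to pull just the key
and the analysis variables, then draw the sample in Stata:

```stata
set seed 12345
odbc load firmid sales assets, table("firms")
sample 10
```

Setting the seed first makes the draw reproducible.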

I calculate an average value of about 8.75 bytes per cell (2.1 gigabytes /
1.2 million rows / 200 columns), so the database appears to be set up to
use reasonably small
datatypes suitable; nevertheless, if many of the columns are VARCHAR() or
CHAR() and store names of categories, then you can do some data management
to convert them into encoded byte or integer datatypes with value labels
that will save space.  You can do this in Stata or in the database.  In
Stata, you'd pull the character columns along with the primary key, -encode-
the string variables, -sort- on the primary key variables, -save-, pull in
the noncharacter columns and primary key, -sort-, and -merge-.  In the
database, you'd normalize (200 columns?).  Reference tables (a.k.a. lookup
tables, translation tables) would have a TINYINT or INT as the primary key
that's referenced in a FOREIGN KEY declaration in the data table, and the
character categories with a UNIQUE constraint.
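
In Stata, that workflow might look something like the following sketch
(all variable, key, and file names are placeholders; adjust to your table):

```stata
* pull the string columns along with the primary key
odbc load firmid industry region, table("firms")
encode industry, generate(industry2)
encode region, generate(region2)
drop industry region
sort firmid
save strcols, replace

* pull the numeric columns with the primary key, then combine
odbc load firmid sales assets, table("firms")
sort firmid
merge firmid using strcols
```

After the -merge-, check _merge to confirm that every observation matched
before dropping it.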

If you need all 200 variables and all 1.2 million observations and they're
all as small a numeric datatype as possible, then I'm afraid that you're
going to need to install more physical memory (RAM, not a USB memory stick).

Joseph Coveney


*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


