Stata The Stata listserver

st: RE: still troubles in loading a big dataset in Stata---- Help Please!!!


From   "Jann, Ben" <ben.jann@soz.gess.ethz.ch>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: still troubles in loading a big dataset in Stata---- Help Please!!!
Date   Mon, 8 Sep 2003 10:53:35 +0200

Shqiponja wrote:
> I am trying to load a big dataset with approx. 790,000 observations
> and 170 variables in Stata. I am having trouble loading the data
> first. The data is an SPSS file. I saved the data as a tab-delimited
> file in SPSS and tried to open in Stata. However, got the following
> message:
>
> . insheet using c:\yellisall.dat
> no room to add more observations

and:
> Thank you for your suggestions, but I am still having troubles in
> loading the data.
>
> I started with an empty Stata and did increase the amount of memory
> allocated to the data and also the matsize, by using the following:
> 
> .set memory 726m
> .set matsize 200
> 
> but the programme was too slow, and the data could not be loaded
> at all.
> 
> Do you have any other suggestion? 

There have been quite a few replies, but I'm not sure they solve
Shqiponja's problem. I would first determine how big the dataset will be.
-insheet- will use float storage type by default. For your dataset
(790000 cases) this means approx. 3 MB per variable (assuming all
variables to be numeric) or approx. 512 MB for the whole dataset (170
variables). I guess this exceeds or comes close to the physical memory
of your computer. However, probably a lot of the variables in your
dataset could be stored more efficiently. For example, the size of the
dataset would be approx. 128 MB if all variables were byte.
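
The size estimate above can be reproduced with a few lines of arithmetic.
The following is only a sketch in Python (not Stata); the function name
dataset_size_mb is made up for illustration, and it assumes that all
variables are numeric and that 1 MB means 1024 * 1024 bytes. It ignores
any overhead Stata adds, so expect results within a few percent of the
figures quoted, depending on how a megabyte is counted.

```python
# Bytes per value for Stata's numeric storage types.
BYTES_PER_TYPE = {"byte": 1, "int": 2, "long": 4, "float": 4, "double": 8}


def dataset_size_mb(n_obs, n_vars, storage_type):
    """Approximate data size in MB (1 MB = 1024 * 1024 bytes).

    Hypothetical helper: assumes every variable uses the same numeric
    storage type and ignores any per-dataset overhead.
    """
    total_bytes = n_obs * n_vars * BYTES_PER_TYPE[storage_type]
    return total_bytes / (1024 * 1024)


# Compare the worst case (all float) with the best case (all byte).
for storage_type in ("float", "byte"):
    size = dataset_size_mb(790_000, 170, storage_type)
    print(f"{storage_type}: about {size:.0f} MB")
```

Comparing the result with the physical memory of the machine shows
whether reading everything as float is realistic at all.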

The problem is that -insheet- cannot assign different storage types to
different variables and that all cases have to be read at once. So I'd
propose to use -infile- (the only disadvantage of -infile- is that you
have to specify the variable names) and either read the data in pieces,
-compress- and -append- (1) or assign storage types while reading (2).

(1) Example:
 . infile varlist using yellisall.dat in 395001/790000 
   // assuming there are 790000 cases
 . compress
 . save piece2
 . infile varlist using yellisall.dat in 2/395000, clear
   // assuming the data start in record 2
 . compress
 . append using piece2
 . save yellisall
 . erase piece2.dta

(substitute 'varlist' with the list of variable names of your data)
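
Typing 170 variable names by hand is tedious. If the first record of the
tab-delimited file holds the variable names (an assumption; check the
file first), they can be pulled out with a small script. This is a sketch
in Python, and header_names is a made-up helper name. It only reads the
first line, so the size of the file does not matter. Names that are not
valid Stata variable names would still have to be fixed by hand.

```python
import csv


def header_names(path):
    """Return the tab-separated names found on the first line of path.

    Hypothetical helper: assumes the first record is a header row.
    """
    with open(path, newline="") as f:
        first_row = next(csv.reader(f, delimiter="\t"))
    return [name.strip() for name in first_row]


# Usage (file name taken from the example above):
#   print(" ".join(header_names("yellisall.dat")))
```

The printed line can be pasted in place of 'varlist' in the -infile-
commands.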

(2) Example:
 . infile byte(v1 v2 v3) float v4 int(v5 v6) ... using yellisall.dat
 . compress
 . save yellisall

A third approach would be to only read certain variables:

 . infile v1 v2 v3 _skip(4) v8 v9 ... using yellisall.dat

This, however, is only useful if you know that you definitely will not
need some of the variables.

Furthermore, make sure that yellisall.dat is okay. You will run into
problems, for example, if SPSS writes "" (nothing) for system-missing
values instead of "." (period), because -infile- then finds no value
to read at that position.
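
If the file does contain empty fields, one way to repair it before
reading it into Stata is to fill them with a period. The following is a
sketch in Python under stated assumptions: the file is tab-delimited with
one case per line, the function names and the output file name are made
up, and string variables containing blanks are not handled. The file is
processed line by line, so it does not have to fit into memory.

```python
def fill_empty_fields(line, missing="."):
    """Replace empty tab-separated fields in one line with `missing`."""
    fields = line.rstrip("\r\n").split("\t")
    return "\t".join(field if field.strip() else missing for field in fields)


def clean_file(src_path, dst_path):
    """Copy src_path to dst_path line by line, filling empty fields.

    Hypothetical helper: lines that are completely empty are dropped,
    since they contain no fields at all.
    """
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if line.rstrip("\r\n") == "":
                continue
            dst.write(fill_empty_fields(line) + "\n")


# Usage (output file name is only an example):
#   clean_file("yellisall.dat", "yellisall_clean.dat")
```

The cleaned file can then be used in the -infile- commands instead of
the original one.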

I hope this helps, ben

 



