Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

st: Solution to former question: Large dataset . Slow Stata


From   "Tobias Pfaff" <tobiaspfaff@uni-muenster.de>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: Solution to former question: Large dataset . Slow Stata
Date   Fri, 3 Aug 2007 09:57:14 +0200

> I am using Intercooled Stata 9.2 on a laptop with an AMD 1.79 GHz 
> processor and 512 MB ram. So far, Stata worked well with all datasets.
>
> Now, we are analyzing a larger dataset with 70 variables and 180,000 
> observations. The dta-file has 224 MB. It takes alone two minutes to 
> open the file, not to mention the processing time of simple operations 
> like
> -drop- or -replace-. Is that normal? I have tried -compress-, which 
> does not have any major impact on the file size.
>
> What is a PROFESSIONAL WAY to handle such a dataset?

---------------------------------------------------------------

OK, let me sum up how I finally approached the problem, after the helpful
hints given by Statalisters.

I had set my memory to 600m (-set mem 600m, permanently-), which was
apprently too big. According to Bill's explanation I set my memory to 300m
which worked better. -memory- might also give you some information about how
much memory Stata actually uses.

Svend mentioned http://www.stata.com/support/faqs/data/howbig.html. So I
knew, that my dataset with 70 number and integer variables should roughly be
around 100 MB (remember that it was 226 MB initially). -codebook- gives you
information about your variables and tells you, if string variables are set
too large (e.g. str255, but you never have a string longer than 30
characters). This problem can be easily solved with -compress- as it adapts
the length of each string variable to what is needed.

Finally, I used -destring- for string variables that contained only numbers,
and changed string variables that contained dates to date format (cf. code
below).

All in all, the size of my dataset shrinked from 226 MB to 106 MB! My laptop
is now able to handle the data with acceptable processing time.

Cheers,
Tobias


My code:
--------

set more off
clear

** IMPORT PRODUCT DATA

insheet using ../01_SQL/xsdv.txt, delimiter("|")

// Drop first and last line
drop in 1
drop in 188036

// Extract varnames
forvalues num = 1/66 {

	local varname = substr(v`num',1,3)
	rename v`num' F`varname'

	replace F`varname' = substr(F`varname',5,.) }

// Drop huge textfield(s)
drop F042 F065

// Reduce string lengths
compress

// Change to number format
destring F030 F031 F037 F049 F052 F053 F070, replace

// Remove dash for missing value
foreach list in 007 025 069 {
	replace F`list' = "" if F`list' == "-"
}

// Replace comma with point for decimal values and change to number format
foreach list in 015 015_2 018 018_2 026 027 028 029 043 045 046 047 048 050
059 060 068 {
	replace F`list' = subinstr(F`list',",",".",1)
	destring F`list', replace
}

// Change date variables to date format
foreach list in 017 020 021 022 023 056 057 058{
	gen F`list'_date = date(F`list', "dmy")
	format F`list'_date %dCY-N-D
	drop F`list'
	rename F`list'_date F`list'
}

// Order variables alphabetically
aorder

save masterdata_products.dta, replace

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index