Statalist



RE: st: Weights


From   "Martin Weiss" <martin.weiss@uni-tuebingen.de>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: Weights
Date   Wed, 30 Apr 2008 20:51:59 +0200

Alan,

thanks for the reply. Someone asked how the dataset was constructed: I read
the data into SPSS to have them labeled there, and then had a colleague of
mine use Stat/Transfer (which claims to optimize before writing the file) to
convert it to Stata format.
 
As for partitioning the file and compressing, I should have mentioned that I
had Stata do that overnight. My -forvalues- loop cut the file into blocks of
100,000 rows, compressed each block and saved it to hard disk. This
resulted in 29 files of 205 MB each and one smaller one, as the dataset has
2,911,000-odd rows. -compress- changed the storage type for roughly half of
the variables, yet the file size decreased by only 15% on average, with very
little variation across blocks. I appended the resulting files and, as you
can imagine, the combined file did not fit into my 4 GB (!) of memory,
either. So the difference from the SPSS file cannot be explained away by
an inefficient organization of the data.

Nor do I think that cutting the file into pieces along its columns would
change this result. I find appending easier to grasp than merging, but the
result should be equivalent. I am beginning to wonder why Stata insists on
holding the entire dataset in memory. A USB stick does not change the
equation, as it is almost as slow to respond to requests as the hard drive
itself...
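
For readers following along, the block-wise procedure described above might
look roughly like the sketch below. It is not the loop that was actually
run: the file names (master, chunk1, chunk2, ..., newmaster) are
hypothetical, and reading the row count from -describe using- is an
assumption about how the number of blocks was obtained.

```stata
* Sketch only: split a large .dta file into row blocks, compress each
* block, then append the blocks again. All file names are hypothetical.
* (Under Stata 10, -set memory- must first be large enough for one block.)

quietly describe using master        // leaves the row count in r(N)
local N = r(N)
local step = 100000                  // rows per block, as in the post
local nchunks = ceil(`N'/`step')

forvalues i = 1/`nchunks' {
    local first = (`i' - 1)*`step' + 1
    local last  = min(`i'*`step', `N')
    use in `first'/`last' using master, clear   // load this block only
    compress                                    // shrink storage types
    save chunk`i', replace
}

* Recombine: -append- promotes a variable's storage type where blocks differ
use chunk1, clear
forvalues i = 2/`nchunks' {
    append using chunk`i'
}
save newmaster, replace
```

Because -compress- chooses storage types block by block, the appended file
ends up with the widest type found for each variable across all blocks, so
the recombined file is no smaller than compressing the whole dataset at once
would make it.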

Martin Weiss
_________________________________________________________________

Diplom-Kaufmann Martin Weiss
Mohlstrasse 36
Room 415
72074 Tuebingen
Germany

Fon: 0049-7071-2978184

Home: http://www.wiwi.uni-tuebingen.de/cms/index.php?id=1130

Publications: http://www.wiwi.uni-tuebingen.de/cms/index.php?id=1131

SSRN: http://papers.ssrn.com/sol3/cf_dev/AbsByAuth.cfm?per_id=669945

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Alan Riley
Sent: Wednesday, April 30, 2008 6:22 PM
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Weights

A slight correction to my previous post (included in its entirety
below my signature):  The second -use- command I showed should be

   . use f g h i j using master

rather than

   . use d e f g h using master
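
With that correction applied, the whole sequence would read roughly as
follows. This is a sketch for the same hypothetical 'master.dta' with
variables 'a' through 'j'; the -clear- and -replace- options are additions
that were not in the original post and only make the lines safe to rerun.

```stata
* Sketch: compress a dataset in two column-wise parts, then recombine.
* master, part1, part2 and newmaster are illustrative file names.
use a b c d e using master, clear
compress
save part1, replace

use f g h i j using master, clear    // corrected variable list
compress
save part2, replace

use part1, clear
merge using part2                    // Stata 10: matches rows by position
drop _merge
save newmaster, replace
```

-merge- without match variables pairs observations by their position, so
both parts must keep the row order of 'master.dta'. In Stata 11 and later
the equivalent command is -merge 1:1 _n using part2-.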


--Alan Riley
(ariley@stata.com)


Alan Riley wrote:
> Martin Weiss has a dataset which started as a 2.4 GB csv file and has
> been converted to a 5.5 GB Stata .dta file.  He has a 64-bit computer
> with 4 GB of RAM, which isn't quite enough to read in this dataset as
> a whole:
> > if only I could open the file and compress it... I have the latest gear
> > in terms of hard- and software (MP/2 10.0 64 bit, 4GB RAM, Vista
> > Business 64 bit, ...) but it is next to impossible to open the 5.5 GB
> > file. Virtual mem makes things so slow it takes all the fun out of
> > it... So I am stuck in a bit of a quandary.
> 
> He wishes he could read it in just once and use Stata's -compress- command
> on it to store the variables more efficiently.  My guess is that all
> of the variables are stored as -float- or -double- when many could
> probably be stored as smaller types such as -byte- or -int-.
> 
> Austin Nichols made a couple of suggestions:
> > Can you put a 8GB memory stick on the computer--can't Vista treat
> > those as RAM?  How did you turn your 2.4 GB .csv file into a 5.5GB
> > Stata file, anyway?  Can you specify a different variable type in that
> > process, or save different sets of variables to different files (with
> > an identifier for later merging)? 
> 
> Austin's suggestion about saving different sets of variables to
> different files is exactly what I think Martin should do.
> 
> First, let me say that an 8 GB memory stick would not really help.
> Although this is "memory", it is not the same kind of memory that
> is used as RAM by a computer system.  These sticks are not much
> faster than hard drives when it comes to transferring large amounts
> of data, although they can 'find' files faster that are stored on
> them.
> 
> If Martin has a dataset named 'master.dta' with 10 variables named
> 'a b c d e f g h i j', he could execute the following in Stata to
> compress and recombine the entire file:
> 
>    . use a b c d e using master
>    . compress
>    . save part1
>    . use d e f g h using master
>    . compress
>    . save part2
>    . use part1
>    . merge using part2
>    . drop _merge
>    . save newmaster
> 
> Martin might need to do this in 3 or 4 parts, but hopefully after
> doing the above, he will be left with a new dataset which will
> fit entirely in the RAM on his computer.
> 
> 
> --Alan Riley
> (ariley@stata.com)
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



