Statalist



RE: st: Weights


From   "Martin Weiss" <martin.weiss@uni-tuebingen.de>
To   <statalist@hsphsun2.harvard.edu>
Subject   RE: st: Weights
Date   Wed, 30 Apr 2008 21:45:03 +0200

Wow, I am learning more than I thought I would in this thread... Just out of
sheer curiosity, why would a split along columns allow -compress- to work
better?

Martin Weiss
_________________________________________________________________

Diplom-Kaufmann Martin Weiss
Mohlstrasse 36
Room 415
72074 Tuebingen
Germany

Fon: 0049-7071-2978184

Home: http://www.wiwi.uni-tuebingen.de/cms/index.php?id=1130

Publications: http://www.wiwi.uni-tuebingen.de/cms/index.php?id=1131

SSRN: http://papers.ssrn.com/sol3/cf_dev/AbsByAuth.cfm?per_id=669945

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Austin Nichols
Sent: Wednesday, April 30, 2008 9:35 PM
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Weights

Martin--
As I see it, the fault for the file size is half yours and half
Stata's. You have pursued a rather odd strategy for constructing the
Stata file, and not taken full advantage of possible compression
because you break the file by rows (observations) rather than columns
(variables). On the other side, Stata has not provided an -insheet-
that can selectively insheet variables.
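[Editor's note: what such a selective reader would do can be sketched outside Stata; the snippet below uses Python's standard csv module purely as an illustration (the toy data and column names are invented, not from the thread). Only the requested columns are kept, so the rest never has to occupy memory as typed data.]

```python
import csv
import io

# Toy stand-in for a wide delimited file (hypothetical data)
raw = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")

# A selective reader: keep only the columns we asked for
want = {"a", "c"}
rows = [{k: v for k, v in row.items() if k in want}
        for row in csv.DictReader(raw)]
# rows == [{'a': '1', 'c': '3'}, {'a': '4', 'c': '6'}]
```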

If you follow my advice about splitting the file by variables (which
Alan Riley made explicit by example and improved on by using an
unmatched -merge-), you can achieve a substantially smaller file size,
I suspect, but you will want to use not only -compress- but also
-destring- and possibly -encode- as well.  E.g.

prog makesmall, sortpreserve
 syntax [varlist]
 * shrink storage types, then try converting string vars to numeric
 qui compress `varlist'
 cap destring `varlist', replace
 * collect the variables that are still string after -destring-
 ds `varlist', has(type string)
 loc str `r(varlist)'
 tempvar tmp
 foreach v of loc str {
  * count the distinct values of `v'
  bys `v': g `tmp'=_n==1
  qui count if `tmp'==1
  loc n=r(N)
  drop `tmp'
  * -encode- only if safely below the value-label limit of 65,536
  if `n'<65535 {
   ren `v' `tmp'
   encode `tmp', gen(`v')
   move `v' `tmp'
   drop `tmp'
  }
 }
end
use a b c d e using master
makesmall
save part1
use f g h i j using master
makesmall
save part2
use part1
merge using part2
drop _merge
save newmaster

(untested, but should work in principle).
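[Editor's note: the payoff from retyping can be sketched with simple storage arithmetic. The Python snippet below is illustrative only; the row count is taken from Martin's description of the dataset, and the assumption that a column holds byte-sized values is invented for the example.]

```python
import numpy as np

# ~2.9 million rows, as in Martin's dataset; suppose a column holds
# values that fit in a single byte but arrived stored as 8-byte floats
n = 2_911_000
as_double = np.zeros(n, dtype=np.float64)  # how the column arrived
as_byte = as_double.astype(np.int8)        # what -compress- would pick

mb = lambda a: a.nbytes / 1024**2
print(f"double: {mb(as_double):.1f} MB, byte: {mb(as_byte):.1f} MB")
# -> double: 22.2 MB, byte: 2.8 MB -- an eightfold saving per such column
```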

On Wed, Apr 30, 2008 at 2:51 PM, Martin Weiss
<martin.weiss@uni-tuebingen.de> wrote:
> Alan,
>
> thanks for the reply. Someone asked how the dataset was constructed; I read
> it into SPSS to have the data labeled there and then had a colleague of mine
> use Stat/Transfer (which claims to optimize before handing out the file) to
> go to Stata.
>
> As for partitioning the file and compressing, I should have mentioned that I
> had Stata do that overnight. What happened was that my -forvalues- loop cut
> the file every 100,000th row, compressed it and saved it to hard disk. This
> resulted in 29 files of 205 MB and one smaller one, as the dataset has
> 2,911,000-odd rows. -compress- changed the type of roughly half of the
> vars, yet the file size decreased by only 15% on average, with very little
> variation in the decrease. I appended the resulting files and, as you can
> imagine, the resulting file did not fit into my 4 GB (!) of memory, either. So
> the difference from the SPSS file cannot be explained away by unwieldy
> organization of the data.
> Nor do I think that cutting the file into pieces along its columns would
> affect this result. I find appending easier to grasp than merging, but the
> result should be equivalent. I am beginning to wonder why Stata insists on
> holding the entire dataset in memory. A USB stick does not change the
> equation, as it is almost as slow to respond to requests as the hard drive
> itself...
>
> Martin Weiss
>
> -----Original Message-----
> From: owner-statalist@hsphsun2.harvard.edu
> [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Alan Riley
> Sent: Wednesday, April 30, 2008 6:22 PM
> To: statalist@hsphsun2.harvard.edu
> Subject: Re: st: Weights
>
>
> A slight correction to my previous post (included in its entirety
> below my signature):  The second -use- command I showed should be
>
>   . use f g h i j using master
>
> rather than
>
>   . use d e f g h using master
>
>
> --Alan Riley
> (ariley@stata.com)
>
>
> Alan Riley wrote:
> > Martin Weiss has a dataset which started as a 2.4 GB csv file and has
> > been converted to a 5.5 GB Stata .dta file.  He has a 64-bit computer
> > with 4 GB of RAM, which isn't quite enough to read in this dataset as
> > a whole:
> > > if only I could open the file and compress it... I have the latest gear in
> > > terms of hard- and software (MP/2 10.0 64 bit, 4GB RAM, Vista Business 64
> > > bit, ...) but it is next to impossible to open the 5.5 GB file. Virtual mem
> > > makes things so slow it takes all the fun out of it... So I am stuck in a
> > > bit of a quandary.
> >
> > He wishes he could read it in just once and use Stata's -compress- command
> > on it to store the variables more efficiently.  My guess is that all
> > of the variables are stored as -float- or -double- when many could
> > probably be stored as smaller types such as -byte- or -int-.
> >
> > Austin Nichols made a couple of suggestions:
> > > Can you put a 8GB memory stick on the computer--can't Vista treat
> > > those as RAM?  How did you turn your 2.4 GB .csv file into a 5.5GB
> > > Stata file, anyway?  Can you specify a different variable type in that
> > > process, or save different sets of variables to different files (with
> > > an identifier for later merging)?
> >
> > Austin's suggestion about saving different sets of variables to
> > different files is exactly what I think Martin should do.
> >
> > First, let me say that an 8 GB memory stick would not really help.
> > Although this is "memory", it is not the same kind of memory that
> > is used as RAM by a computer system.  These sticks are not much
> > faster than hard drives when it comes to transferring large amounts
> > of data, although they can 'find' files that are stored on them
> > faster.
> >
> > If Martin has a dataset named 'master.dta' with 10 variables named
> > 'a b c d e f g h i j', he could execute the following in Stata to
> > compress and recombine the entire file:
> >
> >    . use a b c d e using master
> >    . compress
> >    . save part1
> >    . use d e f g h using master
> >    . compress
> >    . save part2
> >    . use part1
> >    . merge using part2
> >    . drop _merge
> >    . save newmaster
> >
> > Martin might need to do this in 3 or 4 parts, but hopefully after
> > doing the above, he will be left with a new dataset which will
> > fit entirely in the RAM on his computer.
> >
> >
> > --Alan Riley
> > (ariley@stata.com)
> *
> *   For searches and help try:
> *   http://www.stata.com/support/faqs/res/findit.html
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>



© Copyright 1996–2014 StataCorp LP