
From: "Martin Weiss" <martin.weiss@uni-tuebingen.de>
To: <statalist@hsphsun2.harvard.edu>
Subject: RE: st: Weights
Date: Wed, 30 Apr 2008 21:45:03 +0200

Wow, I am learning more than I thought I would in this thread...

Just out of sheer curiosity, why would a split along columns allow
-compress- to work better?

Martin Weiss
_________________________________________________________________

Diplom-Kaufmann Martin Weiss
Mohlstrasse 36
Room 415
72074 Tuebingen
Germany

Fon: 0049-7071-2978184

Home: http://www.wiwi.uni-tuebingen.de/cms/index.php?id=1130

Publications: http://www.wiwi.uni-tuebingen.de/cms/index.php?id=1131

SSRN: http://papers.ssrn.com/sol3/cf_dev/AbsByAuth.cfm?per_id=669945

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Austin Nichols
Sent: Wednesday, April 30, 2008 9:35 PM
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Weights

Martin--
As I see it, the fault for the file size is half yours and half
Stata's. You have pursued a rather odd strategy for constructing the
Stata file, and not taken full advantage of possible compression
because you break the file by rows (observations) rather than columns
(variables). On the other side, Stata has not provided an -insheet-
that can selectively insheet variables.

If you follow my advice about splitting the file by variables (which
Alan Riley made explicit by example and improved on by using an
unmatched -merge-), you can achieve a substantially smaller file size,
I suspect, but you will want to use not only -compress- but also
-destring- and possibly -encode- as well. E.g.

prog makesmall, sortpreserve
 syntax [varlist]
 qui compress `varlist'
 cap destring `varlist', replace
 ds `varlist', has(type string)
 loc str `r(varlist)'
 tempname tmp
 foreach v of loc str {
  bys `v': g `tmp'=_n==1
  qui count if `tmp'==1
  loc n=r(N)
  drop `tmp'
  if `n'<65535 {
   ren `v' `tmp'
   encode `tmp', gen(`v')
   move `v' `tmp'
   drop `tmp'
  }
 }
end
use a b c d e using master
makesmall
save part1
use f g h i j using master
makesmall
save part2
use part1
merge using part2
drop _merge
save newmaster

(untested, but should work in principle).

On Wed, Apr 30, 2008 at 2:51 PM, Martin Weiss
<martin.weiss@uni-tuebingen.de> wrote:
> Alan,
>
> thanks for the reply. Someone asked how the dataset was constructed; I read
> into SPSS to have the data labeled there and then had a colleague of mine
> use Stat/Transfer (which claims to optimize before handing out the file) to
> go to Stata.
>
> As for partitioning the file and compressing, I should have mentioned that I
> had Stata do that overnight. What happened was that my -forvalues- loop cut
> the file every 100,000th row, compressed it and saved it to hard disk. This
> resulted in 29 files of 205 MB and one smaller one as the dataset has
> 2,911,000 odd rows. -Compress- changed the type for roughly half of the
> vars, yet the filesize decreased by only 15% on average, with very little
> variation in the decrease. I appended the resulting files and, as you can
> imagine, the resulting file did not fit into my 4 GB (!) mem, either. So the
> difference to the SPSS file cannot be explained away by unwieldy
> organization of data.
> Nor do I think that cutting the file into pieces along its columns would
> impact this result. I find appending easier to grasp than merging, but the
> result should be equivalent. I am beginning to wonder why Stata insists on
> holding the entire dataset in mem. A USB-stick does not change the equation
> as it is almost as slow to respond to requests as the hard drive itself...
>
> Martin Weiss
>
> -----Original Message-----
> From: owner-statalist@hsphsun2.harvard.edu
> [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Alan Riley
> Sent: Wednesday, April 30, 2008 6:22 PM
> To: statalist@hsphsun2.harvard.edu
> Subject: Re: st: Weights
>
> A slight correction to my previous post (included in its entirety
> below my signature): The second -use- command I showed should be
>
>    . use f g h i j using master
>
> rather than
>
>    . use d e f g h using master
>
> --Alan Riley
> (ariley@stata.com)
>
> Alan Riley wrote:
> > Martin Weiss has a dataset which started as a 2.4 GB csv file and has
> > been converted to a 5.5 GB Stata .dta file. He has a 64-bit computer
> > with 4 GB of RAM, which isn't quite enough to read in this dataset as
> > a whole:
> > > if only I could open the file and compress it... I have the latest gear in
> > > terms of hard- and software (MP/2 10.0 64 bit, 4GB RAM, Vista Business 64
> > > bit, ...) but it is next to impossible to open the 5.5 GB file. Virtual mem
> > > makes things so slow it takes all the fun out of it... So I am stuck in a
> > > bit of a quandary.
> >
> > He wishes he could read it in just once and use Stata's -compress- command
> > on it to store the variables more efficiently. My guess is that all
> > of the variables are stored as -float- or -double- when many could
> > probably be stored as smaller types such as -byte- or -int-.
> >
> > Austin Nichols made a couple of suggestions:
> > > Can you put an 8GB memory stick on the computer--can't Vista treat
> > > those as RAM?
> > > How did you turn your 2.4 GB .csv file into a 5.5GB
> > > Stata file, anyway? Can you specify a different variable type in that
> > > process, or save different sets of variables to different files (with
> > > an identifier for later merging)?
> >
> > Austin's suggestion about saving different sets of variables to
> > different files is exactly what I think Martin should do.
> >
> > First, let me say that an 8 GB memory stick would not really help.
> > Although this is "memory", it is not the same kind of memory that
> > is used as RAM by a computer system. These sticks are not much
> > faster than hard drives when it comes to transferring large amounts
> > of data, although they can 'find' files faster that are stored on
> > them.
> >
> > If Martin has a dataset named 'master.dta' with 10 variables named
> > 'a b c d e f g h i j', he could execute the following in Stata to
> > compress and recombine the entire file:
> >
> >    . use a b c d e using master
> >    . compress
> >    . save part1
> >    . use d e f g h using master
> >    . compress
> >    . save part2
> >    . use part1
> >    . merge using part2
> >    . drop _merge
> >    . save newmaster
> >
> > Martin might need to do this in 3 or 4 parts, but hopefully after
> > doing the above, he will be left with a new dataset which will
> > fit entirely in the RAM on his computer.
> >
> > --Alan Riley
> > (ariley@stata.com)

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
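
For readers following the thread, here is a minimal sketch of the two building blocks discussed above: -compress- choosing a smaller storage type for each variable, and -use varlist using- reading only some columns from a saved file. The toy data and the variable names id, flag, and ratio are assumptions made up for illustration; they are not from Martin's file. The sketch is meant to be run as a do-file in one go, since it relies on a temporary file name held in a local macro.

```stata
* Toy data; variable names are hypothetical and not from the thread
clear
set obs 1000
generate double id    = _n          // integers 1..1000
generate double flag  = mod(_n, 2)  // only 0 and 1
generate double ratio = _n / 7      // mostly non-integer values

describe                            // all three variables are stored as double
compress                            // picks the smallest type that holds every value
describe                            // id is now int, flag is byte, ratio stays double

* Read only some columns back from disk, as in "use a b c d e using master"
tempfile master
save `master'
use id flag using `master', clear
describe                            // only id and flag are in memory
```

-compress- only changes storage types, never the stored values, and it decides variable by variable. That is why a file can be processed one group of columns at a time when the whole dataset does not fit in memory.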

**References**:

- **Re: st: Weights** *From:* "Austin Nichols" <austinnichols@gmail.com>
- **Re: st: Weights** *From:* Alan Riley <ariley@stata.com>
- **Re: st: Weights** *From:* Alan Riley <ariley@stata.com>
- **Re: st: Weights** *From:* "Austin Nichols" <austinnichols@gmail.com>
