Statalist


Re: st: Weights


From   "Eva Poen" <[email protected]>
To   [email protected]
Subject   Re: st: Weights
Date   Wed, 30 Apr 2008 20:47:18 +0100

Martin,

In addition to Austin's and Alan's excellent advice, you may also want
to check for empty variables (i.e. variables that have missing values
in all observations). They still occupy memory in Stata. You can
use -dropmiss-, by Nick Cox, to eliminate these.

The same goes for 'empty' observations, i.e. observations with missing
values in all variables. For some reason, I occasionally end up with
these when I import data into Stata. -dropmiss- can help in this case,
too.
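
If -dropmiss- is not installed, roughly the same cleanup can be sketched with built-in commands. This is a minimal, untested sketch and not how -dropmiss- itself is implemented; it assumes the dataset is already in memory, and the name anydata is only illustrative.

```stata
* Assumption: a dataset is already in memory; built-in commands only.

* 1. Drop variables that are missing in every observation.
*    missing() handles both numeric and string variables.
foreach v of varlist _all {
    quietly count if !missing(`v')
    if r(N) == 0 {
        drop `v'
    }
}

* 2. Drop observations that are missing in every remaining variable.
*    The variable list is stored first so the flag itself is not scanned.
unab vars : _all
tempvar anydata
generate byte `anydata' = 0
foreach v of local vars {
    quietly replace `anydata' = 1 if !missing(`v')
}
drop if `anydata' == 0
drop `anydata'
```

Running -describe- before and after shows how many variables and observations were removed.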

Eva


2008/4/30 Austin Nichols <[email protected]>:
> Martin--
>  As I see it, the fault for the file size is half yours and half
>  Stata's. You have pursued a rather odd strategy for constructing the
>  Stata file, and not taken full advantage of possible compression
>  because you break the file by rows (observations) rather than columns
>  (variables). On the other side, Stata has not provided an -insheet-
>  that can selectively insheet variables.
>
>  If you follow my advice about splitting the file by variables (which
>  Alan Riley made explicit by example and improved on by using an
>  unmatched -merge-), you can achieve a substantially smaller file size,
>  I suspect, but you will want to use not only -compress- but also
>  -destring- and possibly -encode- as well.  E.g.
>
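>  prog makesmall, sortpreserve
>   * shrink storage: compress, destring, then encode remaining strings
>   syntax [varlist]
>   qui compress `varlist'
>   cap destring `varlist', replace
>   ds `varlist', has(type string)
>   loc str `r(varlist)'
>   tempvar tmp
>   foreach v of loc str {
>    * count the distinct values of `v'
>    bys `v': g `tmp'=_n==1
>    qui count if `tmp'==1
>    loc n=r(N)
>    drop `tmp'
>    * encode only when the number of distinct values is small enough
>    if `n'<65535 {
>     ren `v' `tmp'
>     encode `tmp', gen(`v')
>     * put the encoded variable back in the original position
>     move `v' `tmp'
>     drop `tmp'
>    }
>   }
>  end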
>  use a b c d e using master
>  makesmall
>  save part1
>  use f g h i j using master
>  makesmall
>  save part2
>  use part1
>  merge using part2
>  drop _merge
>  save newmaster
>
>  (untested, but should work in principle).
>
>  On Wed, Apr 30, 2008 at 2:51 PM, Martin Weiss
>  <[email protected]> wrote:
>  > Alan,
>  >
>  > thanks for the reply. Someone asked how the dataset was constructed: I read
>  > the data into SPSS to have them labeled there, and then had a colleague of
>  > mine use Stat/Transfer (which claims to optimize before writing out the
>  > file) to convert to Stata.
>  >
>  > As for partitioning the file and compressing, I should have mentioned that I
>  > had Stata do that overnight. My -forvalues- loop cut the file every 100,000
>  > rows, compressed each piece and saved it to the hard disk. This resulted in
>  > 29 files of 205 MB and one smaller one, as the dataset has 2,911,000-odd
>  > rows. -compress- changed the type for roughly half of the variables, yet the
>  > file size decreased by only 15% on average, with very little variation in
>  > the decrease. I appended the resulting files and, as you can imagine, the
>  > resulting file did not fit into my 4 GB (!) of memory, either. So the
>  > difference from the SPSS file cannot be explained away by unwieldy
>  > organization of the data.
>  > Nor do I think that cutting the file into pieces along its columns would
>  > change this result. I find appending easier to grasp than merging, but the
>  > result should be equivalent. I am beginning to wonder why Stata insists on
>  > holding the entire dataset in memory. A USB stick does not change the
>  > equation, as it is almost as slow to respond to requests as the hard drive
>  > itself...
>  >
>  > Martin Weiss
>  > _________________________________________________________________
>  >
>  > Diplom-Kaufmann Martin Weiss
>  > Mohlstrasse 36
>  > Room 415
>  > 72074 Tuebingen
>  > Germany
>  >
>  > Fon: 0049-7071-2978184
>  >
>  > Home: http://www.wiwi.uni-tuebingen.de/cms/index.php?id=1130
>  >
>  > Publications: http://www.wiwi.uni-tuebingen.de/cms/index.php?id=1131
>  >
>  > SSRN: http://papers.ssrn.com/sol3/cf_dev/AbsByAuth.cfm?per_id=669945
>  >