
Re: st: Weights

From   "Joseph Coveney" <[email protected]>
To   "Statalist" <[email protected]>
Subject   Re: st: Weights
Date   Fri, 2 May 2008 01:31:55 +0900

Martin Weiss wrote [excerpted from three posts]:

My dataset features a size of 2.4 GB as .csv. When I translate this into
SPSS, it ends up with 2.7 GB while the equivalent Stata dataset has 5.5 GB

. . . the dataset has 2,911,000 odd rows.

. . . I cannot check for every var as there are over 600 of them


I don't get it:

. di 2.4e9 / (2911000 * 600)

Did I get the math right?  If so, then that's only about 1.37 bytes per
cell--with a comma-separated value ASCII file, at least one of those bytes
goes to the comma alone, which leaves only about a third of a byte, on
average, for the data themselves.

I understand that you aren't allowed to talk about the dataset in any
detail, so Statalist is limited in what it can do to help you, but something
about the dataset just doesn't add up.

Arithmetic aside, about the only circumstance I can think of in which a
Stata dataset file on disk, after -compress- and -recast-, would be twice
the size of the corresponding SPSS dataset file is the one Sergiy mentions:
a dataset with many wildly variably sized string variables, where SPSS can
save space by using a more efficient SQL-like VARCHAR() data type while
Stata stores each string variable at a fixed length.  If that's the case
here, and if you're going to summarize those variables with, say,
-tabulate-, then pull the columns in individually or in small groups.  You
could also use -encode-, as Austin mentions, to reduce each of them to a
value-labeled numeric variable--one byte (up to ca. 100 distinct values),
two bytes (up to ca. 32,000 distinct values) or four bytes (unique)--again
working on individual columns or small groups of columns at a time (if you
need to do cross-tabulations, for example).

Joseph Coveney

