Statalist



RE: st: File sizes in Stata & SPSS (was Weights )


From   "Martin Weiss" <[email protected]>
To   <[email protected]>
Subject   RE: st: File sizes in Stata & SPSS (was Weights )
Date   Fri, 2 May 2008 16:53:20 +0200

Well,

interesting thoughts. Maybe I am overzealous in wanting the whole dataset in
memory, but I have a hunch that this should be possible in the latest
version of an otherwise perfect statistical program. I have never touched
the outer limits of the capabilities of hard- and software, so this is a new
situation for me. Now that I have limited my research to the right tail of
the income distribution (which begins at two times average income), the file
has shrunk to 1.9 GB, which fits comfortably into memory without touching
virtual memory.
As advice to those who are bitten by the same problem: I find -describe using-
particularly helpful, as it lets you peek into the contents of a file
without actually opening it. Also bear in mind that -db use_option- lets you
select cases and/or variables before you actually open the file.
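
A minimal sketch of that workflow from the command line. The file name
bigfile, the variable names id and income, and the cutoff of 100000 are
placeholders, not taken from the actual data:

```stata
* List the variables and storage types in the file without loading it
describe using bigfile

* Load only the variables and observations that are needed
* (id, income and the cutoff 100000 are hypothetical)
use id income if income > 100000 using bigfile, clear
```

The second command is the typed equivalent of what the dialog builds: a
variable list and an -if- condition placed before -using-.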

Martin Weiss
_________________________________________________________________

Diplom-Kaufmann Martin Weiss
Mohlstrasse 36
Room 415
72074 Tuebingen
Germany

Fon: 0049-7071-2978184

Home: http://www.wiwi.uni-tuebingen.de/cms/index.php?id=1130

Publications: http://www.wiwi.uni-tuebingen.de/cms/index.php?id=1131

SSRN: http://papers.ssrn.com/sol3/cf_dev/AbsByAuth.cfm?per_id=669945

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Paul Seed
Sent: Friday, May 02, 2008 4:04 PM
To: [email protected]
Subject: RE: st: File sizes in Stata & SPSS (was Weights )

Dear Statalist, 

Martin Weiss <[email protected]> has been asking for help in
handling an extremely long file that seems to gain size when converted from
CSV to Stata, but not to SPSS.  For reasons of confidentiality, he cannot
tell us what is in it; but some comments suggest that the problem might
relate to variable length strings.  For instance, there might be a comment
field that is generally blank, but in a few cases contains a long &
extremely detailed response.  (A hymn of praise or a bitter complaint,
perhaps).

As Stata allocates each string variable a fixed length, set by its longest
value, a mostly blank variable carries a lot of unused space. As SPSS can
store strings of variable length, it avoids this overhead.

To check this out, I wrote a script that produces 3 files: example1 contains
a string of 30 characters that is always full; example2 contains a similar
string that is blank except in the first record (similar to Martin's file as
imagined); example3 encodes the string in example2. After saving the files,
I copied them to SPSS using Stat/Transfer, and then checked the file sizes.

In examples 1 & 3, Stata gives smaller files.  Only in example 2 does SPSS
"win".  

In this case, there is no loss of information due to encoding, as the
maximum length of the string is less than 244 characters. If Martin Weiss
has strings longer than this, and cares about the details contained beyond
character 244, he is perhaps involved in qualitative analysis for which
neither SPSS nor Stata is very useful.
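
One way to see how close a string variable comes to that limit is to
tabulate its lengths. This is only a sketch; "comment" is a hypothetical
variable name standing in for whatever the real field is called:

```stata
* "comment" is a hypothetical string variable
generate int len = length(comment)

* Distribution of string lengths; the maximum determines the str# width
summarize len

* Records at the 244-character ceiling may have been truncated on import
count if len >= 244
```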



The code is below.

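* Example 1: a 30-character string that is filled in every record
clear
set obs 30000
gen n = _n
gen string = "123456789012345678901234567890"
compress
memory
save example1, replace

* Example 2: the same string, blank everywhere except the first record
* (the variable stays str30, because one record still needs 30 characters)
replace string = "" if _n > 1
compress
memory
save example2, replace

* Example 3: the string from example 2, encoded as a labelled numeric variable
encode string, gen(string_)
drop string

compress
memory
save example3, replace

* Copy files to SPSS before continuing
pause on
pause

* Compare the sizes of the Stata and SPSS files
dir example*.*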

Paul

Paul Seed, Senior Lecturer in Medical Statistics
KCL School of Medicine, Division of Reproduction and Endocrinology
tel: (+44) (0) 20 7188 3642





*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

