



Re: st: Store datafile at minimum possible file size


From   Henrik Stovring <stovring@BIOSTAT.AU.DK>
To   <statalist@hsphsun2.harvard.edu>
Subject   Re: st: Store datafile at minimum possible file size
Date   Fri, 16 Apr 2010 19:32:29 +0200

Martin,

The zipfile command works on data files that have already been saved (as
far as I can tell), while my commands are basically equivalents of the
save, use, and merge commands. In other words, if you use my commands, you
bypass the step of having the actual .dta dataset residing on your disk,
as only a .dta.zip file is created in your directory of choice. The
compression itself is done on a dataset that is "automagically" stored in
your Stata session's temporary directory (/tmp, for example, on a Linux
machine), and Stata removes this dataset without further ado when your
Stata session ends.
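
A minimal sketch of the two workflows described above. The file name mydata is hypothetical. The -zipsave- and -zipuse- calls assume the user-written package is installed from SSC and that the commands accept a filename plus a replace or clear option in the same way -save- and -use- do; the package help file is the place to check the exact options.

```stata
* Official route: the .dta file must exist on disk before it can be zipped
sysuse auto, clear
save mydata.dta, replace
zipfile mydata.dta, saving(mydata.zip, replace)
erase mydata.dta                      // remove the uncompressed copy by hand

* Getting the data back: unzip first, then load
unzipfile mydata.zip, replace
use mydata.dta, clear

* zipsave route (assumption: installed via "ssc install zipsave")
sysuse auto, clear
zipsave mydata, replace               // assumed syntax, mirroring -save-
zipuse mydata, clear                  // assumed syntax, mirroring -use-
```

With the first route the uncompressed file sits in the working directory until it is erased; with the second, only the compressed file is expected to appear there.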

Best,

Henrik

Martin Weiss wrote:
> 
> Henrik,
> 
> how does your package compare to the now official -zipfile- command?
> 
> 
> HTH
> Martin
> 
> 
> -----Original Message-----
> From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Henrik Stovring
> Sent: Friday, 16 April 2010 19:15
> To: statalist@hsphsun2.harvard.edu
> Subject: Re: st: Store datafile at minimum possible file size
> 
> Please excuse me for advertising a package I wrote myself, but you may
> find the -zipsave- package useful, as it includes a -zipuse- and a
> -zipmerge- command that make the zip files more readily accessible.
> 
> Best,
> 
> Henrik
> 
> Michael Boehm wrote:
>> Thanks again, both of these suggestions sound like I could make
>> profitable use of them :)
>>
>> Michael
>>
>> On Fri, Apr 16, 2010 at 2:55 PM, Pavlos C. Symeou <p.symeou@lmu.de> wrote:
>>> Well, I just had to try this myself, and the result surprised me. I had
>>> an enormous dataset of 14.5G consisting of 600 string variables and more
>>> than 35000 observations. Exporting the dataset to tab-separated format
>>> resulted in a file of about 800M. Compressing that to the Zip format
>>> resulted in a file of a bit less than 18M. That is an amazing difference.
>>> However, the problem remains, at least in my case, when the time for
>>> analysis comes: I still have to convert the compressed file back to the
>>> .dta format, and then I am back to the 14.5G. At least I can save all my
>>> files on a single memory stick :)
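
A minimal sketch of that round trip, assuming a Stata version that still ships -outsheet- and -insheet-. The file names are hypothetical, and the auto dataset stands in for the real data.

```stata
* Export to tab-separated text (the -outsheet- default), then zip the text file
sysuse auto, clear
outsheet using mydata.txt, replace
zipfile mydata.txt, saving(mydata_txt.zip, replace)
erase mydata.txt

* When analysis time comes, the steps have to be reversed
unzipfile mydata_txt.zip, replace
insheet using mydata.txt, tab names clear
```

Note that variable labels, value labels, and display formats do not survive the trip through plain text.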
>>>
>>> Cheers,
>>>
>>> Pavlos
>>>
>>> On 16/04/2010 14:49, Stefan.Gawrich@hlpug.hessen.de wrote:
>>>> -zipfile- has already been mentioned.
>>>>
>>>> Inside Stata you can use -encode- to change a string variable to a
>>>> numeric one with value labels.
>>>> If the data contain many repeated strings, this can shrink the file to
>>>> a small fraction of its size.
>>>> With -decode- you can always go back.
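
A minimal sketch of the -encode- and -decode- round trip on made-up data. The variable names and the three region strings are hypothetical; the -compress- call is an addition not mentioned in the post, included because -encode- creates a long variable that can usually be stored in a smaller type.

```stata
* Build 10,000 observations with only three distinct strings
clear
set obs 10000
generate str20 region = cond(mod(_n, 3) == 0, "North America", ///
    cond(mod(_n, 3) == 1, "Europe", "Asia"))

* Replace the string with integer codes plus a value label
encode region, generate(region_code)
drop region
compress                              // shrink region_code to the smallest type
describe

* Recover the original strings whenever they are needed
decode region_code, generate(region)
```

The saving comes from storing each distinct string once in the value label instead of once per observation, so it only pays off when values repeat.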
>>>>
>>>> ***
>>>>
>>>> You can even output the encoded file to ASCII and restore the value labels
>>>> in other software with a script or a dictionary file, if the small file
>>>> size is worth the extra effort.
>>>> A few times I have used Stata to create such a dictionary or script
>>>> (e.g. in SQL).
>>>>
>>>>
>>>> If all commands have the same structure (as is often the case with SQL
>>>> -update- or -insert- scripts),
>>>> you can use Stata's data window to "write" it. Some hints on how to do this:
>>>>
>>>> You must do this separately for every var you want to process in this way:
>>>>
>>>> First -levelsof- hands the levels to a local. Do a -foreach- loop over
>>>> this local.
>>>> Extended macro function -label- stores the value labels created by
>>>> -encode- in locals.
>>>> The local names should contain the level number (like "loc123") so you can
>>>> refer to it later.
>>>>
>>>> Now you can use -duplicates drop- to keep only the unique levels of
>>>> this var.
>>>> Delete all other vars and write commands as constant string vars.
>>>> Loop over levels to insert the fitting local values (value label strings)
>>>> to the numeric values.
>>>> Use -order- to put all parts of the commands into the right place.
>>>>
>>>> Copy and paste the data editor to a text editor and you have a script.
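
A minimal sketch of the -levelsof- and extended macro function -label- steps above. It deviates from the recipe in one respect: the statements are written straight to a text file with -file write- instead of being assembled as string variables and copied out of the Data Editor. The table city_labels, its columns, and the file name are hypothetical.

```stata
* Small example dataset with a repeated string
clear
input str10 city
"Aarhus"
"Munich"
"Aarhus"
"Boston"
end
encode city, generate(city_code)

* Write one INSERT statement per level of the encoded variable
tempname fh
file open `fh' using "city_labels.sql", write text replace
levelsof city_code, local(levels)
foreach l of local levels {
    * look up the value label that -encode- attached to this level
    local lbl : label (city_code) `l'
    file write `fh' "INSERT INTO city_labels (code, label) VALUES (`l', '`lbl'');" _n
}
file close `fh'
type city_labels.sql
```

Labels that themselves contain an apostrophe would need escaping before the output is valid SQL.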
>>>>
>>>> Stefan
>>>>
>>>>
>>>> *
>>>> *   For searches and help try:
>>>> *   http://www.stata.com/help.cgi?search
>>>> *   http://www.stata.com/support/statalist/faq
>>>> *   http://www.ats.ucla.edu/stat/stata/
>>>>
>>>
>>
> 

-- 
Henrik Støvring			Department of Biostatistics
Associate professor            	University of Aarhus
stovring@biostat.au.dk     	Bartholins Allé 2, Bldg 1261, 217
Phone +45 8942 6131            	8000 Aarhus
Fax +45 8942 6140              	Denmark

