RE: st: File sizes in Stata & SPSS (was Weights )

From   "Steichen, Thomas J."
Subject   RE: st: File sizes in Stata & SPSS (was Weights )
Date   Fri, 2 May 2008 12:41:42 -0400


It's called the label() option of encode.

The encode help file reads (in part):

label(name) specifies the name of the value label to be created or used and added
        to if the named value label already exists.  If label() is not specified,
        encode uses the same name for the label as it does for the new variable.
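
A minimal sketch of how this might be used to keep codes consistent from file to file. The dataset names master and subset1, the string variable city, and the label name citylbl are all hypothetical, not taken from this thread. The idea is to encode once on the full file, save the label definition, and encode every other file against that same definition:

```stata
* Hypothetical names: master.dta, subset1.dta, string variable city
use master, clear
encode city, generate(city_id) label(citylbl)
* write the label definition to a do-file so other files can reuse it
label save citylbl using citylbl.do, replace

use subset1, clear
* load the saved definition, then encode against it
do citylbl.do
* noextend makes encode stop rather than add codes for unseen strings
encode city, generate(city_id) label(citylbl) noextend
```

The trade-off is that each file encoded this way carries the whole label definition, which is the size concern raised earlier in the thread.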


Thomas J. Steichen

-----Original Message-----
From: [] On Behalf Of Lachenbruch, Peter
Sent: Friday, May 02, 2008 11:36 AM
Subject: RE: st: File sizes in Stata & SPSS (was Weights )

If you decode and then encode again to get small files, your encoded
values may not be the same from data set to data set.  Perhaps one way
to do this is to modify the label definitions (potentially a real pain
in the neck).  Maybe someone brighter than me can come up with a simple
do file for this:  detect the unique values and retain the label
definitions for them.
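
One possible sketch of such a do-file, offered as an illustration rather than a tested solution. It assumes a numeric variable with a value label attached (city_id is a hypothetical name), that the number of distinct codes is small enough for levelsof to hold them in a macro, and that the label name has room for the suffix _used. Labels for missing values are not carried over, because levelsof skips missing values by default:

```stata
* Assumes city_id (hypothetical name) is numeric with a value label attached
local lbl : value label city_id
quietly levelsof city_id, local(used)
foreach v of local used {
    * look up the existing text for this code
    local txt : label `lbl' `v'
    * copy the code and its text into a new, smaller label
    label define `lbl'_used `v' `"`txt'"', modify
}
* attach the trimmed label and discard the full one
label values city_id `lbl'_used
label drop `lbl'
```

The codes themselves are unchanged, so they stay comparable across data sets. Note that label drop also removes the full label from any other variable that shares it.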


Peter A. Lachenbruch
Department of Public Health
Oregon State University
Corvallis, OR 97330
Phone: 541-737-3832
FAX: 541-737-4001

-----Original Message-----
[] On Behalf Of David Kantor
Sent: Friday, May 02, 2008 7:47 AM
Subject: RE: st: File sizes in Stata & SPSS (was Weights )

Hello all,

I just want to add some observations about encoding.

When you encode a string variable, the file contains a copy of every
distinct value. Consequently, it provides a space advantage usually
only if many of the values are repeated. If all or most observations
are distinct, then encoding will not gain a space advantage. (But you
may have other reasons for encoding.)
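
A rough way to check this on a given file is to save it both ways and compare the sizes. The sketch below uses the auto data shipped with Stata, where (as far as I recall) make is distinct in every observation, so it illustrates the unfavorable case; the macro names asstring and asencoded are just illustrative:

```stata
sysuse auto, clear
tempfile asstring asencoded
* size with make stored as a string
save "`asstring'"
* size with make encoded and the string dropped
encode make, generate(make_id)
drop make
compress
save "`asencoded'"
* checksum leaves the file length in r(filelen)
checksum "`asstring'"
display "string version:  " r(filelen) " bytes"
checksum "`asencoded'"
display "encoded version: " r(filelen) " bytes"
```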

But even when encoding is advantageous in terms of space, there is
one situation when it can backfire; I had not thought of this until it
happened to me. I had a large file with a string variable with many
distinct values -- though many were often repeated. I encoded it, and
gained a significant space savings.

Later, I created a multitude of smaller subsets of this file. Each
one had far fewer distinct values of the encoded variable. But each
file retained the full encoding table -- more than it needed. (Each
file replicated the encoding table.) The result was that each of the
small files was much bigger than it really needed to be. (And the
total size may have been much more than the original, even if there
had been no overlap of observations.) Subsequently, I decoded the
variable, and the files shrank significantly.
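
For reference, the decoding step might look like the sketch below. The names city_id, citylbl, and subset1_small are hypothetical. It assumes, as I understand Stata's behavior, that a value label stays in the dataset after its variable is dropped, so it has to be dropped explicitly before saving:

```stata
* Hypothetical names: encoded variable city_id with value label citylbl
decode city_id, generate(city)
drop city_id
* the label definition stays in the dataset until it is dropped explicitly
label drop citylbl
compress
save subset1_small
```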

I thought this was something to be aware of.
(It makes a potential case for having coding tables in a separate
file. But there are plenty of reasons not to have it that way.)


*   For searches and help try:

