
st: Unused value labels [was: File sizes in Stata & SPSS (was Weights)]

From   "Friedrich Huebler" <>
Subject   st: Unused value labels [was: File sizes in Stata & SPSS (was Weights)]
Date   Sat, 3 May 2008 11:04:58 -0400

David Kantor indirectly raised the question of how unused value
labels can be removed from a dataset to reduce its size. Here is a
solution with -labelsof- from SSC.

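* labelsof is user-written; install it with: ssc install labelsof
sysuse auto
encode make, gen(make2)
drop if _n>5
* Collect every value that has a label defined in make2
labelsof make2
local labels "`r(values)'"
foreach x of local labels {
  * Delete the mapping for any value no longer present in the data
  count if make2==`x'
  if r(N)==0 {
    lab def make2 `x' "", modify
  }
}
lab list make2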
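
If the numeric codes themselves do not need to be preserved, a
rough alternative is the decode route David describes, followed by a
fresh -encode-. This is only a sketch under that assumption: the
variable name make_str is made up for the example, and the new codes
are renumbered from 1, so they will generally differ from the old ones.

```stata
sysuse auto, clear
encode make, gen(make2)
drop if _n>5
* Turn the codes back into strings, then discard the old variable
decode make2, gen(make_str)
drop make2
* The value label stays in the dataset until it is dropped explicitly
label drop make2
* Re-encode: the new label holds only values still present in the data
encode make_str, gen(make2)
drop make_str
label list make2
```

The loop above keeps the original codes, so it is the safer choice
when other files or do-files refer to those numbers.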


On Fri, May 2, 2008 at 10:46 AM, David Kantor <> wrote:
> Hello all,
>  I just want to add some observations about encoding.
>  When you encode a string variable, the file contains a copy of every
> distinct value. Consequently, it provides a space advantage usually only if
> many of the values are repeated. If all or most observations are distinct,
> then encoding will not gain a space advantage. (But you may have other
> reasons for encoding.)
>  But even when encoding is advantageous in terms of space, there is one
> situation when it can backfire; I had not thought of this until it happened
> to me. I had a large file with a string variable with many distinct values
> -- though many were often repeated. I encoded it, and gained a significant
> space savings.
>  Later, I created a multitude of smaller subsets of this file. Each one had
> far fewer distinct values of the encoded variable. But each file retained
> the full encoding table -- more than it needed. (Each file replicated the
> encoding table.) The result was that each of the small files was much
> bigger than it really needed to be. (And the total size may have been much
> more than the original, even if there had been no overlap of observations.)
> Subsequently, I decoded the variable, and the files shrank significantly.
>  I thought this is something to be aware of.
>  (It makes a potential case for having coding tables in a separate file. But
> there are plenty of reasons not to have it that way.)
>  --David