Statalist



st: Unused value labels [was: File sizes in Stata & SPSS (was Weights)]


From   "Friedrich Huebler" <fhuebler@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   st: Unused value labels [was: File sizes in Stata & SPSS (was Weights)]
Date   Sat, 3 May 2008 11:04:58 -0400

David Kantor indirectly raised the question of how unused value
labels can be removed from a dataset to reduce its size. Here is a
solution that uses -labelsof- from SSC.

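* requires -labelsof-: ssc install labelsof
sysuse auto
encode make, gen(make2)
drop if _n>5
* collect all values defined in the value label make2
labelsof make2
local labels "`r(values)'"
foreach x of local labels {
  * delete the label for any value that no longer occurs in the data
  count if make2==`x'
  if r(N)==0 {
    lab def make2 `x' "", modify
  }
}
lab list make2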
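
A variant that avoids the SSC dependency might look like the sketch below. It is only a sketch under two assumptions: the labeled values are consecutive integers (which holds right after -encode-), and every integer between the smallest and largest labeled value actually has a label. The local names lo, hi, used, and inuse are my own choices for illustration.

```stata
sysuse auto, clear
encode make, gen(make2)
drop if _n>5

* smallest and largest labeled value, stored before r() is overwritten
quietly label list make2
local lo = r(min)
local hi = r(max)

* values of make2 that still occur in the data
quietly levelsof make2, local(used)

forvalues v = `lo'/`hi' {
    * inuse is 1 if `v' is among the values still in the data, else 0
    local inuse : list v in used
    if !`inuse' {
        * an empty label with -modify- deletes the mapping for `v'
        label define make2 `v' "", modify
    }
}
label list make2
```

This makes one pass over the label range instead of one -count- per label, at the cost of the assumptions stated above.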

Friedrich

On Fri, May 2, 2008 at 10:46 AM, David Kantor <kantor.d@att.net> wrote:
> Hello all,
>
>  I just want to add some observations about encoding.
>
>  When you encode a string variable, the file contains a copy of every
> distinct value. Consequently, it provides a space advantage usually only if
> many of the values are repeated. If all or most observations are distinct,
> then encoding will not gain a space advantage. (But you may have other
> reasons for encoding.)
>
>  But even when encoding is advantageous in terms of space, there is one
> situation when it can backfire; I had not thought of this until it happened
> to me. I had a large file with a string variable with many distinct values
> -- though many were often repeated. I encoded it, and gained significant
> space savings.
>
>  Later, I created a multitude of smaller subsets of this file. Each one had
> far fewer distinct values of the encoded variable. But each file retained
> the full encoding table -- more than it needed. (Each file replicated the
> encoding table.) The result was that each of the small files was much
> bigger than it really needed to be. (And the total size may have been much
> more than the original, even if there had been no overlap of observations.)
> Subsequently, I decoded the variable, and the files shrank significantly.
>
>  I thought this was something to be aware of.
>  (It makes a potential case for having coding tables in a separate file. But
> there are plenty of reasons not to have it that way.)
>
>  --David
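
The effect David describes can be checked with a rough sketch like the one below. It saves a five-observation subset once with the encoded variable (and its full value label) and once with the decoded string, then compares file sizes using the r(filelen) result of -checksum-. The macro names encoded, decoded, size_enc, and size_dec are illustrative, and the actual byte counts will depend on the Stata version and file format.

```stata
sysuse auto, clear
encode make, gen(make2)
keep in 1/5

tempfile encoded decoded

* subset with the encoded variable and the full value label
save "`encoded'"
quietly checksum "`encoded'"
local size_enc = r(filelen)

* same subset with the variable decoded and the value label dropped
decode make2, gen(make3)
drop make2
label drop make2
save "`decoded'"
quietly checksum "`decoded'"
local size_dec = r(filelen)

display "encoded: `size_enc' bytes, decoded: `size_dec' bytes"
```

Note that dropping make2 alone is not enough: the value label stays in the dataset until -label drop- removes it.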


