
From |
"Friedrich Huebler" <fhuebler@gmail.com> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
st: Unused value labels [was: File sizes in Stata & SPSS (was Weights)] |

Date |
Sat, 3 May 2008 11:04:58 -0400 |

David Kantor indirectly brought up the question of how unused value
labels can be removed from a dataset to reduce its size. Here is a
solution with -labelsof- from SSC.

* build a small example: encode a string variable, then drop most rows
sysuse auto
encode make, gen(make2)
drop if _n>5
* -labelsof- returns every value defined in the label in r(values)
labelsof make2
local labels "`r(values)'"
foreach x of local labels {
  count if make2==`x'
  * value no longer occurs in the data: delete its label mapping
  if r(N)==0 {
    lab def make2 `x' "", modify
  }
}
lab list make2

Friedrich

On Fri, May 2, 2008 at 10:46 AM, David Kantor <kantor.d@att.net> wrote:
> Hello all,
>
> I just want to add some observations about encoding.
>
> When you encode a string variable, the file contains a copy of every
> distinct value. Consequently, it usually provides a space advantage only if
> many of the values are repeated. If all or most observations are distinct,
> then encoding will not gain a space advantage. (But you may have other
> reasons for encoding.)
>
> But even when encoding is advantageous in terms of space, there is one
> situation when it can backfire; I had not thought of this until it happened
> to me. I had a large file with a string variable with many distinct values
> -- though many were often repeated. I encoded it, and gained a significant
> space savings.
>
> Later, I created a multitude of smaller subsets of this file. Each one had
> far fewer distinct values of the encoded variable. But each file retained
> the full encoding table -- more than it needed. (Each file replicated the
> encoding table.) The result was that each of the small files was much
> bigger than it really needed to be. (And the total size may have been much
> more than the original, even if there had been no overlap of observations.)
> Subsequently, I decoded the variable, and the files shrank significantly.
>
> I thought this is something to be aware of.
> (It makes a potential case for having coding tables in a separate file. But
> there are plenty of reasons not to have it that way.)
>
> --David

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
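
For readers who cannot install -labelsof-, the same idea can probably be sketched with built-in commands only. This is not part of the original post. It assumes the label's values are integers (as they are after -encode-) and that -label list- leaves the smallest and largest labeled values in r(min) and r(max). The local names used, lo, hi and inuse are made up for the illustration.

```stata
* Sketch only: remove unused value-label mappings without -labelsof-
sysuse auto, clear
encode make, gen(make2)
drop if _n>5

* values of make2 that still occur in the data
quietly levelsof make2, local(used)

* smallest and largest value defined in the label (assumed integer range)
quietly label list make2
local lo = r(min)
local hi = r(max)

forvalues x = `lo'/`hi' {
  * inuse is 1 if `x' still occurs in the data, 0 otherwise
  local inuse : list x in used
  if !`inuse' {
    * delete the mapping for a value that no longer occurs
    label define make2 `x' "", modify
  }
}
label list make2
```

Using -levelsof- once avoids running -count- for every label value, which may matter when the label has many entries. If the label has gaps in its range, the loop also visits values that were never labeled; that case is untested here.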


© Copyright 1996–2016 StataCorp LP