
From: "Lachenbruch, Peter" <Peter.Lachenbruch@oregonstate.edu>
To: <statalist@hsphsun2.harvard.edu>
Subject: RE: st: File sizes in Stata & SPSS (was Weights )
Date: Fri, 2 May 2008 08:35:42 -0700

If you decode and then encode again to get small files, your encoded values may not be the same from data set to data set. Perhaps one way to do this is to modify the label definitions (potentially a real pain in the neck). Maybe someone brighter than me can come up with a simple do file for this: detect the unique values and retain the label definitions for them.

Tony

Peter A. Lachenbruch
Department of Public Health
Oregon State University
Corvallis, OR 97330
Phone: 541-737-3832
FAX: 541-737-4001

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of David Kantor
Sent: Friday, May 02, 2008 7:47 AM
To: statalist@hsphsun2.harvard.edu
Subject: RE: st: File sizes in Stata & SPSS (was Weights )

Hello all,

I just want to add some observations about encoding.

When you encode a string variable, the file contains a copy of every distinct value. Consequently, it usually provides a space advantage only if many of the values are repeated. If all or most observations are distinct, then encoding will not gain a space advantage. (But you may have other reasons for encoding.)

But even when encoding is advantageous in terms of space, there is one situation in which it can backfire; I had not thought of this until it happened to me.

I had a large file with a string variable with many distinct values -- though many were often repeated. I encoded it, and gained a significant space savings. Later, I created a multitude of smaller subsets of this file. Each one had far fewer distinct values of the encoded variable. But each file retained the full encoding table -- more than it needed. (Each file replicated the encoding table.) The result was that each of the small files was much bigger than it really needed to be. (And the total size may have been much more than the original, even if there had been no overlap of observations.) Subsequently, I decoded the variable, and the files shrank significantly.

I thought this was something to be aware of. (It makes a potential case for having coding tables in a separate file. But there are plenty of reasons not to have it that way.)

--David

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
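
The do file asked for at the top of this message (detect the unique values and retain the label definitions for them) might look something like the sketch below. It is only an illustration, not anything posted by either author: the variable and label names (city, citycode, citylbl, citylbl_trim) and the toy data are hypothetical, and it assumes the value label is attached to this one variable only, since the full label is dropped at the end. Because the numeric codes are left untouched, the encoded values stay the same from subset to subset, which is the concern with decoding and encoding again.

```stata
* Sketch only: city, citycode, citylbl and citylbl_trim are hypothetical names.
clear
input str12 city
"Corvallis"
"Portland"
"Salem"
"Portland"
"Eugene"
end

* Encode the string; the value label citylbl now holds every distinct value.
encode city, generate(citycode) label(citylbl)

* Simulate a small subset that uses only some of the codes.
keep if inlist(city, "Portland", "Salem")

* Detect the codes that are still in use in this subset.
quietly levelsof citycode, local(used)

* Copy the text for each used code into a new label,
* keeping the original numeric codes unchanged.
foreach k of local used {
    local txt : label citylbl `k'
    label define citylbl_trim `k' `"`txt'"', modify
}

* Attach the trimmed label, then drop the full one
* (assumes no other variable uses citylbl).
label values citycode citylbl_trim
label drop citylbl
label list citylbl_trim
```

Saving the subset after this step should store only the entries listed by the final label list command. If several variables share one label, the levelsof step would need to pool the codes from all of them before the full label is dropped.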

**Follow-Ups**:
- **RE: st: File sizes in Stata & SPSS (was Weights )** *From:* "Steichen, Thomas J." <SteichT@rjrt.com>

**References**:
- **RE: st: File sizes in Stata & SPSS (was Weights )** *From:* "Paul Seed" <paul.seed@kcl.ac.uk>
- **RE: st: File sizes in Stata & SPSS (was Weights )** *From:* David Kantor <kantor.d@att.net>

