st: Re: relational issues: values labels for string vars

From   Christopher F Baum <>
Subject   st: Re: relational issues: values labels for string vars
Date   Sat, 26 Oct 2002 07:34:57 -0400

--On Saturday, October 26, 2002 2:33 -0400 Richard wrote:

Unfortunately all my data is loaded with string variables for codes for
various diseases, hospital procedures, geographic codes, etc. and to put
those labels as part of the database would significantly enlarge the
database. I tried the encode route but this means that every database I
have with the same set of codes has a different set of encoded values.

This would seem to be a relational database definitional issue (something that occupies altogether too many of my brain cells these days). Say you take one comprehensive set of disease codes (string) and encode them, saving those two variables (the string and the arbitrary integer which has been assigned) to a new dataset. Now attach the long names to that integer as its value labels. You will then have a dataset with two variables: the codes, in string form, and the integer, whose value label carries the long name for that string. This dataset may be merged, using the string as the key, onto any other dataset, and you will end up with those two variables in any other dataset which has a disease variable.

Not quite the same as what you're requesting (which sounds reasonable) but it gives the flavor, I think, of providing the longer 'aliases' to your string codes. As Nick Cox said, this is really an issue of having 'short' and 'long' versions of the same variable, sort of like we could use 'NJC' or 'CFB' in one context and "Nicholas J. Cox" or "Christopher F Baum" in another.

You can tabulate the integer variable, and it will display its value label--which might be Waterhouse Friderichsen syndrome, or whatever.
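
A minimal sketch of the scheme described above, under stated assumptions: the file names (disease_master, disease_defs, hospital_visits) and variable names (dxcode, dxname, dxid, dxlbl) are hypothetical, and the integer is generated from the sort order rather than with -encode-, which is a substitution made here so that the long names can be attached directly. The -merge m:1- syntax is that of Stata 11 and later; older releases use a different form of -merge-.

```stata
* Build the definitions file once, from one comprehensive list of codes.
* Assumes disease_master.dta holds one row per code, with string
* variables dxcode (short code) and dxname (long name).
use disease_master, clear
sort dxcode
generate long dxid = _n            // arbitrary integer, fixed once saved

* Make each long name the value label of its integer.
forvalues i = 1/`=_N' {
    label define dxlbl `i' `"`=dxname[`i']'"', add
}
label values dxid dxlbl

* Keep only the string code and its labeled integer.
keep dxcode dxid
save disease_defs, replace

* Merge the definitions onto any dataset that has the string code.
use hospital_visits, clear
merge m:1 dxcode using disease_defs, keep(master match) nogenerate

* The table rows now show the long names rather than the codes.
tabulate dxid
```

Because every dataset is merged against the same disease_defs file, a given code always receives the same integer, which is the uniqueness the original poster was missing when running -encode- separately on each dataset.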

This is essentially Nick Winter's suggestion, I think (including his point that this will ensure the unique definition). But I would like to promote the understanding that thinking of these things as relational database issues (even though Stata is not an RDBMS) is often useful. Furthermore, in terms of your concern for storage space, you need not keep the merged version of the dataset -- just merge the definitions file on when you need to see the 'long names', or when you're producing tables that should have those names. There may be other contexts where you're just doing data manipulation or estimation and this added detail is unnecessary. If merging the files on demand is less time-consuming than permanently adding a huge amount to their size (which will cause them to be more slowly read in), that would be a good idea.
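
One way the on-demand merge might look, again as a sketch with hypothetical file and variable names (hospital_visits, disease_defs, dxcode, dxid) and the Stata 11+ -merge- syntax: -preserve- and -restore- bracket the merge so that the labeled integer exists only while the table is being produced.

```stata
use hospital_visits, clear

* Bring in the long names only for the duration of the table.
preserve
merge m:1 dxcode using disease_defs, keep(master match) nogenerate
tabulate dxid
restore

* After -restore- the data in memory are as they were before the merge,
* so estimation and data manipulation proceed without the extra variable.
```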

Erik went on to say that

This may be fine if possible, but for those of us who regularly work
with millions of observations, it is often not feasible. That it is not
possible to label string values is a shortcoming that should be fixed
in future releases.

which gets to my point about this being an RDBMS issue. If there are 20,000 diseases, then there are 20,000 long "value label" forms which must be stored. Whether you call them value labels or not does not matter. The issue is whether those long-form 'labels' are to be permanently stored on each of the million records. They **need not be** if they are defined as the value label of the integer variable defined above. The overhead, as Nick Winter indicated as well, is then the addition of one integer per case -- or 4 bytes for each of the million records -- plus the space needed to store up to 64K value labels for that variable.

I don't see this as much less convenient than having the value label directly attached to the original string variable -- which in this scheme is just another way of saying disease no. 1234, which now has a 'short name' like D820 and a 'long name' like Wiskott Aldrich syndrome.
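
The storage argument can be put in rough numbers. The million records, 20,000 diseases and 4 bytes per integer come from the discussion above; the 40-character average length of a long name is purely an assumption for illustration, as is the simplification that a string costs one byte per character.

```stata
* Back-of-envelope comparison; namelen is an assumed average, not a fact.
local N       = 1000000     // records
local ncodes  = 20000       // distinct diseases
local namelen = 40          // assumed bytes per long name

* Cost of one 4-byte integer on every record.
display "integer on every record:   " %14.0fc (`N' * 4) " bytes"

* Cost of carrying the long name itself on every record.
display "long name on every record: " %14.0fc (`N' * `namelen') " bytes"

* Cost of storing each long name once, as value-label text.
display "label text stored once:    " %14.0fc (`ncodes' * `namelen') " bytes"
```

Under these assumptions the integer plus a single copy of the label text is a small fraction of what the long names would cost if repeated on every record.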

