The Stata listserver

st: RE: Re: relational issues: values labels for string vars


From   "HealthMaps" <healthmaps@attbi.com>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: Re: relational issues: values labels for string vars
Date   Sat, 26 Oct 2002 22:06:38 -0700

SAS, SPSS, and S-Plus allow value labels for string variables. They also
allow value labels to be developed independently of the database being
labeled (e.g., Proc Format in SAS).

Stata does not, at least not without some (considerable) rigmarole.

Maybe Stata people will fix this.

At present the quickest solution seems to be creating a new
variable that holds the value label text as its value. This is not
database-wise efficient. And these labels are not easily reduced to short
strings; subtle disease distinctions are difficult to reduce to a few
characters.

It looks like there are 2 votes for an addition to Stata's considerable
capacity.

Richard Hoskins
WA State Dept of Health
Olympia, WA 98502
GMT -8



-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu]On Behalf Of Christopher F
Baum
Sent: Saturday, October 26, 2002 4:35 AM
To: statalist@hsphsun2.harvard.edu
Subject: st: Re: relational issues: values labels for string vars



--On Saturday, October 26, 2002 2:33 -0400 Richard wrote:

> Unfortunately all my data is loaded with string variables for codes for
> various diseases, hospital procedures, geographic codes, etc. and to put
> those labels as part of the database would significantly enlarge the
> database. I tried the encode route but this means that every database I
> have with the same set of codes has a different set of encoded values.

This would seem to be a relational database definitional issue (something
that occupies altogether too many of my brain cells these days). Say you
take one comprehensive set of disease codes (string) and encode them,
saving those two variables (the string and the arbitrary integer which has
been assigned) to a new dataset. Now make the value labels apply to that
integer. You will then have a dataset with two variables: the codes, in
string form, and the integer, which will then be the value label of that
string. This dataset may be merged, using the string, onto any other
dataset, and you will end up with those two variables in any other dataset
which has a disease variable. Not quite the same as what you're requesting
(which sounds reasonable) but it gives the flavor, I think, of providing
the longer 'aliases' to your string codes. As Nick Cox said, this is really
an issue of having 'short' and 'long' versions of the same variable, sort
of like we could use 'NJC' or 'CFB' in one context and "Nicholas J. Cox" or
"Christopher F Baum" in another.

You can tabulate the integer variable, and it will display its value
label--which might be Waterhouse Friderichsen syndrome, or whatever.
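
The lookup-dataset idea described above could be sketched roughly as follows. This is only an illustration in current Stata syntax (merge m:1 is newer than this thread). All file, variable and label names (dx_master, dx_lookup, hospital_records, dxcode, dxname, dxid, dxlbl) are hypothetical, and it assumes dx_master holds one row per code with the long description in the string variable dxname.

```stata
* Hypothetical names throughout: dx_master.dta, dx_lookup.dta,
* hospital_records.dta, dxcode, dxname, dxid, dxlbl.

* Step 1: build the definitions file once from the master list of codes.
use dx_master, clear
isid dxcode                         // stop if any code appears more than once
sort dxcode
generate long dxid = _n             // arbitrary integer, one per code
forvalues i = 1/`=_N' {
    local name = dxname[`i']
    label define dxlbl `i' `"`name'"', add   // long name becomes the value label
}
label values dxid dxlbl
keep dxcode dxid
save dx_lookup, replace

* Step 2: attach the integer to any dataset that carries the string code.
use hospital_records, clear
merge m:1 dxcode using dx_lookup, keepusing(dxid) keep(master match) nogenerate
tabulate dxid                       // rows display the long names
```

Because the integers are assigned once, from a single master list, every dataset merged against dx_lookup gets the same code-to-integer mapping, which is what running encode separately on each dataset does not give you.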

This is essentially Nick Winter's suggestion, I think (including his point
that this will ensure the unique definition). But I would like to promote
the understanding that thinking of these things as relational database
issues (even though Stata is not a RDBMS) is often useful. Furthermore, in
terms of your concern for storage space, you need not keep the merged
version of the dataset -- just merge the definitions file on when you need
to see the 'long names', or when you're producing tables that should have
those names. There may be other contexts where you're just doing data
manipulation or estimation and this added detail is unnecessary. If merging
the files on demand is less time-consuming than permanently adding a huge
amount to their size (which will cause them to be more slowly read in),
that would be a good idea.
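
One way to merge the definitions on only when a table is needed might look like the sketch below. It assumes a definitions file with a string code and a labeled integer already exists; the names hospital_records, dx_lookup, dxcode and dxid are hypothetical.

```stata
* Hypothetical names: hospital_records.dta, dx_lookup.dta, dxcode, dxid.
use hospital_records, clear
preserve                            // snapshot the data currently in memory
merge m:1 dxcode using dx_lookup, keepusing(dxid) keep(master match) nogenerate
tabulate dxid                       // table shows the long names
restore                             // return to the compact dataset without dxid
```

Note that preserve itself makes a temporary copy of the data, so with millions of observations it may be cheaper simply to drop dxid after the table is produced.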

Erik went on to say that

> This may be fine if possible, but for those of us who regularly work
> with millions of observations, it is often not feasible. That it is not
> possible to label string values is a shortcoming that should be fixed
> in future releases.

which gets to my point of this as an RDBMS issue. If there are 20,000
diseases, then there are 20,000 long "value label" forms which must be
stored. Whether you call them value labels or not does not matter. The
issue is whether those long-form 'labels' are to be permanently stored on
each of the million records. They **need not be** if they are defined as
the value label of the integer variable defined above. The overhead, as
Nick Winter indicated as well, is then the addition of one integer per case
-- or 4 bytes for each of the million records -- plus the space needed to
store up to 64K value labels for that variable. I don't see this as much
less convenient than having the value label directly attached to the
original string variable -- which in this scheme is just another way of
saying disease no. 1234, which now has a 'short name' like D820 and a 'long
name' like Wiskott Aldrich syndrome.
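
The storage argument can be checked with a little arithmetic. The figures below are illustrative assumptions, not taken from anyone's data: one million records, a 4-byte integer id, and, for comparison, a long name held as a 40-character string on every record.

```stata
* Assumed figures: 1,000,000 records, 4-byte integer, 40-character string.
display "integer id on every record:    " 1000000 * 4  / 1e6 " MB"
display "40-char string on every record: " 1000000 * 40 / 1e6 " MB"
```

Under those assumptions the integer adds 4 MB in total, against 40 MB for carrying the long name itself on each record; the label text is stored once per disease rather than once per record.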

Kit

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/