Stata: Data Analysis and Statistical Software



Re: Re: st: encode results in false match - merge/joinby

From	"Clyde Schechter" <[email protected]>
To	[email protected]
Subject	Re: Re: st: encode results in false match - merge/joinby
Date	Fri, 11 Feb 2011 09:04:45 -0800

"I wonder if this strange behavior of encoded variables
is limited only to 'join' or could it be an issue also in other
contexts (?)."

The primary question about merge/join has been answered by others.

The general point is that -encode- builds its numeric codes from the
levels of the string variable actually observed in the data set at hand
(1, 2, 3, ... in alphabetical order) and attaches a value label so the
result looks like the original strings.  That leads to the following
conclusion:

Using -encode- on the same variable in multiple data sets that will later
be combined (by any operation, e.g. -append-) is dangerous: the same
string can end up with different numeric codes in different data sets.
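
A minimal sketch of the problem, using made-up data (the variable name
region and its values are illustrative assumptions, not taken from the
original thread).  Each data set is encoded separately and then the two
are appended:

```stata
* Hypothetical example data; run as a do-file so the tempfile persists
clear
input str10 region
"North"
"South"
end
encode region, generate(region_n)    // North = 1, South = 2
tempfile first
save `first'

clear
input str10 region
"East"
"South"
end
encode region, generate(region_n)    // East = 1, South = 2

* The value label already in memory wins, so code 1 is labeled "East"
append using `first'
list region region_n, nolabel
```

After the -append-, both "North" and "East" carry the code 1, and any
tabulation of region_n will silently pool them.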

Having been bitten by this many times, I have now developed some
precautionary data management practices.

1.  There are certain types of variables that recur frequently in my work.
For many of these I have developed a standard encoding that I always use.
The code to create these standard value labels is immortalized in some
do-files that I routinely either -do-, -run- or -include- in my do-files
for creating data sets.  (I've even thought of including them in my profile.do,
but decided that was a bit much.)  These value labels cover all the
possible values these variables can take.  Whenever I -encode- one of
these variables, I always explicitly use the label() option with these
labels.
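
As a rough illustration of practice 1, assuming a standard label that
would normally live in a separate do-file (the label name sexlbl and its
categories are hypothetical).  The noextend option is my own addition
here: it makes -encode- stop with an error if the data contain a string
that the standard label does not cover.

```stata
clear
input str10 sex
"Male"
"Unknown"
end

* Normally brought in with -include- or -run- from a standard do-file;
* defined inline here so the example is self-contained.
* Note that -clear- drops value labels, so define them after loading data.
label define sexlbl 1 "Female" 2 "Male" 3 "Other" 4 "Unknown", replace

* Codes come from sexlbl, not from the values that happen to be present
encode sex, generate(sex_n) label(sexlbl) noextend
list sex sex_n, nolabel
```

Because the codes come from the predefined label, "Male" is 2 in every
data set encoded this way, whether or not "Female" appears in it.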

2.  In large projects that will involve multiple data sets with
overlapping variables not part of my "standard" list, whenever I use
-encode-, I routinely follow that up with a -label save- to immortalize
that particular encoding.  In later work with the same variable in other
data sets, before I -encode-, I -do-, -run-, or -include- the
corresponding labeling do-file, and then use the explicit label() option
in the -encode- command.  If -encode- finds new levels of the variable not
already in the label, it adds them to the label. And I follow that up
using -label save, replace- again so my labeler do-file remains
up-to-date.
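
A sketch of the round trip described in practice 2, with invented data
(the variable clinic, its values, and the file name clinic_label.do are
assumptions for illustration):

```stata
* First data set: encode, then save the label definition to a do-file
clear
input str12 clinic
"Eastside"
"Westside"
end
encode clinic, generate(clinic_n) label(clinic)   // Eastside = 1, Westside = 2
label save clinic using clinic_label.do, replace

* Later data set: reload the saved label before encoding
clear                                             // also drops value labels
input str12 clinic
"Northside"
"Westside"
end
run clinic_label.do
encode clinic, generate(clinic_n) label(clinic)   // Westside stays 2; Northside added as 3

* Save again so the labeler do-file picks up the new level
label save clinic using clinic_label.do, replace
list clinic clinic_n, nolabel
```

Without the -run- step, the second -encode- would have assigned
Northside = 1 and Westside = 2, which conflicts with the first data set.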

3.  With regard to #2, so I do not rely on my memory as to whether I have
previously developed a labeling for a variable, my practice for these
non-routine variables is to give the value label the same name as the
variable, and name the labeler do-file varname_label.do.  Then, when I
want to -encode- such a variable, I precede the -encode- with -capture run
varname_label.do-.  (In fact, I have a little .ado file that is a wrapper
for -encode- that handles all this for me.)
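
The wrapper itself is not shown in this post.  The following is only a
guess at what such a program might look like; the name myencode and its
syntax are hypothetical, and it assumes the labeler do-files sit in the
current working directory.

```stata
* Hypothetical wrapper: not the author's actual .ado file
capture program drop myencode
program define myencode
    version 11
    syntax varname(string), GENerate(name)

    * Load a previously saved label if one exists; ignore a missing file
    capture run `varlist'_label.do

    * Value label gets the same name as the string variable
    encode `varlist', generate(`generate') label(`varlist')

    * Keep the labeler do-file up to date with any new levels
    label save `varlist' using `varlist'_label.do, replace
end

* Usage with made-up data
clear
input str10 site
"Bronx"
"Queens"
end
myencode site, generate(site_n)
```

One caveat with this sketch: -capture- hides every error from the
labeler do-file, not only a missing file, so a corrupted label file
would go unnoticed.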

While these practices seem cumbersome, and can lead to a project directory
being a bit cluttered with little do-files that just generate labels,
adherence to them has saved me from some pretty nasty analysis errors that
are hard to root out otherwise.


Clyde Schechter, MA MD
Associate Professor of Family & Social Medicine

Please note new e-mail address: [email protected]
