Re: Re: st: encode results in false match - merge/joinby

From   "Clyde Schechter" <>
Subject   Re: Re: st: encode results in false match - merge/joinby
Date   Fri, 11 Feb 2011 09:04:45 -0800

"I wonder if this strange behavior of encoded variables
is limited only to 'join' or could it be an issue also in other
contexts (?)."

The primary question about merge/join has been answered by others.

The general observation is that -encode- produces a numeric variable based
on the levels of the string variable observed in the data set, labeled to
look like the original string variable. This leads to the following
conclusion:

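Using -encode- separately on the same variable in multiple data sets that
will later be combined (by any operation, e.g. -append-) is dangerous: the
same string can receive different numeric codes in different data sets, so
the codes no longer mean the same thing once the data are combined.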
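
A minimal sketch of the problem, using made-up data (the variable names
region and region_n and their values are hypothetical, chosen only for
illustration). By default -encode- numbers the observed strings in
alphabetical order, so the two files disagree about what each code means:

```stata
* Hypothetical data set 1: observed levels are "north" and "south"
clear
input str10 region
"north"
"south"
end
encode region, generate(region_n)    // north = 1, south = 2
tempfile first
save `first'

* Hypothetical data set 2: observed levels are "east" and "north"
clear
input str10 region
"east"
"north"
end
encode region, generate(region_n)    // east = 1, north = 2

* Combine: code 2 now stands for "south" in some rows and "north" in others
append using `first'
list region region_n, nolabel
```

After the -append-, the value label in memory is the one from the second
data set, so the "south" observation displays as "north" unless you list
the original string variable next to it.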

Having been bitten by this many times, I have now developed some
precautionary data management practices.

1.  There are certain types of variables that recur frequently in my work.
 For many of these I have developed a standard encoding that I always use.
 The code to create these standard value labels is immortalized in some
do-files that I routinely either -do-, -run- or -include- in my data-set
creation do files.  (I've even thought of including them in my,
but decided that was a bit much.)  These value labels cover all the
possible values these variables can take.  Whenever I -encode- one of
these variables, I always explicitly use the label() option with these
standard value labels.

2.  In large projects that will involve multiple data sets with
overlapping variables not part of my "standard" list, whenever I use
-encode-, I routinely follow that up with a -label save- to immortalize
that particular encoding.  In later work with the same variable in other
data sets, before I -encode-, I -do-, -run-, or -include- the
corresponding labeling do-file, and then use the explicit label() option
in the -encode- command.  If -encode- finds new levels of the variable not
already in the label, it adds them to the label. And I follow that up
using -label save, replace- again so my labeler do-file remains up to date.

3.  With regard to #2, so I do not rely on my memory as to whether I have
previously developed a labeling for a variable, my practice for these
non-routine variables is to give the value label the same name as the
variable, and name the labeler do-file  Then, when I
want to -encode- such a variable, I precede the -encode- with -capture run  (In fact, I have a little .ado file that is a wrapper
for -encode- that handles all this for me.)
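
The three practices above might look roughly like the following. This is
only a sketch under stated assumptions, not the wrapper mentioned above:
the variables answer and region, the labels yesno and region, and the file
name region_label.do are all hypothetical, and the -noextend- option in the
first part is an extra safeguard that is not part of the practices as
described.

```stata
* Sketch only. All variable, label and file names here are hypothetical.

* Practice 1: a standard label defined up front, then used explicitly
clear
input str3 answer
"yes"
"no"
"yes"
end
label define yesno 0 "no" 1 "yes"
* noextend makes -encode- refuse to run if it meets a value not in yesno
encode answer, generate(answer_n) label(yesno) noextend

* Practices 2 and 3: load a saved label if there is one, encode, save back
clear
input str10 region
"north"
"south"
end
capture run region_label.do      // does nothing if the file does not exist yet
encode region, generate(region_n) label(region)
label save region using region_label.do, replace

* A later data set repeats the same three steps
clear
input str10 region
"east"
"north"
end
capture run region_label.do      // "north" keeps the code it was given above
encode region, generate(region_n) label(region)
label save region using region_label.do, replace
label list region
```

Because the label is reloaded before every -encode-, a string that has been
seen before keeps its existing code, and a new string such as "east" is
added to the label with a code not already in use.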

While these practices seem cumbersome, and can lead to a project directory
being a bit cluttered with little do-files that just generate labels,
adherence to them has saved me from some pretty nasty analysis errors that
are hard to root out otherwise.

Clyde Schechter, MA MD
Associate Professor of Family & Social Medicine

