


Re: st: encode results in false match - merge/joinby


From   Eric Booth <[email protected]>
To   "<[email protected]>" <[email protected]>
Subject   Re: st: encode results in false match - merge/joinby
Date   Thu, 10 Feb 2011 21:32:22 +0000


On Feb 10, 2011, at 3:07 PM, joe j wrote:
> I wonder if this strange behavior of encoded variables
> is limited only to 'join' or could it be an issue also in other
> contexts (?). Thanks for any pointers.

This is expected behavior. -encode- creates a numeric version of your string variable and attaches value labels equal to the strings in the old variable.
Your -joinby- results look wrong (to you, not to Stata) only because you are looking at the value labels rather than the underlying values; -merge-, -joinby-, and similar commands match on the values, not the value labels.

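When you encode a string variable, Stata assigns the integers 1, 2, 3, ... to the distinct strings in sorted order (unless you use -encode-'s label() option to code against an existing value label). Each of your files has seven distinct codes, so code1 runs from 1 to 7 in both files and every observation finds a match.
Take a look at the values underlying the labels for your code1 variable by typing: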

ta code1
ta code1, nol
*or*
browse, nolabel
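
To see the mapping itself, -label list- prints each integer next to its label. A minimal sketch using three of the codes from the first file; by default -encode- names the value label after the new variable (code1 here):

```stata
clear
input id str5 code
1 "123J5"
2 "68741"
3 "297J5"
end
encode code, gen(code1)

* print the integer -> string mapping stored in value label code1
label list code1

* show the strings next to the raw integers that -merge-/-joinby- compare
list code code1, nolabel
```

Here 123J5, 297J5, and 68741 should come out as 1, 2, and 3, whatever order they were entered in.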


See -help encode- for more detail on what -encode- is doing to your string variables.
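
If an integer key is really needed, one way to keep it consistent across files is to number the codes after pooling both files, so that the same string always gets the same integer. A rough sketch, not the only approach: the names source, code_id, and f1 are made up for illustration, and only two codes per file are shown. Joining on the string variable code directly avoids the problem altogether.

```stata
* run as a do-file so the tempfile macro `f1' stays defined throughout
clear
input id str5 code
1 "123J5"
2 "68741"
end
gen byte source = 1          // hypothetical marker for the first file
tempfile f1
save `f1'

clear
input id str5 code
1 "243J5"
2 "68348"
end
gen byte source = 2          // hypothetical marker for the second file
append using `f1'

* one shared numbering: identical strings get identical integers
egen long code_id = group(code)
list source code code_id, sepby(source)
```

With these four distinct codes, code_id takes four distinct values, so splitting the data by source and joining on code_id would again produce no matches.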

-Eric
__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
[email protected]
Office: +979.845.6754


On Feb 10, 2011, at 3:07 PM, joe j wrote:

> I just wanted to highlight something I encountered while merging two
> data sets with encoded merge variables . The two tables in reality are
> a perfect non-match. This is also the case when I use the matching
> variable 'code' in the string format. But if I encode them and
> generate a variable 'code1' and use that for merging there is a
> perfect match. (Now, I don't remember why I encoded this
> variable-there must have been a reason but that was definitely not
> aimed at merge.)
> 
> Below is an example with two files being joined with string variable
> 'code' and encoded variable 'code1'--the latter results in a false
> perfect match. I wonder if this strange behavior of encoded variables
> is limited only to 'join' or could it be an issue also in other
> contexts (?). Thanks for any pointers.
> 
> clear
> input id str5 code
> 1 "123J5"
> 2 "68741"
> 3 "297J5"
> 4 "14856"
> 5 "AB234"
> 6 "25K45"
> 7 "12535"
> end
> encode code, gen(code1)
> sort code1
> save file1.dta, replace
> 
> clear
> input id str5 code
> 1 "243J5"
> 2 "68348"
> 3 "479H5"
> 4 "467G5"
> 5 "23TUB"
> 6 "TU501"
> 7 "32LK8"
> end
> encode code, gen(code1)
> 
> joinby code1 using file1.dta, unmatched(both) /*perfect match*/
> *joinby code using file1.dta, unmatched(both) /*perfect non-match*/
> 
> ta _m
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/




