Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: encode results in false match - merge/joinby


From   Eric Booth <[email protected]>
To   "<[email protected]>" <[email protected]>
Subject   Re: st: encode results in false match - merge/joinby
Date   Thu, 10 Feb 2011 21:38:30 +0000

BTW, I like using Ben Jann's -fre- (from SSC)  to examine values and value labels together.  

Try: 

*****
cap which fre
if _rc ssc install fre, replace
fre code1
*****
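
If installing from SSC is not an option, built-in commands show the same value-to-label mapping. The following is only a minimal sketch on a three-row toy dataset (a shortened, assumed version of the poster's first file); it relies on -encode- naming the new value label after the generated variable, here code1.

```stata
* Toy data: three of the strings from the first file in the thread
clear
input id str5 code
1 "123J5"
2 "68741"
3 "297J5"
end

* encode numbers the distinct strings 1, 2, 3 in alphabetical order
encode code, gen(code1)

* Show each integer code next to its label text
label list code1

* Show the stored integers instead of the labels
list id code code1, nolabel
```

Here "68741" is stored as 3 and "297J5" as 2, even though they are the second and third observations, because the codes follow the sort order of the strings rather than the order of the rows.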


- Eric
__
Eric A. Booth
Public Policy Research Institute
Texas A&M University
[email protected]
Office: +979.845.6754


On Feb 10, 2011, at 3:32 PM, Eric Booth wrote:

> On Feb 10, 2011, at 3:07 PM, joe j wrote:
>> I wonder if this strange behavior of encoded variables
>> is limited only to 'join' or could it be an issue also in other
>> contexts (?). Thanks for any pointers.
> 
> This is expected behavior. -encode- creates a numeric version of your string variable, with value labels equal to the strings in the original variable.
> Your -joinby- results are unexpected (at least to you, not to Stata) only because you are looking at the value labels, not the values; -merge-, -joinby-, and similar commands combine data using the values, not the value labels.
> 
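> When you encode a string variable, Stata assigns integer values starting at 1 to the distinct strings in alphabetical order, not in observation order (unless you use -encode-'s label() option to reuse an existing value label). Because each file is encoded separately, both files end up with the codes 1 through 7 attached to entirely different strings.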
> Take a look at the values underlying the labels for your code1 variable by typing:  
> 
> ta code1
> ta code1, nol
> *or*
> browse, nolabel
> 
> 
> See -help encode- for more detail on what -encode- is doing to your string variables.
> 
> -Eric
> __
> Eric A. Booth
> Public Policy Research Institute
> Texas A&M University
> [email protected]
> Office: +979.845.6754
> 
> 
> On Feb 10, 2011, at 3:07 PM, joe j wrote:
> 
>> I just wanted to highlight something I encountered while merging two
>> data sets with encoded merge variables. The two tables in reality are
>> a perfect non-match. This is also the case when I use the matching
>> variable 'code' in the string format. But if I encode them and
>> generate a variable 'code1' and use that for merging, there is a
>> perfect match. (Now, I don't remember why I encoded this
>> variable; there must have been a reason, but that was definitely not
>> aimed at merge.)
>> 
>> Below is an example with two files being joined with string variable
>> 'code' and encoded variable 'code1'--the latter results in a false
>> perfect match. I wonder if this strange behavior of encoded variables
>> is limited only to 'join' or could it be an issue also in other
>> contexts (?). Thanks for any pointers.
>> 
>> clear
>> input id str5 code
>> 1 "123J5"
>> 2 "68741"
>> 3 "297J5"
>> 4 "14856"
>> 5 "AB234"
>> 6 "25K45"
>> 7 "12535"
>> end
>> encode code, gen(code1)
>> sort code1
>> save file1.dta, replace
>> 
>> clear
>> input id str5 code
>> 1 "243J5"
>> 2 "68348"
>> 3 "479H5"
>> 4 "467G5"
>> 5 "23TUB"
>> 6 "TU501"
>> 7 "32LK8"
>> end
>> encode code, gen(code1)
>> 
>> joinby code1 using file1.dta, unmatched(both) /*perfect match*/
>> *joinby code using file1.dta, unmatched(both) /*perfect non-match*/
>> 
>> ta _m
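
The false match in the example above can be reproduced in a way that makes the mismatch visible. This is a hedged sketch, not the poster's exact code: it uses three rows per file, a tempfile instead of file1.dta, and a hypothetical variable name code_file1 so that the string from the first file survives the join for comparison.

```stata
* First file: keep the original string under another name
clear
input id str5 code
1 "123J5"
2 "68741"
3 "297J5"
end
encode code, gen(code1)
rename code code_file1
drop id
tempfile file1
save `file1'

* Second file: entirely different strings, encoded separately
clear
input id str5 code
1 "243J5"
2 "68348"
3 "479H5"
end
encode code, gen(code1)

* The join pairs rows purely on the integers 1, 2, 3
joinby code1 using `file1', unmatched(both)
tabulate _merge
list code code_file1 code1, nolabel

* Every row is reported as matched, yet no pair of strings agrees
count if code == code_file1
```

Joining on the string variable itself, as in the commented-out -joinby code- line, avoids the problem because the key is then the actual text rather than a per-file integer code.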



