Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: encode results in false match - merge/joinby


From   joe j <joe.stata@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: encode results in false match - merge/joinby
Date   Thu, 10 Feb 2011 22:56:06 +0100

Thanks Eric. I understand it now. -fre- looks interesting indeed!
Joe.

On Thu, Feb 10, 2011 at 10:38 PM, Eric Booth <ebooth@ppri.tamu.edu> wrote:
> <>
>
> BTW, I like using Ben Jann's -fre- (from SSC) to examine values and value labels together.
>
> Try:
>
> *****
> cap which fre
> if _rc ssc install fre, replace
> fre code1
> *****
>
>
> - Eric
> __
> Eric A. Booth
> Public Policy Research Institute
> Texas A&M University
> ebooth@ppri.tamu.edu
> Office: +979.845.6754
>
>
> On Feb 10, 2011, at 3:32 PM, Eric Booth wrote:
>
>> <>
>>
>> On Feb 10, 2011, at 3:07 PM, joe j wrote:
>>> I wonder if this strange behavior of encoded variables
>>> is limited only to 'join' or could it be an issue also in other
>>> contexts (?). Thanks for any pointers.
>>
>> This is expected behavior. -encode- creates a numeric version of your string variable, with value labels equal to the strings in the old variable.
>> Your -joinby- results are unexpected (at least to you, not to Stata) only because you are looking at the value labels, not the values; -merge-, -joinby-, etc. use the values, not the value labels, to combine data.
>>
>> When you encode a string variable, Stata sorts the distinct strings alphabetically and assigns values starting at 1 for the first of them (unless you use -encode-'s label() option to reuse or extend an existing value label). Each of your two files has seven distinct codes, so each gets the values 1 through 7, and those values match even though the strings do not.
>> Take a look at the values underlying the labels for your code1 variable by typing:
>>
>> ta code1
>> ta code1, nol
>> *or*
>> browse, nolabel
>>
>>
>> See -help encode- for more detail on what -encode- is doing to your string variables.
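>>
>> A minimal sketch of the same effect, assuming only the first two codes from each of your files (the comments describe what I would expect to see, not output I have pasted):
>>
>> ```stata
>> * first file: two strings, numbered in alphabetical order starting at 1
>> clear
>> input str5 code
>> "123J5"
>> "68741"
>> end
>> encode code, gen(code1)
>> * show the integers stored in code1 rather than their labels
>> list code code1, nolabel
>> * show the integer-to-string mapping held in the value label code1
>> label list code1
>>
>> * second file: different strings, numbered from 1 again
>> * (-clear- also drops the value label defined above)
>> clear
>> input str5 code
>> "243J5"
>> "68348"
>> end
>> encode code, gen(code1)
>> list code code1, nolabel
>> label list code1
>> ```
>>
>> Both runs should store the same integers in code1 while the labels differ, which is why joining on code1 matches everything. Joining on the string variable code itself avoids the problem.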
>>
>> -Eric
>> __
>> Eric A. Booth
>> Public Policy Research Institute
>> Texas A&M University
>> ebooth@ppri.tamu.edu
>> Office: +979.845.6754
>>
>>
>> On Feb 10, 2011, at 3:07 PM, joe j wrote:
>>
>>> I just wanted to highlight something I encountered while merging two
>>> data sets with encoded merge variables. The two tables in reality are
>>> a perfect non-match. This is also the case when I use the matching
>>> variable 'code' in the string format. But if I encode them to
>>> generate a variable 'code1' and use that for merging, there is a
>>> perfect match. (Now, I don't remember why I encoded this
>>> variable; there must have been a reason, but it was definitely not
>>> aimed at merging.)
>>>
>>> Below is an example with two files being joined with string variable
>>> 'code' and encoded variable 'code1'--the latter results in a false
>>> perfect match. I wonder if this strange behavior of encoded variables
>>> is limited only to 'join' or could it be an issue also in other
>>> contexts (?). Thanks for any pointers.
>>>
>>> clear
>>> input id str5 code
>>> 1 "123J5"
>>> 2 "68741"
>>> 3 "297J5"
>>> 4 "14856"
>>> 5 "AB234"
>>> 6 "25K45"
>>> 7 "12535"
>>> end
>>> encode code, gen(code1)
>>> sort code1
>>> save file1.dta, replace
>>>
>>> clear
>>> input id str5 code
>>> 1 "243J5"
>>> 2 "68348"
>>> 3 "479H5"
>>> 4 "467G5"
>>> 5 "23TUB"
>>> 6 "TU501"
>>> 7 "32LK8"
>>> end
>>> encode code, gen(code1)
>>>
>>> joinby code1 using file1.dta, unmatched(both) /*perfect match*/
>>> *joinby code using file1.dta, unmatched(both) /*perfect non-match*/
>>>
>>> ta _m
>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>


