Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


Re: st: merge m:1 by string


From   Rebecca Pope <[email protected]>
To   [email protected]
Subject   Re: st: merge m:1 by string
Date   Fri, 18 Mar 2011 18:45:18 -0500

Ben,
If this is real data from your sample, I'm not sure what is causing
your problem. I wasn't able to duplicate the issue you describe.

/***** begin code *****/
* using dataset: one row per name
clear
input str32 name budget
"Alex T. Smith"         130
"Andrew J. Williams"    345
"Steve R. Jackson"      245
end
save using_data, replace

* master dataset: several rows per name
clear
input str32 name household1    date
"Alex T. Smith"         45          1988
"Alex T. Smith"         33          1977
"Andrew J. williams"    12          1999
"Andrew J. Williams"    12          2004
"Steve R. Jackson"      23          1979
end

* many master rows to one using row, matched on name
merge m:1 name using using_data

list
/***** end code *****/

/**** output - apologies if this doesn't line up on your end... ****/

     name                 househ~1   date   budget            _merge
     ---------------------------------------------------------------
  1. Alex T. Smith              45   1988      130       matched (3)
  2. Alex T. Smith              33   1977      130       matched (3)
  3. Andrew J. Williams         12   2004      345       matched (3)
  4. Andrew J. williams         12   1999        .   master only (1)
  5. Steve R. Jackson           23   1979      245       matched (3)

/********/
As you can see, Stata matches everything except obs. #4 above, but
that's to be expected because "williams" is not equivalent to
"Williams"; Stata string comparisons are case-sensitive.
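
If the difference in case is not meaningful in your data (that is an
assumption only you can confirm), one option is to merge on a
lowercased copy of the name rather than on the name itself. A minimal
sketch follows; the variable name "key" and the file name "budget_key"
are made up for illustration.

```stata
* using dataset: build a lowercased, trimmed merge key
clear
input str32 name budget
"Alex T. Smith"         130
"Andrew J. Williams"    345
"Steve R. Jackson"      245
end
gen str32 key = lower(trim(name))
drop name
save budget_key, replace

* master dataset: build the same key
clear
input str32 name household1    date
"Alex T. Smith"         45          1988
"Alex T. Smith"         33          1977
"Andrew J. williams"    12          1999
"Andrew J. Williams"    12          2004
"Steve R. Jackson"      23          1979
end
gen str32 key = lower(trim(name))

* merge on the normalized key instead of the raw name
merge m:1 key using budget_key
tab _merge
```

The original spelling stays in "name", so you can still inspect which
rows only matched after the case was normalized.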

Also, please check whether (1) this code produces the same results on
your computer or (2) the problem you describe emerges even when you
run this code. Since you didn't specify, I'm assuming you are running
Stata 11.

If this code works for you, my guess is that there are differences in
your actual data you can't see by just "eyeballing" it. You say you
checked for leading spaces. Did you check for trailing ones?
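
One way to check for stray blanks might look like the sketch below.
The three names and the trailing blank added to the second one are
invented purely to have something to find; "shown" is a hypothetical
helper variable.

```stata
clear
input str32 name
"Alex T. Smith"
"Andrew J. Williams"
"Steve R. Jackson"
end

* made-up contamination: append a trailing blank to one name
replace name = name + " " in 2

* how many names carry leading or trailing blanks?
count if name != trim(name)

* show offenders between brackets so the blanks become visible
gen str34 shown = "[" + name + "]"
list shown if name != trim(name)

* remove leading/trailing blanks and collapse repeated internal blanks
replace name = trim(itrim(name))
drop shown
```

Running the same cleanup on both the master and the using before the
merge keeps the two key variables comparable.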

As regards -encode-, I think you are using it incorrectly or at least
expecting it to be something it isn't. It just generates a numeric
variable that takes a new value for each distinct value of the string.
There is no particular relationship between the numeric variable and
the string variable other than the alphabetical order of the distinct
string values within that dataset, so the same name can receive
different codes in the master and the using. Observe the results below
(code not shown): "ename" is the encoded name in the master set &
"ename_u" is for using. As you can see, the encoded names are
different for obs 4 & 5.

     name                 househ~1   date   ename   budget   ename_u   _merge
     ------------------------------------------------------------------------
  1. Alex T. Smith              33   1977       1      130         1        3
  2. Alex T. Smith              45   1988       1      130         1        3
  3. Andrew J. Williams         12   2004       2      345         2        3
  4. Andrew J. williams         12   1999       3        .         .        1
  5. Steve R. Jackson           23   1979       4      245         3        3
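
The code behind that listing was not included; one way such a
comparison might be set up is sketched here. The file name
"using_enc" is invented, and the merge is still done on the string
"name", with the encoded variables carried along only for inspection.

```stata
* using dataset: encode the name, keeping the string as well
clear
input str32 name budget
"Alex T. Smith"         130
"Andrew J. Williams"    345
"Steve R. Jackson"      245
end
encode name, gen(ename_u)
save using_enc, replace

* master dataset: encode the name separately
clear
input str32 name household1    date
"Alex T. Smith"         45          1988
"Alex T. Smith"         33          1977
"Andrew J. williams"    12          1999
"Andrew J. Williams"    12          2004
"Steve R. Jackson"      23          1979
end
encode name, gen(ename)

* merge on the string; compare the two sets of numeric codes
merge m:1 name using using_enc
list name ename ename_u _merge, nolabel
```

Because each -encode- call numbers only the distinct values it sees in
its own dataset, the codes are not a safe merge key across files.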

Hope this helps,
Rebecca



         __o                __o
      _`\ <,_            _`\ <,_
     (_)/   (_)          (_)/   (_)
=========================


On Fri, Mar 18, 2011 at 5:21 PM, Ben Ammar <[email protected]> wrote:
>
> Hi everybody,
>
> I've got a problem concerning the merge-command or rather the result of it.
> I'd be very grateful for any help. There are more than 2 million names (%str32) in my master and 4000 names (%str32) in my using for the variable (name) I want to merge on. Since there are multiple observations with the same name in my master but only one observation per name in the using, the m:1 merge command should be correct.
>
> master:
> name               household1    date
>
> Alex T. Smith         45          1988
> Alex T. Smith         33          1977
> Andrew J. williams    12          1999
> Andrew J. Williams    12          2004
> Steve R. Jackson      23          1979
>
>
> using:
> name                 budget
>
> Alex T. Smith         130
> Andrew J. Williams    345
> Steve R. Jackson      245
>
>
> but what happens is that the using is appended at the end of the master after the merge. I think the problem here is the string variable, even though I don't understand why. When I encoded the string variable (name), about 8000 observations (out of 2 million) in the master were matched as they should be, but unfortunately that is not enough. The format of the variable is the same in both data sets and I even sorted them. I also checked whether there's a space at the beginning of the name or anything within the string that differs from the using name, but both string variables are exactly the same. The last (unlikely) cause I checked was RAM: I dropped all other variables, which could have taken too much memory and so might explain why only a very small part was matched when I tried encoding the string. That didn't work either. Does anyone have an idea about this, or has anyone had the same experience? Thanks for any comments!
>
> Regards
> Ben
>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/


