Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: merge m:1 by string
From
Rebecca Pope <[email protected]>
To
[email protected]
Subject
Re: st: merge m:1 by string
Date
Fri, 18 Mar 2011 18:45:18 -0500
Ben,
If this is real data from your sample, I'm not sure what is causing
your problem. I wasn't able to duplicate the issue you describe.
/***** begin code *****/
clear
input str32 name budget
"Alex T. Smith" 130
"Andrew J. Williams" 345
"Steve R. Jackson" 245
end
save using, replace
clear
input str32 name household1 date
"Alex T. Smith" 45 1988
"Alex T. Smith" 33 1977
"Andrew J. williams" 12 1999
"Andrew J. Williams" 12 2004
"Steve R. Jackson" 23 1979
end
merge m:1 name using using
list
/***** end code *****/
/**** output - apologies if this doesn't line up on your end... ****/
     name                househ~1   date   budget            _merge
     --------------------------------------------------------------
  1. Alex T. Smith             45   1988      130       matched (3)
  2. Alex T. Smith             33   1977      130       matched (3)
  3. Andrew J. Williams        12   2004      345       matched (3)
  4. Andrew J. williams        12   1999        .   master only (1)
  5. Steve R. Jackson          23   1979      245       matched (3)
/********/
As you can see, Stata matches everything except obs. #4 above, but
that's to be expected because "williams" is not equivalent to
"Williams"; Stata string comparisons are case-sensitive.
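If the capitalization difference is just a data-entry inconsistency (an assumption; it could also be two different people), one way around it is to merge on a lowercased copy of the name. A minimal sketch, where name_key and the tempfile budgets are hypothetical names that do not appear in the thread:

```stata
* Sketch only: name_key and budgets are illustrative names.
clear
input str32 name budget
"Alex T. Smith" 130
"Andrew J. Williams" 345
"Steve R. Jackson" 245
end
generate str32 name_key = lower(name)   // case-insensitive merge key
isid name_key                           // stops with an error if the key is not unique
keep name_key budget
tempfile budgets
save "`budgets'"

clear
input str32 name household1 date
"Alex T. Smith" 45 1988
"Alex T. Smith" 33 1977
"Andrew J. williams" 12 1999
"Andrew J. Williams" 12 2004
"Steve R. Jackson" 23 1979
end
generate str32 name_key = lower(name)   // same transformation as in the using data
merge m:1 name_key using "`budgets'"
tabulate _merge
```

The original spelling stays in name, so nothing is lost; only the key is normalized. The -isid- check matters because lowercasing could, in real data, collapse two distinct using names into one.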
Also, please check whether (1) this code produces the same results on
your computer or (2) the same problem emerges even when you run it.
Since you didn't specify, I'm assuming you are running Stata 11.
If this code works for you, my guess is that there are differences in
your actual data you can't see by just "eyeballing" it. You say you
checked for leading spaces. Did you check for trailing ones?
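A quick way to check for padding you can't see is to compare each name with its trimmed version and look at the string lengths. A minimal sketch on toy data; has_pad and name_len are hypothetical helper variables, and the blanks are added deliberately so there is something to find:

```stata
* Sketch only: toy data with one trailing and one leading blank added on purpose.
clear
input str32 name
"Alex T. Smith"
"Alex T. Smith"
"Steve R. Jackson"
end
replace name = name + " " in 2          // trailing blank
replace name = " " + name in 3          // leading blank

generate byte has_pad  = name != trim(name)   // 1 if leading/trailing blanks are present
generate int  name_len = length(name)         // padded names are longer than they look
list name name_len has_pad
count if has_pad

replace name = trim(name)               // strip leading and trailing blanks
```

Running the same -count- in both the master and the using shows which file carries the padding. -itrim()- can be applied in the same way if repeated blanks inside the name are suspected.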
As regards -encode-, I think you are using it incorrectly, or at least
expecting it to be something it isn't. It just generates a numeric
variable that takes a new value for each distinct value of the string.
There is no particular relationship between the numeric variable and
the string variable other than the alphabetical (sort) order of the
distinct string values within that dataset, so the same name can get a
different code in the master and in the using. Observe the results
below (code not shown): "ename" is the encoded name in the master set &
"ename_u" is the encoded name in the using. As you can see, the encoded
names are different for obs 4 & 5.
     name                househ~1   date   ename   budget   ename_u   _merge
     -----------------------------------------------------------------------
  1. Alex T. Smith             33   1977       1      130         1        3
  2. Alex T. Smith             45   1988       1      130         1        3
  3. Andrew J. Williams        12   2004       2      345         2        3
  4. Andrew J. williams        12   1999       3        .         .        1
  5. Steve R. Jackson          23   1979       4      245         3        3
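The code behind the table above was not posted. The following is one plausible reconstruction, offered as a sketch rather than the exact commands used; the tempfile name budgets is an assumption, and the row order of the listing may differ from the table:

```stata
* Hypothetical reconstruction: encode name separately in each file, then merge on the string.
clear
input str32 name budget
"Alex T. Smith" 130
"Andrew J. Williams" 345
"Steve R. Jackson" 245
end
encode name, generate(ename_u)          // codes 1-3, based only on the using names
tempfile budgets
save "`budgets'"

clear
input str32 name household1 date
"Alex T. Smith" 45 1988
"Alex T. Smith" 33 1977
"Andrew J. williams" 12 1999
"Andrew J. Williams" 12 2004
"Steve R. Jackson" 23 1979
end
encode name, generate(ename)            // codes 1-4, based only on the master names
merge m:1 name using "`budgets'"
list name household1 date ename budget ename_u _merge, nolabel
```

Because each -encode- call numbers only the names present in its own file, "Steve R. Jackson" is 4 in the master and 3 in the using. Merging on the encoded variables would therefore pair up the wrong records, which is why the merge here is still done on the string.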
Hope this helps,
Rebecca
__o __o
_`\ <,_ _`\ <,_
(_)/ (_) (_)/ (_)
=========================
On Fri, Mar 18, 2011 at 5:21 PM, Ben Ammar <[email protected]> wrote:
>
> Hi everybody,
>
> I've got a problem concerning the merge-command or rather the result of it.
> I'd be very grateful for any help. There are more than 2 million names (%str32) in my master and 4000 names (%str32) in my using for the variable (name) I want to merge on. Since there are multiple observations with the same name in my master but only one observation per name in the using, the m:1 merge command should be the correct one.
>
> master:
> name household1 date
>
> Alex T. Smith 45 1988
> Alex T. Smith 33 1977
> Andrew J. williams 12 1999
> Andrew J. Williams 12 2004
> Steve R. Jackson 23 1979
>
>
> using:
> name budget
>
> Alex T. Smith 130
> Andrew J. Williams 345
> Steve R. Jackson 245
>
>
> but what happens is that the using is appended at the end of the master after the merge. I think the problem here is the string variable, even though I don't understand why. When I encoded the string variable (name), about 8000 observations (out of 2 million) in the master were matched just as they should be, but unfortunately that is not yet enough.
>
> The format of the variable is the same in both data sets, and I even sorted them. I also checked whether there's a space at the beginning of the name or anything within the string that differs from the using name, but both string variables are exactly the same. The last (unlikely) case I checked was the RAM: I dropped all other variables, which could have taken too much memory and so explain why only a very small part was matched when trying to encode the string. That didn't work either. Does anyone have an idea about this, or has anyone had the same experience? Thanks for any comments!
>
> Regards
> Ben
>
>
> *
> * For searches and help try:
> * http://www.stata.com/help.cgi?search
> * http://www.stata.com/support/statalist/faq
> * http://www.ats.ucla.edu/stat/stata/