Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: merge creates duplicates in master data


From   Robert Picard <[email protected]>
To   [email protected]
Subject   Re: st: merge creates duplicates in master data
Date   Mon, 26 Apr 2010 10:52:37 -0400

Doing a match merge when neither master or or using datasets uniquely
identify observations rarely makes sense. I think the update option
only makes sense on a one-to-one or many-to-one merge. Here's an
example of how merge works when you perform a many-to-many match merge
with the update option:

*--------------------------- begin example -----------------------
version 10

clear
input str2(v1 v2 v3 v4)
aa bb c1 d1
aa bb c2 d2
aa bb c3 d3
aa bb c4 d4
aa bb c5 d5
aa yy c6 d6
end
sort v1 v2 v3
tempfile a
save "`a'"

clear
input str2(v1 v2 v3 v4)
aa bb "" e1
aa bb "" e1
aa bb c7 e2
aa xx c8 e3
end

sort v1 v2 v4
merge v1 v2 using "`a'", update
tab _merge

sort v1 v2 v3 v4
list , noobs sepby(v1 v2)
*--------------------- end example --------------------------

Observations from the master are matched, one at a time, with the ones
in the using dataset. As you reach "aa bb c7 e2" in the master, the
merge becomes a one-to-many merge until you run out of matching "aa
bb" values in the using.

You might want to take a look at -group_id- that I recently posted on
SSC. If you assign unique identifier codes in both your lists and then
append both list, you can use -group_id- to consolidate identifiers
when matching by a combination of other variables.

Hope this helps, Robert
http://robertpicard.com/


On Sun, Apr 25, 2010 at 10:42 PM, Will Hauser <[email protected]> wrote:
> Hello all,
>
> I am experiencing unexpected behavior in Stata 10 when using the merge
> command.
> I am matching two lists based on a series of string variables (first name,
> last name, initials) and one numeric region identifier.  I have carefully
> cleaned the string variables of excess spaces and punctuation marks but they
> are inherently difficult to match as the name on one list may correspond to
> a nick name or abbreviation on the other (e.g. "WILLIAM" may correspond with
> "W" or "BILL").  My approach to this problem is to make multiple merges
> between the two lists each time using less information.  For example, the
> first merge uses first name, last name, and region.  The second uses first
> initial, last name, and region.  The third just last name and region (and so
> on).  Since the master data is inviolate subsequent mismatches should never
> overwrite earlier 'good' matches.  I am using the update option but not the
> replace option.  I am not using the unique option since the variables do not
> uniquely identify the cases in either the master or the using.
>
> From what I can tell Stata is duplicating cases in the master dataset.  The
> end result is 10 pairs of duplicate entries that appear identical in every
> way save for the _merge summary variable from the last merge.  The summary
> variable indicates using agrees with master (3) for one of the duplicates
> and indicates that using does not agree with master for the other (5).
>  There are no missing values in either list and I can see nothing special
> about the entries that are duplicated.  I have used the duplicates command
> to verify that these duplicates are not present in the master data prior to
> merging.
>
> I assume this is not a bug but is rather something about the merge command I
> am misunderstanding and that concerns me very much.  I would be happy to
> provide the lists and the relevant portion of the do file if anyone is
> interested.  The lists are public and are not unusually long (958 cases in
> the master and 593 cases in the using).
>
> Thanks for your insight,
>
> William Hauser
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index