Notice: On March 31, it was **announced** that Statalist is moving from an email list to a **forum**. The old list will shut down on April 23, and its replacement, **statalist.org**, is already up and running.


From: Will Hauser <whauseriii@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: merge creates duplicates in master data
Date: Mon, 26 Apr 2010 15:50:37 -0400

Michael,

Thanks
William Hauser

Michael Norman Mitchell wrote:

Dear William

I have approached these kinds of problems in the past, but have approached them in a different way with quite a bit of success. Please take this for what it is worth, just a brainstorming idea or an idea for a future approach. You may see it useful in your case, maybe not.

Consider the two datasets, A and B, that have the kind of information that you are describing. They may match perfectly, they may match to varying degrees of imperfect matches. I would set up a series of match criteria, for example

1. first name, last name, middle initial, region

Matches at this level would be considered a "quality 1" match. If a quality 1 match was not found, I would take the *unmatched observations* from each dataset, and submit them to a second match criteria, for example

2. first name, last name, region

Matches at this level would be considered a "quality 2" match. If a quality 2 match was not found, I would take the *unmatched observations* (neither matched at quality 1 or quality 2) and then try a third round, for example

3. first initial, last name, region

Matches at this level would be considered a "quality 3" match. If this was the final match criteria, then I would consider the remaining unmatched to be "not found" and would manually inspect them looking for other ways that they could be matched. I would then append the matched records from "round 1", "round 2" and "round 3" and those would form the matched records.

I don't know if this strategy is exactly helpful in your case. If not, I hope it is something that you (or other Statalisters) may find useful in the future. In fact, I think I will put this on my list of "to do" items for an upcoming Stata tidbit of the week.

Best luck and best regards,

Michael N. Mitchell
See the Stata tidbit of the week at...
http://www.MichaelNormanMitchell.com

On 2010-04-25 7.42 PM, Will Hauser wrote:

Hello all,

I am experiencing unexpected behavior in Stata 10 when using the merge command.

I am matching two lists based on a series of string variables (first name, last name, initials) and one numeric region identifier. I have carefully cleaned the string variables of excess spaces and punctuation marks but they are inherently difficult to match as the name on one list may correspond to a nickname or abbreviation on the other (e.g. "WILLIAM" may correspond with "W" or "BILL"). My approach to this problem is to make multiple merges between the two lists, each time using less information. For example, the first merge uses first name, last name, and region. The second uses first initial, last name, and region. The third just last name and region (and so on). Since the master data is inviolate, subsequent mismatches should never overwrite earlier 'good' matches. I am using the update option but not the replace option. I am not using the unique option since the variables do not uniquely identify the cases in either the master or the using.

From what I can tell Stata is duplicating cases in the master dataset. The end result is 10 pairs of duplicate entries that appear identical in every way save for the _merge summary variable from the last merge. The summary variable indicates using agrees with master (3) for one of the duplicates and indicates that using does not agree with master for the other (5). There are no missing values in either list and I can see nothing special about the entries that are duplicated. I have used the duplicates command to verify that these duplicates are not present in the master data prior to merging.

I assume this is not a bug but is rather something about the merge command I am misunderstanding, and that concerns me very much. I would be happy to provide the lists and the relevant portion of the do file if anyone is interested. The lists are public and are not unusually long (958 cases in the master and 593 cases in the using).

Thanks for your insight,
William Hauser
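The round-by-round strategy described in Michael's message can be sketched in a few lines. What follows is only an illustration in plain Python, not Stata syntax and not anyone's actual do-file. The field names (`first`, `last`, `middle_initial`, `region`), the helper names and the four sample records per list are all made up for the example. One assumption goes beyond the thread: when several using records share a key, the sketch simply takes the first one and removes it from the pool, so a master record is never paired twice.

```python
from collections import defaultdict

# Hypothetical field names; the thread describes the information, not a layout.
TIERS = [
    ("first", "last", "middle_initial", "region"),  # quality 1
    ("first", "last", "region"),                    # quality 2
    ("first_initial", "last", "region"),            # quality 3
]


def with_initial(rec):
    """Return a copy of rec with a derived first_initial field."""
    out = dict(rec)
    out["first_initial"] = rec["first"][:1]
    return out


def tiered_match(master, using, tiers=TIERS):
    """Match round by round, passing only the leftovers to the next round.

    Returns (matches, master_left, using_left); matches holds
    (master_record, using_record, quality) tuples.
    """
    master_left = [with_initial(r) for r in master]
    using_left = [with_initial(r) for r in using]
    matches = []
    for quality, keys in enumerate(tiers, start=1):
        # Index the still-unmatched using records by this round's key.
        index = defaultdict(list)
        for rec in using_left:
            index[tuple(rec[k] for k in keys)].append(rec)
        still_unmatched = []
        for rec in master_left:
            candidates = index.get(tuple(rec[k] for k in keys))
            if candidates:
                # Assumption: take the first candidate and consume it,
                # so no record on either side is matched more than once.
                matches.append((rec, candidates.pop(0), quality))
            else:
                still_unmatched.append(rec)
        master_left = still_unmatched
        using_left = [r for group in index.values() for r in group]
    return matches, master_left, using_left


# Made-up records, for illustration only.
master = [
    {"first": "WILLIAM", "last": "SMITH", "middle_initial": "A", "region": 1},
    {"first": "MARY", "last": "JONES", "middle_initial": "B", "region": 2},
    {"first": "ROBERT", "last": "BROWN", "middle_initial": "C", "region": 3},
    {"first": "ANNA", "last": "LEE", "middle_initial": "D", "region": 4},
]
using = [
    {"first": "WILLIAM", "last": "SMITH", "middle_initial": "A", "region": 1},
    {"first": "MARY", "last": "JONES", "middle_initial": "", "region": 2},
    {"first": "R", "last": "BROWN", "middle_initial": "", "region": 3},
    {"first": "ANNA", "last": "LEE", "middle_initial": "D", "region": 5},
]

matches, master_left, using_left = tiered_match(master, using)
for m, u, quality in matches:
    print(m["last"], "matched at quality", quality)
```

Whatever remains in `master_left` and `using_left` after the last round corresponds to the "not found" group that would be inspected by hand. Taking the first candidate is a shortcut; keys that remain ambiguous could instead be set aside for manual review.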

*
* For searches and help try:
* http://www.stata.com/help.cgi?search
* http://www.stata.com/support/statalist/faq
* http://www.ats.ucla.edu/stat/stata/

**Follow-Ups**:
- **Re: st: merge creates duplicates in master data** *From:* Michael Norman Mitchell <Michael.Norman.Mitchell@gmail.com>

**References**:
- **st: merge creates duplicates in master data** *From:* Will Hauser <whauseriii@gmail.com>
- **Re: st: merge creates duplicates in master data** *From:* Michael Norman Mitchell <Michael.Norman.Mitchell@gmail.com>
