

Re: st: merging datasets


From   Will Hauser <[email protected]>
To   [email protected]
Subject   Re: st: merging datasets
Date   Wed, 10 Nov 2010 22:38:05 -0500

Mike,
I have worked extensively with matching on string and numeric variables across several data sets. It is a pain. Michael Mitchell has given you good advice; I would build on it by noting the following:
(sorry, this will be lengthy)
Let me summarize by saying: do NOT use the update option, watch for duplicates, and remove matched cases after each pass.

1. With large data sets such as yours (and mine), the joinby command creates a mess, and I don't trust myself to sort it out with logical commands. The best approach is to first make the matches that are sure things (i.e. matches on all variables, or on those that really matter: first name, last name, and birthday). Hopefully the number of cases left without a 'perfect' match is manageable, so that you can do some hand checking with the help of dummy variables that identify which information does not match.
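As a rough sketch of such a 'sure thing' first pass (this assumes the newer merge syntax, Stata 11 or later; the file and variable names are just placeholders):

    use "master.dta", clear
    merge 1:1 firstname lastname birthdate using "using.dta", keep(master match)
    * _merge==3 flags the sure-thing matches; _merge==1 flags master cases that
    * still need a looser pass. (1:1 will complain if the key variables do not
    * uniquely identify cases, which is itself a useful duplicate check.)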

2. You should know that the merge command, at least in Stata 10, will duplicate entries in the master file if there are multiple matches in the using data set that match a case in the master according to the match criteria you set. Stata does this *even though* the master file is "held inviolate." So, if you match based on last name only, you'll wind up with several duplicate entries for "Smith", where each Smith in the master is matched to every "Smith" in the using data set. You'll want a unique identifier for each case in the master so you can see which names end up being duplicated (command: duplicates list idvar). Also do a count before the merge and make sure you end up back at that number at the end of the process.
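A hedged sketch of those checks (the id variable name is made up):

    use "master.dta", clear
    gen long masterid = _n           // unique identifier for each master case
    count
    local n_master = r(N)            // how many cases before any merging
    * ... merge passes go here ...
    duplicates report masterid       // surplus copies mean multiple using matches
    count
    display r(N) - `n_master'        // should be 0 at the end of the process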

3. Do the matching process across multiple 'passes'. The first pass uses the most match variables and the last pass uses the fewest. Choose the match variables carefully so that earlier passes are more trustworthy than later passes (e.g. the pass based on last name and birthday should probably come before the pass based on first initial and last name). When you have a match, remove it from the master (and, in your case, probably the using file as well), because subsequent matches will use less info and there is no reason to think they will be better than a match from an earlier pass where more of the variables matched. When you are done with each merge, save the file under a temporary file name, keep only the matches, and save the result as "first pass matches" or some equivalent. Then re-open the temp file, keep only the unmatched cases (i.e. _merge==1), and save that as the "master data" or some equivalent. When you are done, just append all the saved matches data sets together. *To keep things orderly I like to include the word "master" in the filename for the master dataset and "using" in the filename for the using dataset. The syntax can become cumbersome, and it's nice to see quickly that I'm using the right datasets in the right places.
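One pass of that save/keep/append routine might look like the following, as a hedged sketch (file names are hypothetical; Stata 11+ merge syntax):

    use "master.dta", clear
    merge 1:1 lastname birthdate using "using.dta"
    save "temp.dta", replace

    keep if _merge == 3                   // matched cases from this pass
    save "matches_pass1.dta", replace

    use "temp.dta", clear
    keep if _merge == 1                   // still-unmatched master cases
    drop _merge
    save "master_pass2.dta", replace      // becomes the master for the next pass

    * ... further passes with fewer match variables ...

    use "matches_pass1.dta", clear        // when all passes are done,
    append using "matches_pass2.dta"      // put the matched pieces back together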

4. Clean your string variables carefully and thoroughly. Look out for individuals with hyphenated names, internal capitalization (e.g. McCory), or spaces (e.g. St Joseph). Be sure suffixes aren't contaminating the last name field. Look out for extra spaces (leading, trailing, or internal) and make the capitalization uniform (all caps or all lowercase). Use the string functions, in particular regexm, trim, and itrim, and the user-written command 'strip'. You can't clean your string variables enough!
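A few of the cleaning steps I mean, as a hedged sketch (variable names and the suffix pattern are just examples):

    replace lastname  = upper(itrim(trim(lastname)))      // uniform case, no stray spaces
    replace firstname = upper(itrim(trim(firstname)))
    replace lastname  = regexr(lastname, "[ ,]+(JR|SR|II|III|IV)$", "")   // strip common suffixes
    list lastname if regexm(lastname, "[^A-Z ]")           // eyeball hyphens, periods, digits, etc.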

5. In my experience first names tend to vary - one database may have nicknames while the other may have formal names. Resolve this with a variable for the first initial: first match using the full first name along with the other variables, then have the next pass use the first initial instead, along with the other variables. You may want to do this with birth dates as well - first use the actual date and then just the year (or month and year). Obviously, make sure you convert dates from string format into Stata's numeric date format.
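A hedged sketch of both ideas (variable names are placeholders):

    gen str1 firstinit = substr(firstname, 1, 1)       // for the looser passes
    gen bdate = date(birthdate_str, "MDY")              // string date -> Stata date value
    format bdate %td
    gen byear = year(bdate)                             // year-only version for a later pass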

6. Name and label your _merge variable from each pass so you know where the match came from, how trustworthy it is, which variables it matched on, etc. These are extremely handy when hand checking or when writing logical statements to drop bad matches.
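For example (the pass number and label text are made up):

    rename _merge merge_pass1
    label variable merge_pass1 "pass 1: matched on last name + birthday"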

7. With a large data set, the merge variables probably won't uniquely identify the cases. The same problem applies to the sort command, so when you sort your data sets, use the 'stable' option! If you are like me and like to be able to go back and re-run the do file to make changes or corrections in light of errors you find later on, you may have a real mess on your hands if you don't use the stable option (I found out the hard way). Have a unique sort order at the start (i.e. sort by all the variables) and then always specify the stable option with subsequent sorts. I use numeric ids to refer to each case (e.g. when dropping mismatches), and these ids would change each time I ran the do file because the order of the dataset was not constant when I created the ids after the match process.
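A hedged sketch of that ordering discipline (variable names are placeholders):

    sort lastname firstname birthdate ssn      // enough variables to make the order unique
    gen long caseid = _n                        // ids are now reproducible across re-runs
    sort lastname, stable                       // later sorts keep ties in that order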

8. Keep track of duplicates in the master dataset. You can do this with the egen 'group' function: generate a group for each set of variables that you match on, and for any group consisting of more than one case, tag it as duplicated on that varlist (your match variables) using a dummy variable. This matters because some matches are not necessarily bad matches, but they may be uncertain matches. In short, they may match on the variables you have data for, but, because there are duplicates on those variables, you can't say for sure whether the match is correct. For example, if you have 2 John Smiths in the master who are both missing a birthday, and 2 John Smiths in the using data set who each have a birthday, then a merge will link these up (and generate 2 duplicates), but in the end you have to throw these matches out because you don't know which case in the master goes with which case in the using. Having a variable that identifies cases with duplicates on first and last name will help you write a logical command to drop these uncertain matches.
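A hedged sketch using egen's group() function (variable names are placeholders):

    egen namegroup = group(firstname lastname)
    bysort namegroup: gen byte dup_name = _N > 1    // 1 if another case shares this name
    * dup_name==1 flags the matches you may have to throw out as uncertain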

9. Last and finally, do NOT use the update option! Let's say you have a Franklin Smith in the master file and he is missing his birthday. Now you match based on last name and first initial and get a match for F Smith in the using data set (he has a birthday listed). The data in the master file are not overwritten, so only the missing data is carried over into the master, and your Franklin Smith now has a birthday. Sounds great, but what you cannot see is that Franklin "F Smith" matched to Frederick "F Smith" in the using file - almost certainly not the same person, but you cannot discern this from the master dataset. For this reason **the update option is a baaaad idea unless you are 100% confident the match is, and can only be, correct.**

When a variable in the master has missing data, you can get that data into the master from the using file by opening the using data set before the merge and duplicating that variable under a slightly different name (e.g. varname_using). Then, once you've eliminated bad matches, you can just copy the data from varname_using to varname if varname=="". ***Because the merge will bring varname_using into the master file, make sure you drop it for the *unmatched cases* before the next match (you can do this when you open the master to sort it before the next pass).
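A hedged sketch of that workflow (variable and file names are placeholders; missing() covers both string and numeric variables):

    * in the using file, before the merge:
    use "using.dta", clear
    gen birthdate_using = birthdate
    save "using.dta", replace

    * in the merged file, after dropping bad matches:
    replace birthdate = birthdate_using if missing(birthdate)

    * when re-opening the unmatched master cases for the next pass:
    drop birthdate_using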

One more tip: depending on the complexity of the task, the directory structure, and your cleverness, you may be able to use a foreach loop to do the merges.
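For what it's worth, a rough sketch of that idea, with made-up file names and key lists (Stata 11+ merge syntax):

    local keys1 "firstname lastname birthdate"
    local keys2 "firstinit lastname birthdate"
    local keys3 "firstinit lastname"
    foreach p of numlist 1/3 {
        use "master_pass`p'.dta", clear
        merge 1:1 `keys`p'' using "using.dta"
        * ... keep the matches, save them, and set up the master for the next pass ...
    }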

If you have additional questions about anything I've written here, please feel free to contact me personally. I'm not a Stata guru, but I have used it extensively for precisely the sort of task that you are undertaking, with some very large data sets (e.g. voter rolls). If you'd like to see some of the do files I've used for this purpose, I'd be happy to share. Caveat emptor.

Will

On 11/8/2010 11:50 PM, Michael Eisenberg wrote:
Colleagues,

I have a database of about 20K men that I'd like to merge with another
database.  I have names (first, middle, and last) as well as date of
birth and social security number for most men.  Unfortunately, the
original database has some missing data on birthdate and social
security numbers.  The new database has most of the birthdate info as
well as the geographic information that I need.

Some men do have the same name.

Is there any way to merge based on name if it doesn't uniquely identify
men?  I'd like to somehow match all men and then let me manually
compare based on visit dates to decide if it's likely the match is
correct.  If not, any suggestions?

Thanks for your help.

Mike
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

