Bookmark and Share

Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

st: Re: How to merge datasets when there are missing values in the matching variables


From   "Joseph Coveney" <jcoveney@bigplanet.com>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: Re: How to merge datasets when there are missing values in the matching variables
Date   Sun, 22 Jan 2012 13:46:23 +0900

Shihying Yao wrote:

I am trying to merge two data files using two unique ID variables, ID1
and ID2. Note that not all of the subjects have both ID1 and ID2
information in both files. Suppose the names of the data files are
"master" and "subset." Below resembles the code I used:

use subset, clear
sort ID1 ID2
save subset,replace

use master, clear
sort ID1 ID2
merge ID1 ID2 using subset

The problem occurs for subjects whose ID1 information is missing in
one of the data files (either one). Although these subjects can be
uniquely identified using ID2 in both files, their records are not
merged and there are duplicate records (i.e., one record has both ID1
and ID2 information, while the other record has ID2 information and
ID1 missing) in the merged file. It doesn't help whether I sort ID1 or
ID2 first, since some subjects have ID2 information in only one file.

The version I am using is STATA 10. Any help is appreciated.

--------------------------------------------------------------------------------

You can try making your ID1 variable never-missing, substituting an
arbitrary-but-identifying constant (e.g., -99, "!?!?") for missing values
(blanks).  You can use -mvencode- to do this for numeric IDs and then -mvdecode-
afterwards.

If you're worried that this approach might accidentally corrupt your two ID
variables, then try concatenating ID1 and ID2 to a single, never-missing
(perhaps, temporary) ID variable, and then -merge- on that ID variable.  If your
two ID variables are numeric (integer), then it could be something like the
following:

use subset, clear
generate str ID = string(ID1) + "-" + string(ID2)
sort ID
tempfile subset
save `subset'

use master
generate str ID = string(ID1) + "-" + string(ID2)
sort ID
merge ID using `subset'
drop ID

I'd strive to make the concatenated ID variable of uniform length, by padding
the integers with zeros, -string(ID1, "%05.0f")- for example.  

If ID1 and ID2 are already string, then the concatenation step won't need to use
-string()-, of course.

Regardless, be sure to insert some kind of delimiter (hyphen, space etc.) to
avoid problems in situations where there are some ID2s that would otherwise
happen to be the same as some concatenated IDs:

ID1	ID2		ID
---	---		-------
	ABCDEF		-ABCDEF
ABC	DEF		ABC-DEF

Joseph Coveney


*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2018 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   Site index