st: Re: How to merge datasets when there are missing values in the matching variables

From	"Joseph Coveney" <[email protected]>
To	<[email protected]>
Subject	st: Re: How to merge datasets when there are missing values in the matching variables
Date	Sun, 22 Jan 2012 13:46:23 +0900

Shihying Yao wrote:

I am trying to merge two data files using two unique ID variables, ID1
and ID2. Note that not all of the subjects have both ID1 and ID2
information in both files. Suppose the names of the data files are
"master" and "subset." The code I used resembles the following:

use subset, clear
sort ID1 ID2
save subset, replace

use master, clear
sort ID1 ID2
merge ID1 ID2 using subset

The problem occurs for subjects whose ID1 information is missing in
one of the data files (either one). Although these subjects can be
uniquely identified using ID2 in both files, their records are not
merged and there are duplicate records (i.e., one record has both ID1
and ID2 information, while the other record has ID2 information and
ID1 missing) in the merged file. It doesn't help whether I sort ID1 or
ID2 first, since some subjects have ID2 information in only one file.

The version I am using is Stata 10. Any help is appreciated.

--------------------------------------------------------------------------------

You can try making your ID1 variable never-missing by substituting an
arbitrary but identifiable constant (e.g., -99 for numeric IDs, "!?!?" for
string IDs) for missing values (blanks).  For numeric IDs, -mvencode- will do
the substitution, and -mvdecode- will restore the missing values afterwards.
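
As a rough sketch of that recode, merge, and restore sequence for a numeric ID1, something like the following might work under Stata 10's -merge- syntax. The placeholder value -99 and the tempfile name `recoded' are assumptions made for illustration. Without its -override- option, -mvencode- refuses to make the change if -99 is already used in the variable, which guards against colliding with a real ID.

```stata
* Sketch only: -99 is an assumed placeholder that is not a real ID value
use subset, clear
* Recode missing ID1 to -99
mvencode ID1, mv(-99)
sort ID1 ID2
* `recoded' is a hypothetical tempfile name
tempfile recoded
save `recoded'

use master, clear
mvencode ID1, mv(-99)
sort ID1 ID2
merge ID1 ID2 using `recoded'
* Turn the placeholder back into system missing
mvdecode ID1, mv(-99)
```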

If you're worried that this approach might accidentally corrupt your two ID
variables, then try concatenating ID1 and ID2 to a single, never-missing
(perhaps, temporary) ID variable, and then -merge- on that ID variable.  If your
two ID variables are numeric (integer), then it could be something like the
following:

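* Build the combined ID in the using file and save a sorted copy
use subset, clear
generate str ID = string(ID1) + "-" + string(ID2)
sort ID
tempfile subset
save `subset'

* Build the same ID in the master file, then match on it
use master, clear
generate str ID = string(ID1) + "-" + string(ID2)
sort ID
merge ID using `subset'
drop ID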
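
Before relying on the merged result, it may be worth confirming that the concatenated ID really does identify observations uniquely in each file. A minimal check, assuming the same variable names as above, could look like this:

```stata
* Sketch: run the same check in subset before saving it
use master, clear
generate ID = string(ID1) + "-" + string(ID2)
* isid exits with an error if ID does not uniquely identify observations
isid ID
```

After the merge, -tabulate _merge- shows how many observations came from the master file only (1), the using file only (2), or both (3).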

I'd strive to make the concatenated ID variable of uniform length by padding
the integers with leading zeros, for example, -string(ID1, "%05.0f")-.
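
A padded version of the concatenation might look like the sketch below. It assumes that both IDs are nonnegative integers of at most five digits; a wider format would be needed otherwise. An explicit format also sidesteps the default formatting of -string()-, which can render long integers in scientific notation.

```stata
* Assumes ID1 and ID2 are nonnegative integers of at most five digits
generate ID = string(ID1, "%05.0f") + "-" + string(ID2, "%05.0f")
* For example, ID1 = 42 and ID2 = 7 would give "00042-00007"
* A missing value is rendered as "." and so is not padded
```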

If ID1 and ID2 are already string, then the concatenation step won't need to use
-string()-, of course.
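
For string IDs, the step could be as simple as the following sketch, which assumes both variables are already string:

```stata
* Assumes ID1 and ID2 are both string variables
* A blank ID1 contributes nothing before the hyphen, e.g., "-DEF"
generate ID = ID1 + "-" + ID2
```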

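Regardless, be sure to insert some kind of delimiter (hyphen, space, etc.)
between the two parts.  Without one, an observation with a missing ID1 could
end up with the same concatenated ID as a different observation whose ID1 and
ID2 happen to join to the same string.  In the example below, both rows would
become "ABCDEF" if the hyphen were left out:

ID1    ID2       ID
---    ------    -------
       ABCDEF    -ABCDEF
ABC    DEF       ABC-DEF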
Joseph Coveney

