[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]
Re: st: Problems with -merge

From	[email protected] (William Gould, StataCorp LP)
To	[email protected]
Subject	Re: st: Problems with -merge
Date	Fri, 21 Mar 2008 16:07:21 -0500
BPM <[email protected]> is having difficulty merging datasets, and 
writes, 

> I need to match merge two datasets, one data is a subset of the other  
> data at the individual level and contains with extra variable fields.   
> The data read into STATA without issue, including attached labels. 
> [...]
> [...] I do not get an error message. I just get _merge = 1 or 2 and  
> not _merge = 3.
> 
> The variables to match include: lastname, firstname, ssn, incidentdate

Since you do not get an error message, and since _merge == 1 or 2, 
there is one thing you can be sure of, and that is that there is not 
one occurance of (lastname, firstname, ssn, incedentdate) in one 
dataset that appears in the other.  What you need to ask yourself is 
how that can be given that you think there are occurances that are in 
both.

Let's start with lastname and firstname.  Are they string variables?
They should be.  I ask because you mention "including attached labels" 
and I wonder if the two name fields are perhaps being run through 
value labels.  That could cause problems.  If they are being run through 
value labels, do the following in each dataset, 

        . decode lastname, gen(lname)
        . decode firstname, gen(fname) 
        . drop lastname firstname 
        . rename lname lastname 
        . rename fname firstname

Okay, now they are strings.  Now let's worry about leading or trailing 
blanks.  Do the following:

        . replace lastname = trim(lastname)
        . replace firstname = trim(firstname)

Also check capitalization.  Are they capitalized the same way between the 
two datasets?  If not, let's just convert to lower case:

        . replace lastname = lower(lastname)
        . replace firstname = lower(firstname)

Now what about variable ssn?  Is it string, numeric, or value-labeled numeric?
It should either be a string or a numeric without a value label, and if 
the latter, make sure it is a double, since only a double has sufficient 
number of digits to hold a US SSN without rounding.  If it is numeric and 
not a double, you will have to go back and read the original data again.
There is no undoing rounding errors after the fact.

If variable ssn is a string, then 

        . replace ssn = trim(ssn) 

just to make sure there are no leading or trailing blanks.  

Finally, we are on incidentdate.  Is that a string variable?  Fine if it 
is (but get rid of leading and trialing blanks), and realize that 
string variables only match if they are literally the same.  "March 3, 2008"
is not the same as "March 03, 2003", or "3/3/2003", etc.  If the formats 
are different, you will have to make incidenetdate into a Stata date before 
you will be able to match them.

I'm hoping something I just said will help, but I worry it will not because 
I'm just making shots in the dark.  So let's assume that nothing above 
solves the problem.  In that case, let's start simplifying the problem 
until it becomes obvious to us just exactly what is wrong.  With that 
knowledge, we can go back, fix the problem, and fix the two datasets.

Let me assume the original two datasets are called super.dta and sub.dta.
Do the following:

        . use super 
        . keep lastname 
        . sort lastname 
        . by lastname: keep if _n==1
        . save one

        . use sub
        . keep lastname 
        . sort lastname 
        . by lastname: keep if _n==1
        . save two

        . use one 
        . merge lastname using two
        . tabulate _merge 

The point of this experiment is to find out whether lastname ever merges.
Certainly some lastnames are the same between the two datasets.  What I 
did was keep each name (but only once) in each dataset, and then merge.

There are two possiblities from this experiment:  Either lastname did merge, 
and so we can dismiss variable lastname as the cause of the problem, or 
lastname did not merge.  If it did not merge, go digging in one.dta and 
two.dta, find an example that you think should have merged, and then figure 
out why they did not.

Now do the above experiment substituting firstname everywhere lastname 
appears.  Same questions, same procedure.

Now do the above experiment substituting ssn, and then do it again 
substituting incidencedate.

By now, I suspect you will have found the problem.  But let's assume 
you have not.  This time, let's repeat the experiment, but with 
firstname and lastname:

        . use super 
        . keep lastname firstname
        . sort lastname firstname
        . by lastname firstname: keep if _n==1
        . save one

        . use sub
        . keep lastname firstname
        . sort lastname firstname
        . by lastname firstname: keep if _n==1
        . save two

        . use one 
        . merge lastname firstname using two
        . tabulate _merge 

Did that work?  No?  Figure out why.  Yes?  Do it again, but with three 
variables, namely lastname, firstname, and ssn.  And then with all four.
Eventually, it will become obvious what the problem is.

Good luck. 

-- Bill
[email protected]
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
Prev by Date: Re: st: Problems with -merge
Next by Date: Re: st: logit / odds ratio interpretation using -listcoef-
Previous by thread: Re: st: Problems with -merge
Next by thread: Re: st: Problems with -merge
Index(es):
- Date
- Thread