Statalist



Re: st: Random merging


From   Austin Nichols <[email protected]>
To   [email protected]
Subject   Re: st: Random merging
Date   Fri, 31 Jul 2009 14:10:37 -0400

Anna Dijkstra <[email protected]> :
I would think you could match on some observables (also google
"statistical matching") but you can try:

clear all
tempfile master using
input linkidx dupersid rxrecidx
020183 02019 1
020152 02019 2
110161 11010 1
110161 11010 3
end
save `using'
clear all
input linkidx evntidx eventyr eventmm eventdd
020183 1 2006 8 6
020152 1 2006 8 6
110161 5 2006 4 10
110161 2 2006 7 19
110161 8 2006 5 8
end
g obs=_n
save `master'

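* Fix the seed so the random picks are reproducible
set seed 101
use `master', clear
* Form every pairing of master and using rows that share a linkidx
joinby linkidx using `using', unm(both)
drop _m
format *idx %9.0f
* N = number of candidate pairings for each master row
egen N=count(linkidx), by(linkidx obs)
* u = random integer in 1..N; ok = the draw in the first row of each group
g u=ceil(uniform()*N)
bys obs linkidx: g ok=u[1]
* Keep only the randomly chosen pairing for each master row
by obs linkidx: keep if (_n==ok)
drop N ok u
sort obs
li linkidx evntidx rxrecidx, noo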
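
A possibly shorter variant of the same idea, offered only as a sketch and not part of the original post: after joinby, attach a uniform random number to every candidate pairing and keep the pairing with the smallest draw for each master row. It assumes Stata 10.1 or later (for runiform()), reuses the toy data above, and uses unmatched(master) rather than unmatched(both) so that only master rows are retained. The variable name u is just a scratch name.

```stata
* Sketch only: rebuild the toy using and master files from the example above
clear all
tempfile master using
input linkidx dupersid rxrecidx
020183 02019 1
020152 02019 2
110161 11010 1
110161 11010 3
end
save `using'
clear
input linkidx evntidx eventyr eventmm eventdd
020183 1 2006 8 6
020152 1 2006 8 6
110161 5 2006 4 10
110161 2 2006 7 19
110161 8 2006 5 8
end
gen obs = _n
save `master'

* Fix the seed so the picks are reproducible
set seed 101
use `master', clear
* All pairings within linkidx; master rows with no match are kept as well
joinby linkidx using `using', unmatched(master)
drop _merge
* One uniform draw per pairing; the smallest draw within each master row wins
gen double u = runiform()
bysort obs (u): keep if _n == 1
drop u
sort obs
list linkidx evntidx rxrecidx, noobs
```

Each master row ends up with exactly one using row, chosen with equal probability among its candidates, and different master rows may share the same using row.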



On Fri, Jul 31, 2009 at 3:11 AM, Michael I. Lichter<[email protected]> wrote:
> Anna,
>
> If this doesn't do what you want, you need to be more specific about your
> needs:
>
> ------
> use file2, clear
> set seed 20090730
> gen myorder = runiform()
> sort LINKIDX myorder
> tempfile file2tmp
> save `file2tmp'
> use file1
> merge LINKIDX using `file2tmp', sort
> drop myorder
> ------
>
> Michael
>
> [email protected] wrote:
>>
>> Hi all,
>>  I'm a relatively new STATA user, and I'm trying to merge a couple of
>> large datasets where neither the master nor the using dataset has a unique
>> key.  The data comes in this format:
>>  Dataset 1:  (note that LINKIDX is not unique)
>>     EVNTIDX        LINKIDX        EVENTYR  EVENTMM  EVENTDD  ...
>> 1.  300020190021   300020190083   2006     8        6
>> 2.  300020190021   300020190052   2006     8        6
>> 3.  300110100795   300110101161   2006     4        10
>> 4.  300110100822   300110101161   2006     7        19
>> 5.  300110100808   300110101161   2006     5        8
>>
>>
>> Dataset 2:  (note that LINKIDX is not unique)
>>     LINKIDX        DUPERSID   RXRECIDX  ...
>> 1.  300020190083   30002019   300020190083001
>> 2.  300020190083   30002019   300020198849002
>> 3.  300110101161   30011010   300110101161001
>> 4.  300110101161   30011010   300110101161003
>>
>>  I have already performed a merge where I limited dataset 1 to only
>> the unique observations of LINKIDX and linked them to the multiple
>> observations in dataset 2 (using a one-to-many merge). In the case of the
>> above datasets, it would involve linking observation 1 in dataset 1 to
>> observations 1 and 2 in dataset 2.  However, I would like to perform a
>> random link for the remaining observations. That is, for observations 3-5 in
>> dataset 1, which match the LINKIDX for observations 3 and 4 in dataset 2, I
>> would like Stata to randomly pick an observation in dataset 1 to merge with
>> each matching observation in dataset 2.  I am not sure whether I should simply
>> use the merge command, because it may result in systematic selection of one
>> observation in dataset 1.



