Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
Re: st: Matching samples in Stata
From: David Kantor <[email protected]>
To: [email protected]
Subject: Re: st: Matching samples in Stata
Date: Thu, 11 Oct 2012 16:53:01 -0400
Hi Paula,
At 01:40 PM 10/11/2012, you wrote:
HI David,
I finally got round to matching my sample.  I match the two samples 
on family education level and gender
mahapick ed_level_fam sex, idvar( "ID") genfile(D:\matched) 
nummatches(4) full treated(course)
where course is 1 for medicine and 0 for other - as in my analyses I 
want to compare medicine students vs. the others.  I created a file 
'matched' as I intend to import the relevant variables into it so 
that I can just run the analyses for this.
Ideally I want to only keep the first match.
However, when I check for duplicates using
duplicates list ID
I find that many of the matched respondents are the same for 
different medicine students.
Can you suggest what I am doing wrong and any way around this pls?
[...]
You are doing nothing wrong. Mahapick did what it was designed to do; 
it got the best 4 matches for each treated case -- with no regard for 
what is matched to other treated cases (similar to sampling with 
replacement). There is no guarantee that there will be uniqueness.
Incidentally, if you are checking for duplicates, you might want to try
 duplicates ...  if _matchnum==1
which will look at the best match for each treated case. That might 
be a better measure, as without filtering for _matchnum==1, you are 
comparing all matches; a given case might be, e.g., the first choice 
for one treated case, and the second (third, fourth) choice for 
another. (To clarify: _matchnum==1 gives the best choice for each 
treated case; _matchnum==2 is the second best choice; etc.)
There are two approaches to proceeding.
The first may seem like no approach at all, but we have done it in 
one of our studies using matched cases. Just use what you have, with 
the duplicates. But try to measure the duplication and report it 
along with your results.
For example, you might find that your matched sample is 85% unique. 
And that may be good enough.
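One way to put a number on the duplication is sketched below. It is only an illustration under stated assumptions: the toy data stand in for the file written by genfile(), which is assumed to hold _prime_id (the treated case), ID (the matched case, named after the idvar) and _matchnum; the variable first_of_id and the definition of "percent unique" (distinct matched cases divided by the number of first-choice matches) are choices made here, not something mahapick reports.

```stata
* Toy stand-in for the genfile; in practice: use "D:\matched", clear
clear
input _prime_id ID _matchnum
101 1 1
101 2 2
102 1 1
102 3 2
103 4 1
103 2 2
end

* Look only at the best match for each treated case
keep if _matchnum == 1

* Stata's own summary of repeated IDs
duplicates report ID

* Share of first-choice matches that are distinct cases
egen byte first_of_id = tag(ID)      // 1 for one observation per distinct ID
quietly count
local n_pairs = r(N)
quietly count if first_of_id
local n_distinct = r(N)
display "Percent unique: " %5.1f 100 * `n_distinct' / `n_pairs'
```

If the genfile was created with the full option, it may also be necessary to drop the records for the treated cases themselves before counting.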
The second is to do some kind of unique selection.
I did this somewhere, and if I can find it, I would let you have that 
code; I'll try to see if it can be located. The idea is...
        randomly choose a treated case
        select its closest match
        remove both the treated case and its match from the pool
        repeat this process on the remaining cases until all are matched.
The particular set you get will depend on a randomization of the 
selection order. That is, with Stata's random number generator, it 
will depend on the seed.
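The steps above might be sketched in Stata roughly as follows. This is not the code referred to in this message, only a minimal illustration under assumptions: the toy data stand in for the genfile, which is assumed to hold _prime_id, ID and _matchnum; IDs are assumed to be numeric; the seed value and the variables u and chosen are made up for the example. The loop is quadratic in the number of records, so it is meant for clarity rather than speed.

```stata
* Toy stand-in for the genfile; in practice: use "D:\matched", clear
clear
input _prime_id ID _matchnum
101 1 1
101 2 2
102 1 1
102 3 2
103 4 1
103 2 2
end

* Assumption: with the full option, treated cases appear with _matchnum == 0
drop if _matchnum == 0

set seed 12345                       // arbitrary; the result depends on it

* One random number per treated case defines the selection order
bysort _prime_id (_matchnum): gen double u = runiform() if _n == 1
by _prime_id: replace u = u[1]
sort u _prime_id _matchnum

* Walk through the records: treated cases in random order, and within
* each treated case from best to worst candidate. Take a record only if
* neither its treated case nor its candidate has been used already.
gen byte chosen = 0
local N = _N
forvalues i = 1/`N' {
    local t = _prime_id[`i']
    local c = ID[`i']
    quietly count if chosen & (_prime_id == `t' | ID == `c')
    if r(N) == 0 {
        quietly replace chosen = 1 in `i'
    }
}

keep if chosen
list _prime_id ID _matchnum, noobs
```

A treated case whose candidates have all been taken by earlier cases ends up with no match, which is one reason to ask mahapick for more candidates than will finally be kept.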
Note that this procedure will get one match per treated case. If you 
want more, say 3, then you repeat the whole process again and again. 
(It will help to have nummatches significantly larger than the number 
of desired final matches per treated case. In your example, you want 
1 match per treated case, so nummatches(4) is probably okay, but you 
might as well make it a bit higher.)
This randomization process is a pragmatic way to go. But there may be 
a more ideal goal, such as to minimize the total distance measures.
Doing that is very complicated for large numbers of cases; it's a 
subject that's open for research, much like the travelling 
salesperson problem. I can find some references and some R code that 
purports to do it. One reference mentions the use of network flow 
algorithms -- not that I know about that. But let me know if you want 
those references.
Just off the top of my head, one possibility is to start by taking 
all matches that are unique -- the ones that are not under contention 
to be matched to different treated cases. This may be a large portion 
of the cases. Then you only have to worry about the remaining smaller 
set. (That is, unless it works out that reassigning a non-contended 
match can result in a better overall result -- analogous to the case 
where, in the travelling salesperson problem, close cities are not 
visited in sequence.)
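To see how large the uncontended portion is, something like the following could be tried. It is a sketch only: the toy data stand in for the genfile (assumed to hold _prime_id, ID and _matchnum), and n_claims and contended are names invented for the example. It looks at first choices only and does not attempt the reassignment mentioned in the parenthesis.

```stata
* Toy stand-in for the genfile; in practice: use "D:\matched", clear
clear
input _prime_id ID _matchnum
101 1 1
101 2 2
102 1 1
102 3 2
103 4 1
103 2 2
end

* Best match for each treated case
keep if _matchnum == 1

* How many treated cases claim each matched case as their first choice?
bysort ID: gen n_claims = _N
gen byte contended = n_claims > 1
tabulate contended

* Uncontended pairs can be accepted as they are;
* the contended ones are the smaller set still to be resolved.
list _prime_id ID if !contended, noobs
list _prime_id ID if contended, noobs
```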
On one occasion, we wanted an optimized unique matching. We tried a 
process where we started with a given matching, and then iteratively 
swapped matches so as to minimize the total distance measure -- done 
outside of Stata. (Stata seemed awkward for the task; possibly Mata 
would do fine, but it wasn't available at that time.) Though we got 
an optimized set, the analytical results were no better than the 
original set. That is, after a lot of trouble and expense, the result 
was no better.
I hope this is useful. I will look for the random-selection code.
--David
P.S., With the randomization process, you can, say, do it three 
times, and run your analysis on the three matches.