Re: st: Matching samples in Stata

From   David Kantor
Subject   Re: st: Matching samples in Stata
Date   Thu, 11 Oct 2012 16:53:01 -0400

Hi Paula,

At 01:40 PM 10/11/2012, you wrote:
Hi David,

I finally got round to matching my sample. I match the two samples on family education level and gender

mahapick ed_level_fam sex, idvar( "ID") genfile(D:\matched) nummatches(4) full treated(course)

where course is 1 for medicine and 0 for other - as in my analyses I want to compare medicine students vs. the others. I created a file 'matched' as I intend to import the relevant variables into it so that I can just run the analyses for this.

Ideally I want to only keep the first match.

However, when I check for duplicates using

duplicates list ID

I find that many of the matched respondents are the same for different medicine students.

Can you suggest what I am doing wrong and any way around this pls?

You are doing nothing wrong. Mahapick did what it was designed to do: it found the best 4 matches for each treated case, with no regard for what is matched to other treated cases (similar to sampling with replacement). There is no guarantee of uniqueness.
Incidentally, if you are checking for duplicates, you might want to try
 duplicates ...  if _matchnum==1
which looks only at the best match for each treated case. That may be a better measure: without the filter on _matchnum==1, you are comparing all matches, and a given case might be, e.g., the first choice for one treated case and the second (third, fourth) choice for another. (To clarify: _matchnum==1 is the best choice for each treated case; _matchnum==2 is the second-best choice; etc.)

There are two approaches to proceeding.
The first may seem like no approach at all, but we have done it in one of our studies using matched cases. Just use what you have, duplicates included, but measure the duplication and report it along with your results. For example, you might find that your matched sample is 85% unique, and that may be good enough.
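One way to put a number on the duplication is sketched below in Python rather than Stata, purely as an illustration. The row layout (treated ID, matched ID, match rank) and the definition of "unique" used here (the share of matches at a given rank whose control is used by no other treated case) are assumptions for the example, not something mahapick defines.

```python
from collections import Counter

# Hypothetical rows from a match file: (treated_id, control_id, matchnum).
rows = [
    (1, 101, 1), (1, 102, 2),
    (2, 101, 1), (2, 103, 2),
    (3, 104, 1), (3, 102, 2),
]

def uniqueness(rows, matchnum=1):
    """Share of matches at the given rank whose control appears only once."""
    controls = [c for _, c, m in rows if m == matchnum]
    counts = Counter(controls)
    unique = sum(1 for c in controls if counts[c] == 1)
    return unique / len(controls)

# Control 101 is the first choice of two treated cases, so only the
# match to control 104 counts as unique among the three first choices.
share = uniqueness(rows)
```

Other definitions are equally defensible, for example the number of distinct controls divided by the number of matches; whichever is used should be stated with the results.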

The second is to do some kind of unique selection.
I did this somewhere, and if I can find the code, I will send it to you. The idea is...
        randomly choose a treated case
        select its closest match
        remove both the treated case and its match from the pool
        repeat this process on the remaining cases until all are matched.
The particular set you get will depend on the randomization of the selection order. That is, with Stata's random number generator, it will depend on the seed.
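The steps above can be sketched as follows. This is a minimal Python illustration, not the code mentioned in this message; the candidate lists (each treated ID mapped to its matches ordered best to worst) and the function name are hypothetical. "Closest match" is read here as the best-ranked candidate that is still available.

```python
import random

# Hypothetical candidate lists: treated_id -> control ids, best match first.
candidates = {
    1: [101, 102, 103, 104],
    2: [101, 103, 102, 104],
    3: [104, 102, 101, 103],
}

def random_unique_match(candidates, seed):
    """Visit treated cases in random order; give each its best unused control."""
    rng = random.Random(seed)      # the result depends on this seed
    order = list(candidates)
    rng.shuffle(order)
    used = set()                   # controls already removed from the pool
    result = {}
    for treated in order:
        for control in candidates[treated]:
            if control not in used:
                result[treated] = control
                used.add(control)
                break              # one match per treated case
    return result

matching = random_unique_match(candidates, seed=12345)
```

A treated case whose candidates have all been taken is left unmatched in this sketch, which is one reason to keep the candidate lists longer than the number of matches finally wanted.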

Note that this procedure will get one match per treated case. If you want more, say 3, then you repeat the whole process again and again. (It will help to have nummatches significantly larger than the number of desired final matches per treated case. In your example, you want 1 match per treated case, so nummatches(4) is probably okay, but you might as well make it a bit higher.)

This randomization process is a pragmatic way to go. But there may be a more ideal goal, such as minimizing the total of the distance measures. Doing that is very complicated for large numbers of cases; it's a subject that's open for research, much like the travelling salesperson problem. I can find some references and some R code that purports to do it. One reference mentions the use of network flow algorithms -- not that I know about that. But let me know if you want those references.

Just off the top of my head, one possibility is to start by taking all matches that are unique -- the ones that are not under contention to be matched to different treated cases. This may be a large portion of the cases. Then you only have to worry about the remaining smaller set. (That is, unless it works out that reassigning a non-contended match can result in a better overall result -- analogous to the case where, in the travelling salesperson problem, close cities are not visited in sequence.)
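A rough Python sketch of that first step, separating the first-choice matches that nobody else wants from the ones under contention. The mapping of treated ID to first-choice control is hypothetical example data.

```python
from collections import Counter

# Hypothetical first choices: treated_id -> best-matching control id.
first_choice = {1: 101, 2: 101, 3: 104, 4: 105}

def split_by_contention(first_choice):
    """Return (settled, contended) first-choice matches."""
    counts = Counter(first_choice.values())
    settled = {t: c for t, c in first_choice.items() if counts[c] == 1}
    contended = {t: c for t, c in first_choice.items() if counts[c] > 1}
    return settled, contended

# Treated cases 3 and 4 keep their matches; 1 and 2 both want control 101.
settled, contended = split_by_contention(first_choice)
```

Only first choices are compared here. A settled control could still be the second or third choice of a contended treated case, so it would have to be dropped from those candidate lists before the smaller set is resolved.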

On one occasion, we wanted an optimized unique matching. We tried a process where we started with a given matching, and then iteratively swapped matches so as to minimize the total distance measure -- done outside of Stata. (Stata seemed awkward for the task; possibly Mata would do fine, but it wasn't available at that time.) Though we got an optimized set, the analytical results were no better than the original set. That is, after a lot of trouble and expense, the result was no better.
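The swapping idea can be illustrated with a small Python sketch. It is not the procedure we actually used, whose details are not given here; the distance table and function names are hypothetical, and only pairwise swaps between already-matched treated cases are tried, so the result is a local optimum at best.

```python
from itertools import combinations

# Hypothetical distances: dist[treated_id][control_id].
dist = {
    1: {101: 1.0, 102: 5.0},
    2: {101: 2.0, 102: 1.0},
}

def total_distance(assign, dist):
    """Sum of distances over all treated-control pairs in the matching."""
    return sum(dist[t][c] for t, c in assign.items())

def improve_by_swaps(assign, dist):
    """Swap the controls of two treated cases whenever that lowers the total."""
    assign = dict(assign)          # work on a copy
    improved = True
    while improved:
        improved = False
        for a, b in combinations(list(assign), 2):
            ca, cb = assign[a], assign[b]
            # Skip pairs for which a swapped distance is not available.
            if cb not in dist[a] or ca not in dist[b]:
                continue
            if dist[a][cb] + dist[b][ca] < dist[a][ca] + dist[b][cb]:
                assign[a], assign[b] = cb, ca
                improved = True
    return assign

start = {1: 102, 2: 101}           # total distance 7.0
better = improve_by_swaps(start, dist)
```

In this toy example the single swap brings the total distance from 7.0 down to 2.0.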

I hope this is useful. I will look for the random-selection code.

P.S. With the randomization process, you can, say, do it three times (with different seeds), and run your analysis on each of the three matched sets.
