Re: st: Matching samples in Stata

From	David Kantor <[email protected]>
To	[email protected]
Subject	Re: st: Matching samples in Stata
Date	Thu, 11 Oct 2012 16:53:01 -0400

Hi Paula,

At 01:40 PM 10/11/2012, you wrote:

Hi David,
I finally got round to matching my sample. I match the two samples on family education level and gender
mahapick ed_level_fam sex, idvar( "ID") genfile(D:\matched) nummatches(4) full treated(course)
where course is 1 for medicine and 0 for other - as in my analyses I want to compare medicine students vs. the others. I created a file 'matched' as I intend to import the relevant variables into it so that I can just run the analyses for this.
Ideally I want to only keep the first match.

However, when I check for duplicates using

duplicates list ID
I find that many of the matched respondents are the same for different medicine students.
Can you suggest what I am doing wrong and any way around this pls?
[...]

You are doing nothing wrong. Mahapick did what it was designed to do; it got the best 4 matches for each treated case -- with no regard for what is matched to other treated cases (similar to sampling with replacement). There is no guarantee that there will be uniqueness.

Incidentally, if you are checking for duplicates, you might want to try
 duplicates ...  if _matchnum==1

which will look at the best match for each treated case. That might be a better measure, as without filtering for _matchnum==1, you are comparing all matches; a given case might be, e.g., the first choice for one treated case, and the second (third, fourth) choice for another. (To clarify: _matchnum==1 gives the best choice for each treated case; _matchnum==2 is the second best choice; etc.)


There are two approaches to proceeding.

The first may seem like no approach at all, but we have done it in one of our studies using matched cases. Just use what you have, with the duplicates. But try to measure the duplication and report it along with your results. For example, you might find that your matched sample is 85% unique. And that may be good enough.
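
A minimal sketch of measuring that duplication. It assumes the file written by genfile() has one row per match, with the treated case's id in _prime_id, the matched case's id in ID, and the rank in _matchnum; that layout is an assumption, so check it against your own file. The toy data below only stands in for that file.

```stata
* Toy stand-in for the matched file; the layout is an assumption
clear
input _prime_id ID _matchnum
101 201 1
101 202 2
102 201 1
102 203 2
103 204 1
103 201 2
end

* Keep only the best match for each treated case
keep if _matchnum == 1

* Flag the first occurrence of each matched ID, then count distinct IDs
bysort ID: generate byte first = (_n == 1)
quietly count if first
local nuniq = r(N)

display "Distinct controls: `nuniq' of " _N " best matches"
display "Percent unique: " %5.1f (100 * `nuniq' / _N)
```

With the real file, replace the input block with a use command for the matched file and report the resulting percentage alongside the analysis.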


The second is to do some kind of unique selection.

I did this somewhere, and if I can find it, I will let you have that code; I'll try to see if it can be located. The idea is...

        randomly choose a treated case
        select its closest match
        remove both the treated case and its match from the pool
        repeat this process on the remaining cases until all are matched.
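
The steps above can be sketched in Stata as follows. This is an illustration only, not the original code referred to in this message. It assumes a candidate file with one row per (treated case, candidate) pair, holding _prime_id, ID, and _matchnum; the variables u, turn, used, and chosen are names made up for the sketch. If your file also contains rows for the treated cases themselves, drop those first.

```stata
* Toy candidate list; the layout is an assumption about the genfile() output
clear
input _prime_id ID _matchnum
101 201 1
101 202 2
102 201 1
102 203 2
103 204 1
103 201 2
end

* The final selection depends on this seed
set seed 20121011

* Give each treated case one random draw and rank the draws
bysort _prime_id (_matchnum): generate double u = runiform() if _n == 1
by _prime_id: replace u = u[1]
egen long turn = group(u _prime_id)

generate byte used = 0
generate byte chosen = 0

summarize turn, meanonly
local nturns = r(max)

forvalues t = 1/`nturns' {
    * Best remaining candidate for the treated case whose turn it is
    summarize _matchnum if turn == `t' & !used, meanonly
    if r(N) > 0 {
        local best = r(min)
        quietly replace chosen = 1 if turn == `t' & _matchnum == `best'
        * Remove the selected control from the pool
        summarize ID if turn == `t' & chosen, meanonly
        local pick = r(min)
        quietly replace used = 1 if ID == `pick'
    }
}

* One row per matched treated case; each control appears at most once
keep if chosen
isid ID
list _prime_id ID _matchnum
```

A treated case whose candidates have all been taken by earlier cases is left unmatched here, which is one reason to ask mahapick for more candidates than you finally need.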

The particular set you get will depend on a randomization of the selection order. That is, with Stata's random number generator, it will depend on the seed.

Note that this procedure will get one match per treated case. If you want more, say 3, then you repeat the whole process again and again. (It will help to have nummatches significantly larger than the number of desired final matches per treated case. In your example, you want 1 match per treated case, so nummatches(4) is probably okay, but you might as well make it a bit higher.)

This randomization process is a pragmatic way to go. But there may be a more ideal goal, such as to minimize the total of the distance measures. Doing that is very complicated for large numbers of cases; it's a subject that's open for research, much like the travelling salesperson problem. I can find some references and some R code that purports to do it. One reference mentions the use of network flow algorithms -- not that I know about that. But let me know if you want those references.
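
For concreteness, the "more ideal goal" can be written as an assignment problem. The notation is introduced here for illustration and is not from mahapick: T is the set of treated cases, C the pool of potential matches, d_ij the distance between treated case i and candidate j, and x_ij = 1 when j is assigned to i.

```latex
\min_{x} \; \sum_{i \in T} \sum_{j \in C} d_{ij} \, x_{ij}
\quad \text{subject to} \quad
\sum_{j \in C} x_{ij} = 1 \;\; \forall i \in T, \qquad
\sum_{i \in T} x_{ij} \le 1 \;\; \forall j \in C, \qquad
x_{ij} \in \{0, 1\}.
```

The first constraint gives every treated case exactly one match; the second is the uniqueness requirement, that no candidate is used more than once.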

Just off the top of my head, one possibility is to start by taking all matches that are unique -- the ones that are not under contention to be matched to different treated cases. This may be a large portion of the cases. Then you only have to worry about the remaining smaller set. (That is, unless it works out that reassigning a non-contended match can result in a better overall result -- analogous to the case where, in the travelling salesperson problem, close cities are not visited in sequence.)
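
A rough sketch of that first step, separating uncontended best matches from contended ones. As before, the file layout (_prime_id, ID, _matchnum) is an assumption, the toy data stands in for the real matched file, and claims and contended are names made up for the sketch.

```stata
* Toy stand-in for the matched file; the layout is an assumption
clear
input _prime_id ID _matchnum
101 201 1
101 202 2
102 201 1
102 203 2
103 204 1
103 201 2
end

* Look only at each treated case's best match
keep if _matchnum == 1

* Count how many treated cases claim each control as their best match
bysort ID: generate int claims = _N

* Pairs with a single claim can be accepted as they stand
generate byte contended = (claims > 1)
tabulate contended
list _prime_id ID if contended
```

Only the treated cases listed as contended would then need the random-order selection or any further optimization, subject to the caveat above about reassigning non-contended matches.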

On one occasion, we wanted an optimized unique matching. We tried a process where we started with a given matching, and then iteratively swapped matches so as to minimize the total distance measure -- done outside of Stata. (Stata seemed awkward for the task; possibly Mata would do fine, but it wasn't available at that time.) Though we got an optimized set, the analytical results were no better than with the original set. That is, after a lot of trouble and expense, the result was no better.


I hope this is useful. I will look for the random-selection code.
--David

P.S. With the randomization process, you can, say, do it three times, and run your analysis on each of the three matched sets.

