re: Re: st: Statistical Matching


From   "Ariel Linden, DrPH" <ariel.linden@gmail.com>
To   <statalist@hsphsun2.harvard.edu>
Subject   re: Re: st: Statistical Matching
Date   Wed, 20 Jul 2011 14:21:01 -0700

I agree with Austin (always!) that there is no reason why the propensity
score could not be used here. In fact, it probably makes more sense when you
have such a huge N.
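
A minimal sketch of the propensity-score idea, written in Python rather than Stata purely for illustration: stack the two datasets, label dataset 1 as T=0 and dataset 2 as T=1, fit a logit of T on X, and match each dataset-1 row to the dataset-2 row with the closest fitted score. The function names, the single covariate, and the toy values are assumptions, and the plain gradient-ascent fit stands in for a proper logit routine.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(X, t, steps=2000, lr=0.1):
    """Fit P(T=1 | x) = sigmoid(b0 + b.x) by batch gradient ascent."""
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)              # beta[0] is the intercept
    for _ in range(steps):
        grad = [0.0] * (k + 1)
        for row, ti in zip(X, t):
            p = sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], row)))
            err = ti - p                # score contribution of this row
            grad[0] += err
            for j, x in enumerate(row):
                grad[j + 1] += err * x
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def score(beta, row):
    """Fitted propensity score for one row of covariates."""
    return sigmoid(beta[0] + sum(b * x for b, x in zip(beta[1:], row)))

def match_on_score(X1, X2):
    """For each row of X1, return the index of the X2 row with the closest score."""
    X = X1 + X2                          # the appended datasets
    t = [0] * len(X1) + [1] * len(X2)    # T=0 for dataset 1, T=1 for dataset 2
    beta = fit_logit(X, t)
    s2 = [score(beta, r) for r in X2]
    matches = []
    for r in X1:
        s = score(beta, r)
        # Brute-force scan; matching is with replacement.
        matches.append(min(range(len(s2)), key=lambda j: abs(s2[j] - s)))
    return beta, matches

# Hypothetical toy data: one covariate per observation.
X1 = [[0.0], [1.0], [2.0], [3.0]]
X2 = [[2.0], [3.0], [4.0], [5.0]]
beta, matches = match_on_score(X1, X2)
print(matches)
```

The scan over all dataset-2 scores is quadratic in the number of observations. With millions of rows, sorting the scores once and searching the sorted list would be the more realistic route.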

I would suggest you look at -cem- (a user-written program by Matt Blackwell
and Gary King at Harvard). That program will allow you to match on several
variables or on the propensity score - your choice. I am not sure how well
it will perform in such a large dataset though.
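
To make the general idea of coarsened exact matching concrete, here is a small Python sketch. It is not the -cem- implementation and does not reproduce its options or output; the variable names, cutpoints, and records are all hypothetical. Each variable is cut into bins, and observations from the two datasets are matched when they fall in the same combination of bins.

```python
from collections import defaultdict

def coarsen(value, cutpoints):
    """Return the bin index of value, given cutpoints in ascending order."""
    return sum(value >= c for c in cutpoints)

def stratum(row, cutpoints_by_var):
    """Combine the bin indices of all matching variables into one key."""
    return tuple(coarsen(row[v], cuts) for v, cuts in cutpoints_by_var.items())

def coarsened_exact_match(data1, data2, cutpoints_by_var):
    """Map each id in data1 to the ids in data2 sharing its stratum."""
    by_stratum = defaultdict(list)
    for row in data2:
        by_stratum[stratum(row, cutpoints_by_var)].append(row["id"])
    return {
        row["id"]: list(by_stratum.get(stratum(row, cutpoints_by_var), []))
        for row in data1
    }

# Hypothetical cutpoints and records.
cuts = {"age": [30, 50], "income": [20000, 60000]}
data1 = [
    {"id": "a1", "age": 25, "income": 15000},
    {"id": "a2", "age": 45, "income": 30000},
    {"id": "a3", "age": 70, "income": 90000},
]
data2 = [
    {"id": "b1", "age": 28, "income": 18000},
    {"id": "b2", "age": 41, "income": 55000},
    {"id": "b3", "age": 33, "income": 25000},
    {"id": "b4", "age": 52, "income": 30000},
]
cem_matches = coarsened_exact_match(data1, data2, cuts)
print(cem_matches)
```

In this sketch each dataset is read once and grouped by stratum, so no pairwise comparison is needed. Finer cutpoints give closer matches but leave more observations unmatched, as with a3 here.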

Ariel

Date: Tue, 19 Jul 2011 12:38:43 -0400
From: Austin Nichols <austinnichols@gmail.com>
Subject: Re: st: Statistical Matching

Gillette, Ryan (Volunteer) <Ryan_K_Gillette@omb.eop.gov>:
You can still use propensity scores, defining dataset 1 as T=0 and
dataset 2 as T=1 and e.g. running a logit of T on X in the appended
datasets.  Without more detail, it is hard to offer specific advice.
No user-written software is required, but there is much available to
download.  You can define a multivariate distance metric and get the
minimum-distance observations as matches, or you can do exact matching
by simply sorting appropriately, resampling with replacement the
appropriate number of times to achieve identical marginal
distributions, and then doing an unmatched -merge-.  This is
especially easy if you have weights in each dataset that sum to the
same population total.  N.B. the -sort- can be used to match on one
continuous variable by rank within categories of discrete variables.
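
Two of the options above can be sketched in a few lines of Python, used here only for illustration. One is a weighted distance metric with minimum-distance matches. The other is matching on one continuous variable by rank within categories of discrete variables. All names, weights, and records are hypothetical. The rank-matching function leaves out the resampling step that equalizes cell sizes, so surplus observations in a cell simply stay unmatched.

```python
from collections import defaultdict

def nearest_match(row, candidates, weights):
    """Return the id of the candidate with the smallest weighted squared distance."""
    def dist(c):
        return sum(w * (row[v] - c[v]) ** 2 for v, w in weights.items())
    return min(candidates, key=dist)["id"]

def rank_match(data1, data2, cell_vars, cont_var):
    """Pair observations by rank on cont_var within cells of cell_vars."""
    def sorted_cells(data):
        cells = defaultdict(list)
        for r in data:
            cells[tuple(r[v] for v in cell_vars)].append(r)
        for rows in cells.values():
            rows.sort(key=lambda r: r[cont_var])
        return cells

    c1, c2 = sorted_cells(data1), sorted_cells(data2)
    pairs = []
    for key, rows1 in c1.items():
        # zip stops at the shorter list, so unequal cells leave leftovers.
        for r1, r2 in zip(rows1, c2.get(key, [])):
            pairs.append((r1["id"], r2["id"]))
    return pairs

# Hypothetical records from the two datasets.
data1 = [
    {"id": "a1", "sex": "F", "age": 30, "income": 40.0},
    {"id": "a2", "sex": "F", "age": 50, "income": 20.0},
    {"id": "a3", "sex": "M", "age": 40, "income": 35.0},
]
data2 = [
    {"id": "b1", "sex": "F", "age": 31, "income": 90.0},
    {"id": "b2", "sex": "F", "age": 45, "income": 41.0},
    {"id": "b3", "sex": "M", "age": 60, "income": 30.0},
]

# A larger weight on age demands more precision on age than on income.
loose_on_age = nearest_match(data1[0], data2, {"age": 4.0, "income": 1.0})
strict_on_age = nearest_match(data1[0], data2, {"age": 100.0, "income": 1.0})
pairs = rank_match(data1, data2, ["sex"], "income")
print(loose_on_age, strict_on_age, pairs)
```

The distance search scans every candidate for each observation, which is impractical at millions of rows unless candidates are first restricted, for example to the same discrete cell. The rank approach needs only a sort per dataset.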

On Tue, Jul 19, 2011 at 12:26 PM, Gillette, Ryan (Volunteer)
<Ryan_K_Gillette@omb.eop.gov> wrote:
> Hello,
>
> I am trying to match comparable observations between two large datasets
> (300,000 to 3 million observations, depending which ones I decide to use). I
> am not trying to calculate a treatment effect, but rather identify the id
> number or observation number of an observation's closest match. I am
> matching across a few variables, some of which I want to weight more than
> others in terms of required precision. I don't think I will be able to use
> a propensity score, as it doesn't seem appropriate for my task.
>
> Does anyone know a program in Stata that can do these things? I have used
> -nnmatch- before, but with such a large dataset I worry it could take days
> to process. Is there a way to speed it up? Any ideas would be much
> appreciated!
>
> Thanks,
>
> Ryan

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/


© Copyright 1996–2014 StataCorp LP