# st: finding a nearest neighbor using psmatch2 --- problem achieving covariate balance

From: Pierre Azoulay
To: statalist@hsphsun2.harvard.edu
Subject: st: finding a nearest neighbor using psmatch2 --- problem achieving covariate balance
Date: Tue, 3 Mar 2009 18:48:43 -0500

```
Dear Statalisters:

I am facing a problem that is vexing, in that it should be easy to
solve. But I have been stuck on it for a while.

I have two groups of scientists. The treatment group has n
observations. The control group has about 500*n observations. There is
NO common support problem. Think of n in the thousands.
We want to find a nearest neighbor for each guy in the treatment
group. We want to do this non-parametrically. Though we have many
observables, these covariates do not predict at all whether a
scientist is treated or control. Therefore a propensity score approach

[a bit of context: the treatment is having a "superstar collaborator" that dies]

We are using psmatch2, using the nearest-neighbor mahalanobis option.
We are using six variables. Let's call them x1, x2, ..., x5, and log(y0).

psmatch2 treat, mahalanobis(x1 x2 x3 x4 x5 logy0)

y0 is the baseline stock of publications for our scientists. It is
very skewed. Hence the log transformation, which we thought might
improve the match. We care a lot about matching on y0 --- this is
basically the lagged dependent variable in our analysis.

With so many potential controls to choose from, the outcome of this
procedure is very good on x1, x2 through x5. These covariates are very
well balanced. Not so in the case of log(y0) or y0.
The mean of the treated is higher than that of the matched controls,
significantly so. And the problem lies in the right tail. The medians
line up exactly. So does the 75th percentile. It's in the top quartile
that things go wrong.

I said the underlying data does not suffer from a common support
problem. There are indeed lots of potential control guys with a stock
of pubs/y0 in the tail. At the same time, it is true that the right
tail is fatter among treated than in the population of potential
controls.
We could achieve balance on y0 if we were matching only on that. But

Does anyone know of a trick (another transformation besides the log?)
that might enable us to do better at identifying matches for the guys
"in the tail" of the distribution of y0?

Ideally, psmatch2 would enable the researcher to specify an exact
match on logy0, and a mahalanobis match on x1-x5. But it does not.

And I must confess that my limited programming skills do not enable me
to muck around in the psmatch2 ado-file to create that option.

Any suggestion would be appreciated!

Sincerely,

Pierre

-------------------------------------------------------------------
Pierre Azoulay
Associate Professor of Strategy
Massachusetts Institute of Technology
Sloan School of Management
50 Memorial Drive — E52-555
Cambridge, MA 02142-1947

Tel [Sloan]: (617) 258-9766
Tel [NBER]: (617) 588-1464
Fax: (617) 253-2660

```
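One way to approximate the "exact match on logy0, Mahalanobis match on x1-x5" behavior described in the message, without editing the ado-file, is to coarsen logy0 into bins and run psmatch2 separately within each bin. The following is only a sketch under stated assumptions, not a tested solution to the poster's problem: the names `ybin`, `w_match`, and the choice of 20 bins are hypothetical; it assumes logy0 has no missing values, that psmatch2 accepts an `if` qualifier, and that it leaves its matching weights in `_weight`, as its help file describes. The cutpoints are taken from the treated group's distribution so that the fat right tail of the treated is split into several bins rather than lumped into one.

```stata
* Sketch only: coarsened exact matching on logy0, Mahalanobis on x1-x5.
* Assumes variables treat, x1-x5, logy0 exist and logy0 is never missing.

* 1. Cutpoints at the ventiles of logy0 among the treated (hypothetical choice).
_pctile logy0 if treat == 1, nquantiles(20)
forvalues i = 1/19 {
    local c`i' = r(r`i')
}

* 2. Assign every observation (treated and control) to a bin.
gen byte ybin = 1
forvalues i = 1/19 {
    replace ybin = `i' + 1 if logy0 > `c`i''
}

* 3. Run psmatch2 within each bin and collect the matching weights.
gen double w_match = .
forvalues b = 1/20 {
    capture noisily psmatch2 treat if ybin == `b', ///
        mahalanobis(x1 x2 x3 x4 x5)
    if _rc == 0 {
        * _weight is rebuilt by each psmatch2 call, so copy it out now
        replace w_match = _weight if ybin == `b'
    }
}

* 4. Compare the distribution of logy0 across treated and matched controls.
summarize logy0 if treat == 1 & !missing(w_match), detail
summarize logy0 [aweight = w_match] if treat == 0, detail
```

Narrower bins push the match on logy0 closer to exact, at the cost of fewer candidate controls per treated scientist for the Mahalanobis step; with roughly 500 controls per treated observation there may be room to use considerably more than 20 bins. Bins that turn out empty or contain no controls are skipped by the `capture`, so the treated observations in them stay unmatched.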