Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

Re: st: Calculating Euclidean Distance

From: Anthony Laverty
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Calculating Euclidean Distance
Date: Fri, 11 Jun 2010 16:55:43 +0100

```
I certainly don't indeed! You got it in one that I am matching on
patient volume pre-policy changes, using a dummy variable for before
and after it takes effect. I think I am indeed moving down the road of
estimating a few different ways on the data and in simulations, and
comparing the results, so your code and pointers toward -xtdpd- is

Many thanks
Anthony

On Fri, Jun 11, 2010 at 3:29 PM, Austin Nichols <austinnichols@gmail.com> wrote:
>
> Well, you certainly don't want to match on your outcome variable, so I
> assume you are matching on patient volumes from the pre period, before
> any policy changes, and maybe you have a dummy t measuring whether a
> particular policy was instituted, and you have an outcome y which is
> patient volume at some later date.  Then define x1 to x12 for months 1
> to 12 of the pre period (or whatever months are in the pre period),
> and use -nnmatch- (remembering that you can get x1 to x12 from the
> data structure you outlined via -reshape- to wide form).  See also
> -help xtdpd- and related manual entries, if you want to compare to a
> regression taking account of the lagged dep var on the RHS.  But
> compare some other approaches:
>
> set seed 1234
> clear
> input str1 hospital time patients
>  A 1 456
>  A 2 759
>  A 3 236
>  B 1 214
>  B 2 854
>  B 3 325
>  C 1 250
>  C 2 321
>  C 3 852
> end
> * make more fake data
> expand 100
> ren patients x
> bys time (hospital): g g=_n
> drop hospital
> replace x=ceil(uniform()*x)
> reshape wide x, i(g) j(time)
> *make a fake treatment corr with observed x
> g byte t=(uniform()<x2/500)
> g y=ceil(x1^2+x2^2/2+x3^2/3+t+rnormal()*10)
> * estimate effect of treatment t with nnmatch or reg
> nnmatch y t x1-x3, met(maha) bias(bias) robust(4)
> reg y t
> reg y t c.x1##c.x1 c.x2##c.x2 c.x3##c.x3
> *now parametric propensity score reweighting
> qui logit t c.x1##c.x1 c.x2##c.x2 c.x3##c.x3
> predict p
> g pw=cond(t,1/p,1/(1-p))
> reg y t [pw=pw]
> reg y t c.x1##c.x1 c.x2##c.x2 c.x3##c.x3 [pw=pw]
> *now nonparametric propensity score reweighting
> forv i=1/3 {
>  xtile z`i'=x`i', nq(4)
>  }
> egen np=mean(t), by(z1 z2 z3)
> g npw=cond(t,1/np,1/(1-np))
> reg y t [pw=npw]
> reg y t c.x1##c.x1 c.x2##c.x2 c.x3##c.x3 [pw=npw]
>
> The last, a double-robust approach with nonparametric propensity score
> reweighting, has a variety of proven advantages over alternatives.
> None has sufficient power, but some think they do...  you may want to
> design a simulation based on your data and some hypothesized treatment
> effects, to see what seems to have the lowest bias or MSE in your
> design.  Or just estimate 10 different ways, and hope you get similar
>
>
> On Fri, Jun 11, 2010 at 4:43 AM, Anthony Laverty
>> Fair enough, I didn't really give too much more away. After the
>> matching I am planning on running a difference-in-differences analysis
>> to assess the effect of policy changes on patient numbers, using
>> the matches as a comparison group. Mahalanobis distance may in fact be
>> an improvement, so I will look that up
>>
>> Many thanks
>>
>> On Thu, Jun 10, 2010 at 4:50 PM, Austin Nichols <austinnichols@gmail.com> wrote:
>>> You didn't give more detail on your problem--what are you going to use
>>> the matches for?  Why use the sum of squared differences in each
>>> month, as opposed to, say the Mahalanobis distance over all months
>>> (-reshape- to have T variables measuring # of patients in each month,
>>> and find the closest 15 obs in the standard deviation metric)?  That
>>> would match not only on levels but on seasonal patterns, for example.
>>> Is there a regression you plan to run after matching?  You may want to
>>> -findit nnmatch- in that case.
>>>
>>> On Thu, Jun 10, 2010 at 11:30 AM, Anthony Laverty
>>>> Hi Austin
>>>>
>>>> and perhaps using a log scale
>>>>
>>>> Unfortunately, what I really want to be able to do is choose a group
>>>> of hospitals (say 15) which are closest in Euclidean distance terms to
>>>> hospital A over all months, rather than just the one closest hospital.
>>>> I was planning to aggregate these for the whole of the time period at
>>>> the end, if that makes things any easier.
>>>>
>>>> In terms of more detail I'm not sure if it helps to say that this was
>>>> relatively easy to work out in Excel, using a different column for
>>>> each time period, a row for each hospital, and the number of patients
>>>> for each time period in a table like this. Then, it was quite easy to
>>>> work out the distances with the equation subtracting different
>>>> hospitals' numbers from each other, using if statements to match on
>>>> time. The new data I have is too big for Excel to do this, which is
>>>> why I have turned to Stata (and Statalist)
>>>>
>>>>
>>>> Anthony
>>>>
>>>>
>>>> On Thu, Jun 10, 2010 at 2:59 PM, Austin Nichols <austinnichols@gmail.com> wrote:
>>>>> If you have N hospitals at T points in time, then you will have NTxN
>>>>> squared distances in your variables, and if they are doubles you may
>>>>> well run out of memory long before that, but if all you want is the
>>>>> nearest hospital, then you want one variable per hospital giving the
>>>>> identity of the nearest (over all months, you seem to suggest). You
>>>>> might also want to compute distance on a log scale, or some other
>>>>> metric. With more detail on your problem, you may get a better answer.
>>>>> Nevertheless, this is like what you asked for, I think:
>>>>>
>>>>> clear
>>>>> input str1 hospital time patients
>>>>>  A 1 456
>>>>>  A 2 759
>>>>>  A 3 236
>>>>>  B 1 214
>>>>>  B 2 854
>>>>>  B 3 325
>>>>>  C 1 250
>>>>>  C 2 321
>>>>>  C 3 852
>>>>> end
>>>>> egen g=group(hospital)
>>>>> su g, mean
>>>>> loc N=r(max)
>>>>> forv i=1/`N' {
>>>>>  g double d`i'=.
>>>>> }
>>>>> levelsof time, loc(ts)
>>>>> fillin time g
>>>>> sort time g
>>>>> g long obs=_n
>>>>> qui foreach t of loc ts {
>>>>>  su obs if time==`t', mean
>>>>>  loc n0=r(min)
>>>>>  loc n1=r(max)
>>>>>  forv i=`n0'/`n1' {
>>>>>   loc n=`i'-`n0'+1
>>>>>   replace d`n'=(patients-patients[`i'])^2 if inrange(_n,`n0',`n1')
>>>>>  }
>>>>> }
>>>>> l, sepby(time) noo
>>>>>
>>>>> On Thu, Jun 10, 2010 at 5:08 AM, Anthony Laverty
>>>>>> Dear Statalist
>>>>>>
>>>>>>
>>>>>>
>>>>>> I have data on patient numbers at various hospitals and am trying to
>>>>>> calculate a new variable which is the Euclidean distance between one
>>>>>> specific hospital (say A) and all of the others, so that I can select
>>>>>> which hospitals had the most similar number of patients across all
>>>>>> months.  The data is more or less arranged like this (although it has
>>>>>> a few more columns not of direct interest to this question):
>>>>>>
>>>>>> Hospital  Time  Patients
>>>>>> A         1     456
>>>>>> A         2     759
>>>>>> A         3     236
>>>>>> B         1     214
>>>>>> B         2     854
>>>>>> B         3     325
>>>>>> C         1     250
>>>>>> C         2     321
>>>>>> C         3     852
>>>>>>
>>>>>>
>>>>>>
>>>>>> So, I want to cycle through each time period and calculate the
>>>>>> difference squared between hospital A and all of the other hospitals
>>>>>> individually as one new variable.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Any suggestions greatly appreciated
>>>>>>
>>>>>>
>>>>>>
>>>>>> Anthony Laverty
>
> *
> *   For searches and help try:
> *   http://www.stata.com/help.cgi?search
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>

```
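
The request in the thread (pick the k hospitals closest to hospital A in Euclidean distance over all months) can be sketched in a few lines once the data are in wide form. This is a minimal sketch, not code from the thread: the variable name dist and the local k are assumptions introduced here, the toy data are the nine rows from the original question, and k is set to 2 only because the toy data have two other hospitals (the thread mentions 15 for the real data).

```stata
clear
input str1 hospital time patients
A 1 456
A 2 759
A 3 236
B 1 214
B 2 854
B 3 325
C 1 250
C 2 321
C 3 852
end

* one row per hospital, one column per month (patients1, patients2, ...)
reshape wide patients, i(hospital) j(time)

* accumulate squared differences from hospital A, month by month
* (dist is a hypothetical name; a missing month makes dist missing)
generate double dist = 0
foreach v of varlist patients* {
    summarize `v' if hospital == "A", meanonly
    replace dist = dist + (`v' - r(mean))^2
}
replace dist = sqrt(dist)

* drop the reference hospital and list the k nearest of the rest
drop if hospital == "A"
sort dist hospital
local k = 2
list hospital dist in 1/`k'
```

With the toy data, hospital B sorts ahead of hospital C. Working in wide form keeps one distance per hospital rather than one variable per hospital, which avoids the memory concern raised earlier in the thread; replacing each `v' with its log before differencing would give the log-scale variant that was suggested.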