



Re: st: Calculating Euclidean Distance


From   Austin Nichols <austinnichols@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: Calculating Euclidean Distance
Date   Thu, 10 Jun 2010 11:50:44 -0400

Anthony Laverty <anthonylav@googlemail.com> :
You didn't give more detail on your problem--what are you going to use
the matches for?  Why use the sum of squared differences in each
month, as opposed to, say, the Mahalanobis distance over all months
(-reshape- to have T variables measuring # of patients in each month,
and find the closest 15 obs in the standard deviation metric)?  That
would match not only on levels but on seasonal patterns, for example.
Is there a regression you plan to run after matching?  You may want to
-findit nnmatch- in that case.
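
A minimal sketch of the "closest 15" selection discussed above, using the three-hospital example data from this thread. It reshapes to one row per hospital, then computes the plain Euclidean distance to hospital A and a variant with each month scaled by its standard deviation across hospitals (a diagonal simplification, not the full Mahalanobis distance). The variable names d2, d2sd, dist, dist_sd, isref and the local K are illustrative assumptions, not part of the original posts.

```stata
clear
input str1 hospital time patients
 A 1 456
 A 2 759
 A 3 236
 B 1 214
 B 2 854
 B 3 325
 C 1 250
 C 2 321
 C 3 852
end
* one row per hospital, one column per month: patients1 patients2 patients3
reshape wide patients, i(hospital) j(time)
* running sums of squared differences from the reference hospital
g double d2=0
g double d2sd=0
* put the reference hospital (assumed to be "A") in the last row
g byte isref=hospital=="A"
sort isref hospital
foreach v of varlist patients* {
 qui su `v'
 loc sd=r(sd)
 * raw squared difference from A in this month
 qui replace d2=d2+(`v'-`v'[_N])^2
 * same difference measured in SD units of this month
 qui replace d2sd=d2sd+((`v'-`v'[_N])/`sd')^2
}
g double dist=sqrt(d2)
g double dist_sd=sqrt(d2sd)
* drop A itself and keep the K nearest hospitals (fewer if fewer exist)
drop if isref
sort dist
loc K 15
keep in 1/`=min(`K',_N)'
l hospital dist dist_sd, noo
```

With the example data only B and C remain, and B sorts first (distance sqrt(75510), about 274.8). A month that is missing for a hospital makes its distance missing, so that hospital sorts last rather than looking artificially close.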

On Thu, Jun 10, 2010 at 11:30 AM, Anthony Laverty
<anthonylav@googlemail.com> wrote:
> Hi Austin
>
> That's helpful, thanks, and good points about my memory considerations
> and perhaps using a log scale.
>
> Unfortunately, what I really want to be able to do is choose a group
> of hospitals (say 15) which are closest in Euclidean distance terms to
> hospital A over all months, rather than just the one closest hospital.
> I was planning to aggregate these for the whole of the time period at
> the end, if that makes things any easier.
>
> In terms of more detail, I'm not sure if it helps to say that this was
> relatively easy to work out in Excel, using a different column for
> each time period, a row for each hospital, and the number of patients
> for each time period in a table like this. Then it was quite easy to
> work out the distances with an equation subtracting different
> hospitals' numbers from each other, using IF statements to match on
> time. The new data I have is too big for Excel to do this, which is
> why I have turned to Stata (and Statalist).
>
> Thanks for your consideration
>
> Anthony
>
>
> On Thu, Jun 10, 2010 at 2:59 PM, Austin Nichols <austinnichols@gmail.com> wrote:
>> Anthony Laverty <anthonylav@googlemail.com> :
>> If you have N hospitals at T points in time, then you will have NTxN
>> squared distances in your variables, and if they are doubles you may
>> well run out of memory long before that, but if all you want is the
>> nearest hospital, then you want one variable per hospital giving the
>> identity of the nearest (over all months, you seem to suggest). You
>> might also want to compute distance on a log scale, or some other
>> metric. With more detail on your problem, you may get a better answer.
>> Nevertheless, this is like what you asked for, I think:
>>
>> clear
>> input str1 hospital time patients
>>  A 1 456
>>  A 2 759
>>  A 3 236
>>  B 1 214
>>  B 2 854
>>  B 3 325
>>  C 1 250
>>  C 2 321
>>  C 3 852
>> end
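>> * number the hospitals 1..N
>> egen g=group(hospital)
>> su g, mean
>> loc N=r(max)
>> * one variable per hospital (d1..dN) to hold squared differences
>> forv i=1/`N' {
>>  g double d`i'=.
>> }
>> levelsof time, loc(ts)
>> * balance the panel so every time block has N rows in the same order
>> fillin time g
>> sort time g
>> g long obs=_n
>> qui foreach t of loc ts {
>>  * first and last observation numbers of this time block
>>  su obs if time==`t', mean
>>  loc n0=r(min)
>>  loc n1=r(max)
>>  forv i=`n0'/`n1' {
>>   * n is the hospital number of observation i within the block
>>   loc n=`i'-`n0'+1
>>   replace d`n'=(patients-patients[`i'])^2 if inrange(_n,`n0',`n1')
>>  }
>> }
>> l, sepby(time) noo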
>>
>> On Thu, Jun 10, 2010 at 5:08 AM, Anthony Laverty
>> <anthonylav@googlemail.com> wrote:
>>> Dear Statalist
>>>
>>>
>>>
>>> I have data on patient numbers at various hospitals and am trying to
>>> calculate a new variable which is the Euclidean distance between one
>>> specific hospital (say A) and all of the others, so that I can select
>>> which hospitals had the most similar number of patients across all
>>> months.  The data are more or less arranged like this (although there
>>> are a few more columns not of direct interest to this question):
>>>
>>> Hospital   Time   Patients
>>> A          1      456
>>> A          2      759
>>> A          3      236
>>> B          1      214
>>> B          2      854
>>> B          3      325
>>> C          1      250
>>> C          2      321
>>> C          3      852
>>>
>>>
>>>
>>> So, I want to cycle through each time period and calculate the
>>> squared difference between hospital A and each of the other hospitals
>>> individually as one new variable.
>>>
>>>
>>>
>>> Any suggestions greatly appreciated
>>>
>>>
>>>
>>> Anthony Laverty
