Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.
From | Anthony Laverty <anthonylav@googlemail.com> |
To | statalist@hsphsun2.harvard.edu |
Subject | Re: st: Calculating Euclidean Distance |
Date | Fri, 11 Jun 2010 16:55:43 +0100 |
I certainly don't, indeed! You got it in one: I am matching on patient
volume pre-policy changes, using a dummy variable for before and after
it takes effect. I think I am indeed moving down the road of estimating
a few different ways on the data and in simulations, and comparing the
results, so your code and pointers toward -xtdpd- are very helpful.

Many thanks

Anthony

On Fri, Jun 11, 2010 at 3:29 PM, Austin Nichols <austinnichols@gmail.com> wrote:
> Anthony Laverty <anthonylav@googlemail.com> :
>
> Well, you certainly don't want to match on your outcome variable, so I
> assume you are matching on patient volumes from the pre period, before
> any policy changes, and maybe you have a dummy t measuring whether a
> particular policy was instituted, and you have an outcome y which is
> patient volume at some later date. Then define x1 to x12 for months 1
> to 12 of the pre period (or whatever months are in the pre period),
> and use -nnmatch- (remembering that you can get x1 to x12 from the
> data structure you outlined via -reshape- to wide form). See also
> -help xtdpd- and related manual entries, if you want to compare to a
> regression taking account of the lagged dep var on the RHS.
> But compare some other approaches:
>
> set seed 1234
> clear
> input str1 hospital time patients
> A 1 456
> A 2 759
> A 3 236
> B 1 214
> B 2 854
> B 3 325
> C 1 250
> C 2 321
> C 3 852
> end
> * make more fake data
> expand 100
> ren patients x
> bys time (hospital): g g=_n
> drop hospital
> replace x=ceil(uniform()*x)
> reshape wide x, i(g) j(time)
> *make a fake treatment corr with observed x
> g byte t=(uniform()<x2/500)
> g y=ceil(x1^2+x2^2/2+x3^2/3+t+rnormal()*10)
> * estimate effect of treatment t with nnmatch or reg
> nnmatch y t x1-x3, met(maha) bias(bias) robust(4)
> reg y t
> reg y t c.x1##c.x1 c.x2##c.x2 c.x3##c.x3
> *now parametric propensity score reweighting
> qui logit t c.x1##c.x1 c.x2##c.x2 c.x3##c.x3
> predict p
> g pw=cond(t,1/p,1/(1-p))
> reg y t [pw=pw]
> reg y t c.x1##c.x1 c.x2##c.x2 c.x3##c.x3 [pw=pw]
> *now nonparametric propensity score reweighting
> forv i=1/3 {
>  xtile z`i'=x`i', nq(4)
> }
> egen np=mean(t), by(z1 z2 z3)
> g npw=cond(t,1/np,1/(1-np))
> reg y t [pw=npw]
> reg y t c.x1##c.x1 c.x2##c.x2 c.x3##c.x3 [pw=npw]
>
> The last, a double-robust approach with nonparametric propensity score
> reweighting, has a variety of proven advantages over alternatives.
> None has sufficient power, but some think they do... you may want to
> design a simulation based on your data and some hypothesized treatment
> effects, to see what seems to have the lowest bias or MSE in your
> design. Or just estimate 10 different ways, and hope you get similar
> answers!
>
>
> On Fri, Jun 11, 2010 at 4:43 AM, Anthony Laverty
> <anthonylav@googlemail.com> wrote:
>> Fair enough, I didn't really give too much more away. After the
>> matching I am planning on running a difference in difference analysis
>> to assess the effect of policy changes on patient numbers, using
>> the matches as a comparison group.
>> Mahalanobis distance may in fact be an improvement, so I will look that up
>>
>> Many thanks
>>
>> On Thu, Jun 10, 2010 at 4:50 PM, Austin Nichols <austinnichols@gmail.com> wrote:
>>> Anthony Laverty <anthonylav@googlemail.com> :
>>> You didn't give more detail on your problem--what are you going to use
>>> the matches for? Why use the sum of squared differences in each
>>> month, as opposed to, say, the Mahalanobis distance over all months
>>> (-reshape- to have T variables measuring # of patients in each month,
>>> and find the closest 15 obs in the standard deviation metric)? That
>>> would match not only on levels but on seasonal patterns, for example.
>>> Is there a regression you plan to run after matching? You may want to
>>> -findit nnmatch- in that case.
>>>
>>> On Thu, Jun 10, 2010 at 11:30 AM, Anthony Laverty
>>> <anthonylav@googlemail.com> wrote:
>>>> Hi Austin
>>>>
>>>> That's helpful, thanks, and good points about my memory considerations
>>>> and perhaps using a log scale
>>>>
>>>> Unfortunately, what I really want to be able to do is choose a group
>>>> of hospitals (say 15) which are closest in Euclidean distance terms to
>>>> hospital A over all months, rather than just the one closest hospital.
>>>> I was planning to aggregate these for the whole of the time period at
>>>> the end, if that makes things any easier.
>>>>
>>>> In terms of more detail, I'm not sure if it helps to say that this was
>>>> relatively easy to work out in Excel, using a different column for
>>>> each time period; a row for each hospital and the number of patients
>>>> for each time period in a table like this. Then, it was quite easy to
>>>> work out the distances with the equation subtracting different
>>>> hospitals' numbers from each other, using if statements to match on
>>>> time.
>>>> The new data I have is too big for Excel to do this, which is
>>>> why I have turned to Stata (and Statalist)
>>>>
>>>> Thanks for your consideration
>>>>
>>>> Anthony
>>>>
>>>>
>>>> On Thu, Jun 10, 2010 at 2:59 PM, Austin Nichols <austinnichols@gmail.com> wrote:
>>>>> Anthony Laverty <anthonylav@googlemail.com> :
>>>>> If you have N hospitals at T points in time, then you will have NTxN
>>>>> squared distances in your variables, and if they are doubles you may
>>>>> well run out of memory long before that, but if all you want is the
>>>>> nearest hospital, then you want one variable per hospital giving the
>>>>> identity of the nearest (over all months, you seem to suggest). You
>>>>> might also want to compute distance on a log scale, or some other
>>>>> metric. With more detail on your problem, you may get a better answer.
>>>>> Nevertheless, this is like what you asked for, I think:
>>>>>
>>>>> clear
>>>>> input str1 hospital time patients
>>>>> A 1 456
>>>>> A 2 759
>>>>> A 3 236
>>>>> B 1 214
>>>>> B 2 854
>>>>> B 3 325
>>>>> C 1 250
>>>>> C 2 321
>>>>> C 3 852
>>>>> end
>>>>> egen g=group(hospital)
>>>>> su g, mean
>>>>> loc N=r(max)
>>>>> forv i=1/`N' {
>>>>>  g double d`i'=.
>>>>> }
>>>>> levelsof time, loc(ts)
>>>>> fillin time g
>>>>> sort time g
>>>>> g long obs=_n
>>>>> qui foreach t of loc ts {
>>>>>  su obs if time==`t', mean
>>>>>  loc n0=r(min)
>>>>>  loc n1=r(max)
>>>>>  forv i=`n0'/`n1' {
>>>>>   loc n=`i'-`n0'+1
>>>>>   replace d`n'=(patients-patients[`i'])^2 if inrange(_n,`n0',`n1')
>>>>>  }
>>>>> }
>>>>> l, sepby(time) noo
>>>>>
>>>>> On Thu, Jun 10, 2010 at 5:08 AM, Anthony Laverty
>>>>> <anthonylav@googlemail.com> wrote:
>>>>>> Dear Statalist
>>>>>>
>>>>>> I have data on patient numbers at various hospitals and am trying to
>>>>>> calculate a new variable which is the Euclidean distance between one
>>>>>> specific hospital (say A) and all of the others, so that I can select
>>>>>> which hospitals had the most similar number of patients across all
>>>>>> months. The data is more or less arranged like this (although it has
>>>>>> a few more columns not of direct interest to this question):
>>>>>>
>>>>>> Hospital  Time  Patients
>>>>>> A         1     456
>>>>>> A         2     759
>>>>>> A         3     236
>>>>>> B         1     214
>>>>>> B         2     854
>>>>>> B         3     325
>>>>>> C         1     250
>>>>>> C         2     321
>>>>>> C         3     852
>>>>>>
>>>>>> So, I want to cycle through each time period and calculate the
>>>>>> difference squared between hospital A and all of the other hospitals
>>>>>> individually as one new variable.
>>>>>>
>>>>>> Any suggestions greatly appreciated
>>>>>>
>>>>>> Anthony Laverty

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
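
A minimal sketch of the calculation asked about in the original question: the Euclidean distance from one reference hospital (A) to every other hospital over all months, sorted so the closest hospitals can be picked off the top. This is not code from the thread. The variable names ref, sqdiff and dist are made up for illustration, and the sketch assumes a balanced panel (every hospital, including A, has exactly one non-missing row in every month).

```stata
clear
input str1 hospital time patients
A 1 456
A 2 759
A 3 236
B 1 214
B 2 854
B 3 325
C 1 250
C 2 321
C 3 852
end
* flag the reference hospital so it sorts first within each month
* (ref is a hypothetical helper variable: 0 for A, 1 otherwise)
g byte ref = hospital != "A"
* within each month, patients[1] is hospital A's count
* (assumes A is present and non-missing in every month)
bys time (ref): g double sqdiff = (patients - patients[1])^2
* sum the squared differences over months, then take the square root
collapse (sum) sqdiff, by(hospital)
g double dist = sqrt(sqdiff)
sort dist
* row 1 is hospital A itself (distance 0); the next rows are its nearest neighbours
l in 2/3, noo
```

With real data, something like -l in 2/16- after the sort would show the 15 closest hospitals. Working in long form this way keeps one distance per hospital rather than one variable per hospital pair, which sidesteps the memory concern raised earlier in the thread.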