Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: st: RE: exact command for distance ?


From   Austin Nichols <austinnichols@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: RE: exact command for distance ?
Date   Fri, 11 Sep 2009 14:00:37 -0400

Roy--
I also have no problem downloading others' work, and my hard drive is
cluttered with the output of Jann, Baum, Schaffer, Jenkins, Cox, and
many others. I seem to use one of Ben Jann's programs every day.  And
one of my posted solutions on this topic requires downloading
-vincenty- (from SSC), which gives much better distance estimates,
though at a substantial time cost.

I am not even claiming that -distmatch- has no utility--no doubt many
will find it useful.  But I'm afraid I don't see your point in this
post at all--you claim in the help file that -distmatch- "take several
minutes to complete" for 3000 obs, and other methods take "days if not
weeks" yet the method that I have outlined in several posts is
entirely general (i.e. it can be customized to produce any range of
statistics for any range of neighbors, which no program can claim to
do) and runs faster than -distmatch- in many cases, e.g. by a factor
of four or five here:

clear
tempfile p h
range pid 1 3000 3000
set seed 123
g p_lat=32+uniform()*10
g p_lon=-120+uniform()*40
compress
save `p'
clear
input h_lat h_lon
42.103 -80.065
42.103 -80.065001
39.739 -75.605
37.499 -77.470
39.464 -77.962
27.844 -82.796
39.138 -84.509
38.271 -85.694
36.143 -86.805
35.832 -86.073
36.313 -87.676
33.505 -86.801
32.288 -90.258
41.628 -93.659
44.412 -103.469
40.807 -96.625
35.608 -91.265
31.292 -92.461
31.080 -97.348
31.783 -106.474
46.612 -112.048
43.618 -116.194
41.242 -110.991
34.568 -112.456
40.698 -111.990
40.044 -111.716
46.053 -118.356
33.860 -118.149
34.093 -118.344
41.628  .
end
g id=_n
expand 100
bys id: replace id=id*100+_n
isid id
replace h_lon=h_lon+uniform()
replace h_lat=h_lat+uniform()
compress
save `h'
su
timer on 1
use `p'
local np=_N
g double dnear=.
g long idnear=.
merge using `h'
local R=6367.44
forv i=1/`np' {
local x1=p_lat[`i']
local y1=p_lon[`i']
local x2 h_lat
local y2 h_lon
qui {
g double L=(`y2'-`y1')*_pi/180
replace L=(`y2'-`y1'-360)*_pi/180 if L<. & L>_pi
replace L=(`y2'-`y1'+360)*_pi/180 if L<-_pi
local t1 acos(sin(`x2'*_pi/180)*sin(`x1'*_pi/180)
g double d=`t1'+cos(`x2'*_pi/180)*cos(`x1'*_pi/180)*cos(L))*`R'
su d, meanonly
replace dnear=r(min) in `i'
su id if d==r(min), meanonly
replace idnear=r(min) in `i'
}
drop L d
}
drop if _m==2
drop id h_lat h_lon _m
la var dnear "Distance to nearest hospital"
la var idnear "ID of nearest hospital"
timer off 1
l in 1/20, noo clean
tempfile m
save `m'
timer on 2
use `p'
ren p_lat h_lat
ren p_lon h_lon
append using `h'
distmatch, id(id) lat(h_lat) lon(h_lon) near(1) km
keep if pid<.
drop id h_lat h_lon
timer off 2
joinby pid using `m', unm(both)
ta _m
drop _m
order pid p_lat p_lon
l in 1/20, noo clean
timer list

*Plus, if you are working in a confidential data center, there is no
guarantee you can download a user-written program from SSC, so it
seems worthwhile that you can get a solution with a few lines of
regular code (one unmatched merge and a loop over observations).

On Fri, Sep 11, 2009 at 12:29 PM, Roy Wada<roywada@hotmail.com> wrote:
> I have no problem downloading and using programs written by other people. I
> have not used one of Austin's programs, but it is nice to know that it's there
> if I need it. Could I write a program on regression discontinuity from scatch?
> Sure. Give me two hours. But why would I? I am very grateful to those who have
> shared their program with me, and I hope they find my programs useful too.
>
> When a topic repeatedly come up on this list, it indicates an unaddressed
> problem. Distance matcing is a topic of growing importance that is appearing
> on this list with increasing frequency, despite the earlier assurance that
> this is not a problem in search of a solution.
>
> The solution implemented in -distmatch- is simple yet has never been implemented
> before. It is a non-intensive solution to an intensive problem. The number of
> matches that must be considered is N x N (it's actually N choose 2 repleated N
> or N-1 times, depending on how you count). This is a large number.
>
> The problem with proposing a simple solution is that some people like to entertain
> themselves thinking they could have done it on their own.
>
> But it has not been done before, and the problem keeps coming back to this list,
> as I said before.
>
> How difficult is distance matching? My first stab was remarkably similar to another
> program called -nearest- from ssc, which according to Nick Cox was not meant to
> earn a good grade in any computer science course. I am guessing this is the most
> obvious solution because this is also the one that Austin was suggesting.
>
> My first program literally took several months to run with observations of about
> 30,000. I tried paralleling the codes (multiple computers), -merge-, grid-searching,
> etc, before settling on the current form, which is at least 100 times faster than the
> first one.
>
> This rewriting of the program occurred over the course of two years. If
> someone can do this in one sitting, go ahead. Good for them.
>
> But anyone thinking that a casual user can be shown how to do this over the
> Statalist is wasting everyone's time, which was clearly the case.
>
> The current non-Stata solution, widely used by economists, is to use ArcGIS or
> ArcMap. They cost about $2000-$6000. They usually take about several days if
> not weeks of user-work. If you are using confidential data center (they usually
> charge by the hour), that's another $2000 in expenses. Good luck using the latest
> versions of these programs because they are even more difficult to use. Be grateful
> if you never had to use one of these.
>
> Roy
>
>> Laura--
>> You don't actually need to download anything to solve this kind of
>> problem, or much harder similar problems, as illustrated by e.g.
>> http://www.stata.com/statalist/archive/2009-07/msg00261.html
>> http://www.stata.com/statalist/archive/2007-01/msg00098.html
>> and similar posts.
>>
>> I particularly doubt the final claim in the help file for -distmatch-
>> in the paragraph "Distance matching is computationally intensive.
>> Observations of 3,000 may take several minutes to complete. Other
>> methods typically take days if not weeks and requires extensive
>> user-involvement."
>>
>> use farms, clear
>> local nf=_N
>> g double mindist=.
>> merge using waterbodies
>> local R=6367.44
>> qui forv i=1/`nf' {
>> local x1=farm_Y[`i']
>> local y1=farm_X[`i']
>> local x2 wat_Y
>> local y2 wat_X
>> g double L=(`y2'-`y1')*_pi/180
>> replace L=(`y2'-`y1'-360)*_pi/180 if L_pi
>> replace L=(`y2'-`y1'+360)*_pi/180 if L<-_pi
>> local t1 acos(sin(`x2'*_pi/180)*sin(`x1'*_pi/180)
>> g double d=`t1'+cos(`x2'*_pi/180)*cos(`x1'*_pi/180)*cos(L))*`R'
>> su d, meanonly
>> replace mindist=r(min) in `i'
>> drop L d
>> }
>> drop _m waterbody_ID wat_X wat_Y
>> la var mindist "Distance to center of nearest body of water"
>> or adapt as appropriate...
>
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index