# Re: st: calculating nearest neighbors; looping back to the beginning of observations

From: "Austin Nichols" <[email protected]>
To: [email protected]
Subject: Re: st: calculating nearest neighbors; looping back to the beginning of observations
Date: Wed, 10 Oct 2007 18:21:39 -0400

```
Sarah--
To identify the nearest 100 obs, you will need 100 new variables
holding the ID for each of those neighbors; then calculating the
additional variables will also be nontrivial.  Far better to calculate
whatever you need in a single loop over all observations.  See
http://www.stata.com/statalist/archive/2007-01/msg00079.html
for more detail.

The key is to calculate, for each i, the distance to all _N-1 other
obs, then sort by distance, and then calculate summary stats on the
first 100 obs with an -in 1/100- qualifier.  Also, you might want to
calculate distance using a spherical approximation to the Earth's
surface rather than treating longitude and latitude as planar
coordinates (but see -findit vincenty- for an ellipsoidal
approximation).

On 10/10/07, Sarah Cohodes <[email protected]> wrote:
> Dear Statalisters:
>
> I have the longitude and latitude of each of my observations.  I'd
> like to identify the 100 nearest neighbors of each observation, so I
> can ultimately calculate some variables based on those nearest
> neighbors, for example the average test score of the 100 nearest
> neighbors.  I've identified  a strategy to do this, but I'm stuck
> along the way.  However, if someone has another suggestion on how to
> approach the problem, I'd really appreciate it, especially if it is
> less computationally intensive, as I have over 100,000 observations.
>
> Here's my strategy:
> 1. determine the distance between i and the next 101 j observations
> 2. determine the maximum distance of these 101 distances
> 3. replace the max distance with the 101st distance if the 101st
> distance is not the largest distance
> 4. recalculate the 101st distance with 102nd distance (etc. etc.) and
> keep if it is smaller than one of the first 100 distances and toss if
> not
>
> My relevant code so far:
>
> #delimit;
> *make 101 id and distance variables, fill in with first 101 id's and distances;
> foreach n of numlist 1/101{;
> gen id`n'=.;
> gen dist`n'=.;
> replace id`n'=id[_n+`n'];
> replace dist`n'=sqrt(
> ((longitude-longitude[_n+`n'])^2)+((latitude-latitude[_n+`n'])^2));
> *deal with last cases;
> };
>
> *find the maximum distance;
> egen maxdist=rowmax(dist*);
>
> *replace the max distance and corresponding id with the 101st distance
> and id if the 101st distance is less than the max;
> foreach n of numlist 1/101{;
> replace id`n'=id101 if dist`n'==maxdist;
> replace dist`n'=dist101 if dist`n'==maxdist;
> };
>
> I haven't written the code yet that loops through observations 102 to
> _N, because I need to address my issue first.  My problem is dealing
> with the final 100 observations and testing observations that are
> above the current observation -- essentially I want my loop to go
> "beyond" _N and return to the first and subsequent observations until
> the ith observation within the same loop.  If I don't do this, I get
> missing data in the last 100 observations, and cannot test the
> distance between an observation and an earlier numbered observation.
> Suggestions on how to do this?
>
> Or better yet, suggestions on a better way to approach the issue as a whole?
>
> Thanks very much.
> Sarah
```
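The single-loop approach suggested in the reply (for each i, compute the distance to every other observation, sort, and summarize the first 100) might look roughly like the sketch below. It is an illustration under stated assumptions, not code from the thread: the variable names `id`, `latitude`, `longitude` (decimal degrees) and `score` follow the question where possible, `score`, `nn_mean`, `d` and `origorder` are hypothetical names, the toy data at the top exist only so the block runs on its own, and the haversine formula with a 6371 km mean radius stands in for the "spherical approximation" mentioned above.

```stata
* Toy data for illustration only; drop this part and use your own dataset.
clear
set seed 12345
set obs 500
gen long id = _n
gen double latitude  = 42 + runiform()
gen double longitude = -71 - runiform()
gen double score     = 50 + 10*rnormal()

* Assumed settings: k neighbors, spherical Earth with mean radius in km.
local k   = 100
local R   = 6371
local d2r = _pi/180

gen long   origorder = _n     // lets us restore the original order
gen double d         = .      // distance from observation i to every obs
gen double nn_mean   = .      // mean score of the k nearest neighbors

local N = _N
forvalues i = 1/`N' {
    * Restore the original order so that [`i'] points at observation i.
    sort origorder

    * Haversine great-circle distance from observation i to all obs.
    * min(1, ...) guards against rounding pushing the argument above 1.
    quietly replace d = 2*`R'*asin(min(1, sqrt(                      ///
        sin((latitude - latitude[`i'])*`d2r'/2)^2 +                   ///
        cos(latitude*`d2r')*cos(latitude[`i']*`d2r')*                 ///
        sin((longitude - longitude[`i'])*`d2r'/2)^2)))

    * Exclude i itself: missing values sort last.
    quietly replace d = . if origorder == `i'

    * The k nearest neighbors are now the first k observations.
    sort d
    quietly summarize score in 1/`k', meanonly
    quietly replace nn_mean = r(mean) if origorder == `i'
}

sort origorder
drop d
```

Because each observation is compared with all others, nothing needs to "wrap around" past `_N`, and no per-neighbor ID variables are created. The cost is two sorts per observation, so with more than 100,000 observations this sketch would be slow; ties in distance are also broken arbitrarily by `sort`.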