
st: calculating nearest neighbors; looping back to the beginning of observations

From   "Sarah Cohodes"
Subject   st: calculating nearest neighbors; looping back to the beginning of observations
Date   Wed, 10 Oct 2007 17:39:14 -0400

Dear Statalisters:

I have the longitude and latitude of each of my observations.  I'd
like to identify the 100 nearest neighbors of each observation, so I
can ultimately calculate some variables based on those nearest
neighbors, for example the average test score of the 100 nearest
neighbors.  I've identified a strategy to do this, but I'm stuck
along the way.  However, if someone has another suggestion on how to
approach the problem, I'd really appreciate it, especially if it is
less computationally intensive, as I have over 100,000 observations.

Here's my strategy:
1. determine the distance between i and the next 101 j observations
2. determine the maximum distance of these 101 distances
3. replace the max distance with the 101st distance if the 101st
distance is not the largest distance
4. recalculate the 101st distance using the 102nd observation (etc. etc.),
keep it if it is smaller than one of the first 100 distances, and toss it
if it is not
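
For illustration, the four steps above could be sketched on toy data roughly as follows. This is only a sketch under assumptions that are not in the post: it uses the default line delimiter rather than #delimit ;, a single hypothetical coordinate x with absolute difference as the distance, k = 2 neighbors instead of 100, and a wrapped subscript so that later rows also see earlier observations. The helper variables j, cand, done and hit are illustrative names.

```stata
* Toy data: 6 points on a line; x is a hypothetical coordinate, k = 2 neighbors
clear
set obs 6
gen id = _n
gen x = _n^2
local k = 2

* Step 1: fill the k slots from the next k observations, wrapping past _N
forvalues n = 1/`k' {
    gen long j = mod(_n + `n' - 1, _N) + 1
    gen id`n' = id[j]
    gen dist`n' = abs(x - x[j])
    drop j
}

* Steps 2-4: compare every remaining offset with the current maximum
local first = `k' + 1
local last = _N - 1
forvalues n = `first'/`last' {
    gen long j = mod(_n + `n' - 1, _N) + 1
    gen cand = abs(x - x[j])
    egen maxdist = rowmax(dist*)
    * done = 1 means this row has nothing (more) to replace for this candidate
    gen byte done = cand >= maxdist
    forvalues s = 1/`k' {
        * hit marks the first slot holding the maximum, so a tie replaces one slot only
        gen byte hit = !done & dist`s' == maxdist
        quietly replace id`s' = id[j] if hit
        quietly replace dist`s' = cand if hit
        quietly replace done = 1 if hit
        drop hit
    }
    drop j cand maxdist done
}
```

Every pass works on all rows at once, so the cost is one pass over the data per offset; with over 100,000 observations that is still on the order of _N squared distance evaluations.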

My relevant code so far:

*make 101 id and distance variables, fill in with first 101 id's and distances;
foreach n of numlist 1/101{;
gen id`n'=.;
gen dist`n'=.;
replace id`n'=id[_n+`n'];
replace dist`n'=sqrt(
*deal with last cases;

*find the maximum distance;
egen maxdist=rowmax(dist*);

*replace the max distance and corresponding id with the 101st distance
and id if the 101st distance is less than the max;
foreach n of numlist 1/101{;
replace id`n'=id101 if dist`n'==maxdist;
replace dist`n'=dist101 if dist`n'==maxdist;
};

I haven't written the code yet that loops through observations 102 to
_N, because I need to address my issue first.  My problem is dealing
with the final 100 observations and testing observations that are
above the current observation -- essentially I want my loop to go
"beyond" _N and return to the first and subsequent observations until
the ith observation within the same loop.  If I don't do this, I get
missing data in the last 100 observations, and cannot test the
distance between an observation and an earlier numbered observation.
Suggestions on how to do this?
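
One possible way to make the subscript go "beyond" _N and come back to observation 1 is modular arithmetic on the observation number. The following is a minimal sketch, not a tested solution for the full data: it assumes the default line delimiter rather than #delimit ;, toy data with hypothetical planar coordinates x and y (the distance formula in the post is cut off, so plain Euclidean distance stands in for it), and 3 offsets instead of 101. The index variables j1-j3 are illustrative names.

```stata
* Toy data: 5 observations with hypothetical planar coordinates x and y
clear
set obs 5
gen id = _n
gen x = _n
gen y = 0

* For offset n, the partner of observation i is i + n, wrapped past _N back to 1
forvalues n = 1/3 {
    gen long j`n' = mod(_n + `n' - 1, _N) + 1
    gen id`n' = id[j`n']
    gen dist`n' = sqrt((x - x[j`n'])^2 + (y - y[j`n'])^2)
}
```

Here observation 5 with offset 1 is paired with observation 1 rather than with a missing value. Letting the offset run from 1 to _N-1 would pair each observation with every other observation exactly once.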

Or better yet, suggestions on a better way to approach the issue as a whole?

Thanks very much.

Sarah Cohodes
Project for Policy Innovation in Education
Harvard Graduate School of Education
617.496.3408 (phone)
617.495.2614 (fax)