Re: st: calculating nearest neighbors; looping back to the beginning of observations

From: "David M. Drukker"
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: calculating nearest neighbors; looping back to the beginning of observations
Date: Thu, 11 Oct 2007 09:54:54 -0500 (CDT)

Sarah Cohodes <sarah.cohodes@gmail.com> asked how to calculate a summary
statistic of a variable for the k nearest neighbors.

Austin Nichols <austinnichols@gmail.com> replied with a good solution.

We put the minindex() function into Mata to handle problems like Sarah's.

Below I outline a possible solution method using the minindex() function in
Mata.

The two advantages of this solution are that it is fast and that the
minindex() function returns exactly what you want, a vector of the indices
of the smallest distances.

I have appended a version of the code for a problem like Sarah's below.
The code
  1) simulates some data;
  2) copies the variables into Mata vectors; and
  3) for each observation,
     a) finds the vector of indices of the closest observations,
     b) extracts the vector of the closest observations from y, and
     c) calculates the mean of the closest observations in y.

To illustrate how the code works, ind, y[ind], and mean(y[ind]) are
displayed. In adapting this code for her own use, Sarah could remove
these display statements.

To keep it simple, the code does not handle ties. Tied distances would
expand the number of indices returned by minindex(), as discussed in
help mata minindex().
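
As a small sketch of the tie behavior, consider a made-up distance vector
(the values in d and the names k and nearest are illustrative only, not part
of the original code). As documented in help mata minindex(), when values are
tied ind holds more than k indices, and row j of w gives the position in ind
where the jth smallest value starts and how many tied indices it has. If
exactly k neighbors are wanted, one option is to keep only the first k
elements of ind, which breaks ties by whatever order minindex() lists them.

---------------------------Begin tie sketch-------------------------------------
mata:
d = (1 \ 3 \ 1 \ 2 \ 3)     // made-up distances; the two 1s are tied
k = 2
ind = .
w = .
minindex(d, k, ind, w)

ind                         // three indices: both tied 1s, then the 2
w                           // row j: start position in ind, number of ties

// keep exactly k indices, breaking ties by position in ind
nearest = ind[|1 \ min((k, rows(ind)))|]
nearest
end
---------------------------End tie sketch---------------------------------------

Other choices are possible, such as averaging over all tied observations,
in which case ind can be used as returned.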

I hope that this helps.

--David
ddrukker@stata.com

---------------------------Begin example code----------------------------------
version 10
clear all

set seed 12345
set obs 1000

gen x1 = uniform()
gen x2 = uniform()
gen y = invnormal(uniform()) + x1^2 + x2^2

mata:

x1 = st_data(., "x1") // put x1 variable into x1 vector
x2 = st_data(., "x2") // put x2 variable into x2 vector
y = st_data(., "y") // put y variable into y vector

n = rows(x1)            // number of observations
ind = .                 // initialize ind vector
w = .                   // initialize w vector

// Loop over observations. For illustration, only the first 3
// observations are processed; change the 3 to n for the full problem.
for(i=1; i<=3; ++i) {
        // distance from observation i to every observation
        d = sqrt((x1:-x1[i]):^2 + (x2:-x2[i]):^2)

        // put the indices of the 5 smallest distances into ind;
        // w records ties and can be ignored when there are none
        // (d[i] is 0, so observation i itself is among the indices)
        minindex(d, 5, ind, w)

        // display ind
        "ind is "
        ind
        // display the corresponding values from y
        "y extract is "
        y[ind]
        // calculate the mean of those values of y
        mean(y[ind])
}

end
---------------------------End example code----------------------------------
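
One possible extension, offered as a sketch rather than as part of the
original solution: run the loop over all n observations, leave each
observation out of its own neighbor set, and save the means as a Stata
variable. The names k and ybar are hypothetical, and excluding observation i
is an assumption about what "nearest neighbors" should mean for the problem at
hand. Ties beyond the kth neighbor are cut by position in ind.

---------------------------Begin extension sketch-------------------------------
version 10
clear all

set seed 12345
set obs 1000

gen x1 = uniform()
gen x2 = uniform()
gen y  = invnormal(uniform()) + x1^2 + x2^2

mata:
x1 = st_data(., "x1")
x2 = st_data(., "x2")
y  = st_data(., "y")

n    = rows(x1)
k    = 5                    // number of neighbors (assumed)
ybar = J(n, 1, .)           // one neighbor mean per observation
ind  = .
w    = .

for(i=1; i<=n; ++i) {
        d = sqrt((x1:-x1[i]):^2 + (x2:-x2[i]):^2)
        // ask for k+1 minimums because d[i] = 0 puts observation i
        // among the indices returned
        minindex(d, k+1, ind, w)
        ind = select(ind, ind :!= i)            // drop observation i itself
        ind = ind[|1 \ min((k, rows(ind)))|]    // keep k; ties cut by position
        ybar[i] = mean(y[ind])
}

// create a new Stata variable ybar and fill it from the Mata vector
st_store(., st_addvar("double", "ybar"), ybar)
end

summarize y ybar
---------------------------End extension sketch---------------------------------

Each pass computes all n distances, so the work grows with the square of the
number of observations.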
