Statalist


[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: calculating nearest neighbors; looping back to the beginning of observations


From   "Sarah Cohodes" <sarah.cohodes@gmail.com>
To   statalist@hsphsun2.harvard.edu
Subject   Re: st: calculating nearest neighbors; looping back to the beginning of observations
Date   Thu, 1 Nov 2007 16:51:05 -0400

Though it's been a couple of weeks, I wanted to finish the
conversation about determining nearest neighbors with the eventual
solution that I implemented (sort of).  My thanks go to Austin and
David, but what I ultimately drew the most from was Bill Gould's
closest functions discussed here:
http://www.stata.com/statalist/archive/2007-10/msg00946.html.  This
solution was most appealing to me because calculating the variables I
needed (average test score, percentage race, etc) came from it most
organically.

The example code below calculates the average test score and percent
nonwhite of the three nearest neighbors by distance of each
observation (I decided to use the simple distance formula instead of
taking note of Austin's comments because the distances in the actual
data are within a single city).  Expanding it to include up to k
nearest neighbors seems like it would be easy.

However, I programmed myself into a corner, having not thought about
that fact that I have  150,000 observations, so I get memory issues
(see http://www.stata.com/support/faqs/win/winmemory.html, and perhaps
the limits of Mata itself), and can't create the necessary distance
matrix, which is square.  I still wanted to post the code below as I
think it is a good solution for fewer observations, and see if anyone
had suggestions for getting around this issue in my case.  I'm going
to return to David's suggestion, since it doesn't calculate the full
matrix at once and see if I can get to the eventual variables that I
need from it, and would appreciate any suggestions on that approach or
another approach.

Thanks,
Sarah

--------------------start example code-------------------------------------
input id longitude latitude nonwhite testscore
1 5 24 0 1300
2 12 20 1 1600
3 13 22 1 1000
4 14 21 1 1470
5 15 19 1 1200
6 2 27 0 1050
7 8 27 1 850
8 15 28 0 900
9 13 25 1 1140
10 8 26 0 1100

mata

/*set up closest functions, blatantly stolen from bill gould ///
see http://www.stata.com/statalist/archive/2007-10/msg00946.html*/

real colvector closest(D)
{
        c = J(rows(D), 1, 0)
        for (i=1; i<=rows(D); i++) {
                j_min = d_min = .
                for (j=1; j<=cols(D); j++) {
                        if (i!=j) {
                                if (D[i,j]<d_min) {
                                        j_min = j
                                        d_min = D[i,j]
                                }
                        }
                 }
                 c[i] = j_min
          }
          return(c)
}

real colvector closest2(D, c1)
{
        D2 = D
        for (i=1; i<=rows(D); i++) D2[i,c1[i]] = .
        return(closest(D2))
}

real colvector closest3(D, c1, c2)
{
        D2 = D
        for (i=1; i<=rows(D); i++) {
                D2[i,c1[i]] = .
                D2[i,c2[i]] = .
        }
        return(closest(D2))
}

real matrix C_of_c(c)
{
        C = J(rows(c), rows(c), 0)
        for (i=1; i<=rows(C); i++) C[i,c[i]] = 1
        return(C)
}

/*read in data to mata, create distance matrix*/

st_view(all=., ., .)
d=J(rows(all), rows(all), 0)

/*d is the distance matrix, the numbers in the distance formula refer the ///
column position of latitude and longitude*/
for (i=1; i<=rows(all); i++){
    for (j=1; j<=rows(all); j++){
        d[i,j]= sqrt( (all[i,2]:-all[j,2]):^2 + (all[i,3]:-all[j,3]):^2)
    }
}

/*apply closest functions to distance matrix, create 3 closest matrix*/
c1 = closest(d)
c2 =closest2(d, c1)
c3 =closest3(d, c1, c2)
C1 = C_of_c(c1)
C2 = C_of_c(c2)
C3 = C_of_c(c3)
top3=transposeonly(C1+C2+C3) /*transpose top matrix so that rows match ///
 to observations*/
/*make zeros missings*/
for (i=1; i<=rows(all); i++){
    for (j=1; j<=rows(all); j++){
         if (top3[i,j] == 0){
            top3[i,j] = .
        }
    }
}
/*average test score of three nearest neighbors*/
test =all[.,5]
testT3=test:*top3
avgtest = transposeonly((colsum(test:*top3)):/(colnonmissing(testT3)))

/*percentage non-white/asian of three nearest neighbors*/
nonw=all[.,4]
nonwT3=nonw:*top3
/*count non-missings to get denominator*/
denom2=colnonmissing(nonwT3)
/*make zeros to missing, then count to get numerator*/
for (i=1; i<=rows(all); i++){
    for (j=1; j<=rows(all); j++){
         if (nonwT3[i,j] == 0){
            nonwT3[i,j] = .
        }
    }
}
pernonwhite = transposeonly((colnonmissing(nonwT3)):/ denom2)

/*add vectors to stata data set*/
(void) st_addvar("float", "avgtestNN")
st_store(., "avgtestNN", avgtest)
(void) st_addvar("float", "pernonwNN")
st_store(., "pernonwNN", pernonwhite)

end
--------------------end example code-------------------------------------
On 10/11/07, David M. Drukker <ddrukker@stata.com> wrote:
>  Sarah Cohodes <sarah.cohodes@gmail.com> asked how to calculate a summary
> statistic of a variable for the k nearest neighbors.
>
> Austin Nichols < austinnichols@gmail.com> replied with a good solution.
>
> We put the minindex() function into Mata to handle problems like Sarah's.
>
> Below I outline a possible solution method using the minindex() function in
> Mata.
>
> The two advantages of this solution are that it is fast and that the
> minindex() function returns exactly what you want, a vector of the indices
> of the smallest distances.
>
> I have appended a version of the code for a problem like Sarah's below.
> The code
>         1) simulates some data;
>         2) copies the variables into Mata vectors; and
>         3) for each observation it
>                  a) finds the vector of indices of the closest observations,
>                  b) extracts the vector of the closest observations from y, and
>                 c) calculates the mean of the closest observations in y.
>
> To illustrate how the code works, ind, y[ind] and mean(y[ind]) are
> displayed.  In adopting this code for her own use, Sarah could remove
> these display statements.
>
> To keep it simple, tied distances would expand the number of indices
> returned by minindex() as discussed in help mata minindex().
>
> I hope that this helps.
>
>      --David
>        ddrukker@stata.com
>
> ---------------------------Begin example code----------------------------------
> version 10
> clear all
>
> set seed 12345
> set obs 1000
>
> gen x1 = uniform()
> gen x2 = uniform()
> gen y  = invnormal(uniform()) + x1^2 + x2^2
>
>
> mata:
>
>         x1 = st_data(., "x1")           // put x1 variable into x1 vector
>         x2 = st_data(., "x2")           // put x2 variable into x2 vector
>         y  = st_data(., "y")            // put y  variable into y  vector
>
>         n = rows(x1)
>         ind = .                         // initialize ind vector
>         w   = .                         // initialize w   vector
>
>                                         // loop over observations
>                                         // I am working over first 3
>                                         // observations for illustration
>                                         // purposes
>                                         // change the 3 to n for the full
>                                         // problem
>         for(i=1; i<=3; ++i) {
>                                         // calculate distance for i(th)
>                                         // observation
>                 d = sqrt((x1:-x1[i]):^2 + (x2:-x2[i]):^2)
>                                         //put vector of minimum indices into
>                                         // ind, if no ties ignore w, if ties
>                                         // use w to handle ties
>                 minindex(d, 5, ind, w)
>
>                                         // display ind
>                 "ind is "
>                 ind
>                                         // display corresponding values from y
>                 "y extract is "
>                 y[ind]
>                                         // calculate mean of appropriate
>                                         // values of y
>                 mean(y[ind])
>         }
>
> end
> ---------------------------End example code----------------------------------
>
>
>
>
> *
> *   For searches and help try:
> *   http://www.stata.com/support/faqs/res/findit.html
> *   http://www.stata.com/support/statalist/faq
> *   http://www.ats.ucla.edu/stat/stata/
>
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/



© Copyright 1996–2014 StataCorp LP   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index