[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

From |
"Sarah Cohodes" <sarah.cohodes@gmail.com> |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: calculating nearest neighbors; looping back to the beginning of observations |

Date |
Thu, 1 Nov 2007 16:51:05 -0400 |

Though it's been a couple of weeks, I wanted to finish the conversation about determining nearest neighbors with the eventual solution that I implemented (sort of). My thanks go to Austin and David, but what I ultimately drew the most from was Bill Gould's closest functions discussed here: http://www.stata.com/statalist/archive/2007-10/msg00946.html. This solution was most appealing to me because calculating the variables I needed (average test score, percentage race, etc) came from it most organically. The example code below calculates the average test score and percent nonwhite of the three nearest neighbors by distance of each observation (I decided to use the simple distance formula instead of taking note of Austin's comments because the distances in the actual data are within a single city). Expanding it to include up to k nearest neighbors seems like it would be easy. However, I programmed myself into a corner, having not thought about that fact that I have 150,000 observations, so I get memory issues (see http://www.stata.com/support/faqs/win/winmemory.html, and perhaps the limits of Mata itself), and can't create the necessary distance matrix, which is square. I still wanted to post the code below as I think it is a good solution for fewer observations, and see if anyone had suggestions for getting around this issue in my case. I'm going to return to David's suggestion, since it doesn't calculate the full matrix at once and see if I can get to the eventual variables that I need from it, and would appreciate any suggestions on that approach or another approach. Thanks, Sarah --------------------start example code------------------------------------- input id longitude latitude nonwhite testscore 1 5 24 0 1300 2 12 20 1 1600 3 13 22 1 1000 4 14 21 1 1470 5 15 19 1 1200 6 2 27 0 1050 7 8 27 1 850 8 15 28 0 900 9 13 25 1 1140 10 8 26 0 1100 mata /*set up closest functions, blatantly stolen from bill gould /// see http://www.stata.com/statalist/archive/2007-10/msg00946.html*/ real colvector closest(D) { c = J(rows(D), 1, 0) for (i=1; i<=rows(D); i++) { j_min = d_min = . for (j=1; j<=cols(D); j++) { if (i!=j) { if (D[i,j]<d_min) { j_min = j d_min = D[i,j] } } } c[i] = j_min } return(c) } real colvector closest2(D, c1) { D2 = D for (i=1; i<=rows(D); i++) D2[i,c1[i]] = . return(closest(D2)) } real colvector closest3(D, c1, c2) { D2 = D for (i=1; i<=rows(D); i++) { D2[i,c1[i]] = . D2[i,c2[i]] = . } return(closest(D2)) } real matrix C_of_c(c) { C = J(rows(c), rows(c), 0) for (i=1; i<=rows(C); i++) C[i,c[i]] = 1 return(C) } /*read in data to mata, create distance matrix*/ st_view(all=., ., .) d=J(rows(all), rows(all), 0) /*d is the distance matrix, the numbers in the distance formula refer the /// column position of latitude and longitude*/ for (i=1; i<=rows(all); i++){ for (j=1; j<=rows(all); j++){ d[i,j]= sqrt( (all[i,2]:-all[j,2]):^2 + (all[i,3]:-all[j,3]):^2) } } /*apply closest functions to distance matrix, create 3 closest matrix*/ c1 = closest(d) c2 =closest2(d, c1) c3 =closest3(d, c1, c2) C1 = C_of_c(c1) C2 = C_of_c(c2) C3 = C_of_c(c3) top3=transposeonly(C1+C2+C3) /*transpose top matrix so that rows match /// to observations*/ /*make zeros missings*/ for (i=1; i<=rows(all); i++){ for (j=1; j<=rows(all); j++){ if (top3[i,j] == 0){ top3[i,j] = . } } } /*average test score of three nearest neighbors*/ test =all[.,5] testT3=test:*top3 avgtest = transposeonly((colsum(test:*top3)):/(colnonmissing(testT3))) /*percentage non-white/asian of three nearest neighbors*/ nonw=all[.,4] nonwT3=nonw:*top3 /*count non-missings to get denominator*/ denom2=colnonmissing(nonwT3) /*make zeros to missing, then count to get numerator*/ for (i=1; i<=rows(all); i++){ for (j=1; j<=rows(all); j++){ if (nonwT3[i,j] == 0){ nonwT3[i,j] = . } } } pernonwhite = transposeonly((colnonmissing(nonwT3)):/ denom2) /*add vectors to stata data set*/ (void) st_addvar("float", "avgtestNN") st_store(., "avgtestNN", avgtest) (void) st_addvar("float", "pernonwNN") st_store(., "pernonwNN", pernonwhite) end --------------------end example code------------------------------------- On 10/11/07, David M. Drukker <ddrukker@stata.com> wrote: > Sarah Cohodes <sarah.cohodes@gmail.com> asked how to calculate a summary > statistic of a variable for the k nearest neighbors. > > Austin Nichols < austinnichols@gmail.com> replied with a good solution. > > We put the minindex() function into Mata to handle problems like Sarah's. > > Below I outline a possible solution method using the minindex() function in > Mata. > > The two advantages of this solution are that it is fast and that the > minindex() function returns exactly what you want, a vector of the indices > of the smallest distances. > > I have appended a version of the code for a problem like Sarah's below. > The code > 1) simulates some data; > 2) copies the variables into Mata vectors; and > 3) for each observation it > a) finds the vector of indices of the closest observations, > b) extracts the vector of the closest observations from y, and > c) calculates the mean of the closest observations in y. > > To illustrate how the code works, ind, y[ind] and mean(y[ind]) are > displayed. In adopting this code for her own use, Sarah could remove > these display statements. > > To keep it simple, tied distances would expand the number of indices > returned by minindex() as discussed in help mata minindex(). > > I hope that this helps. > > --David > ddrukker@stata.com > > ---------------------------Begin example code---------------------------------- > version 10 > clear all > > set seed 12345 > set obs 1000 > > gen x1 = uniform() > gen x2 = uniform() > gen y = invnormal(uniform()) + x1^2 + x2^2 > > > mata: > > x1 = st_data(., "x1") // put x1 variable into x1 vector > x2 = st_data(., "x2") // put x2 variable into x2 vector > y = st_data(., "y") // put y variable into y vector > > n = rows(x1) > ind = . // initialize ind vector > w = . // initialize w vector > > // loop over observations > // I am working over first 3 > // observations for illustration > // purposes > // change the 3 to n for the full > // problem > for(i=1; i<=3; ++i) { > // calculate distance for i(th) > // observation > d = sqrt((x1:-x1[i]):^2 + (x2:-x2[i]):^2) > //put vector of minimum indices into > // ind, if no ties ignore w, if ties > // use w to handle ties > minindex(d, 5, ind, w) > > // display ind > "ind is " > ind > // display corresponding values from y > "y extract is " > y[ind] > // calculate mean of appropriate > // values of y > mean(y[ind]) > } > > end > ---------------------------End example code---------------------------------- > > > > > * > * For searches and help try: > * http://www.stata.com/support/faqs/res/findit.html > * http://www.stata.com/support/statalist/faq > * http://www.ats.ucla.edu/stat/stata/ > * * For searches and help try: * http://www.stata.com/support/faqs/res/findit.html * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

- Prev by Date:
**st: RE: RE: mata (simple): generating random numbers** - Next by Date:
**st: Gen & replace** - Previous by thread:
**st: mata (simple): generating random numbers** - Next by thread:
**Re: st: calculating nearest neighbors; looping back to the beginning of observations** - Index(es):

© Copyright 1996–2014 StataCorp LP | Terms of use | Privacy | Contact us | What's new | Site index |