[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

From |
jlinhart@stata.com (Jean Marie Linhart, StataCorp LP) |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: calculating nearest neighbors; looping back to the beginning of observations |

Date |
Mon, 05 Nov 2007 14:49:32 -0600 |

Sarah Cohodes (sarah <dot> cohodes <at> gmail <dot> com) asked: > However, I programmed myself into a corner, having not thought about > that fact that I have 150,000 observations, so I get memory issues > (see http://www.stata.com/support/faqs/win/winmemory.html, and perhaps > the limits of Mata itself), and can't create the necessary distance > matrix, which is square. I still wanted to post the code below as I > think it is a good solution for fewer observations, and see if anyone > had suggestions for getting around this issue in my case. I'm going > to return to David's suggestion, since it doesn't calculate the full > matrix at once and see if I can get to the eventual variables that I > need from it, and would appreciate any suggestions on that approach or > another approach. I'm writing primarily because Sarah's small dataset contains ties. Sarah needs to determine how she wants to handle ties. David's code uses minindex() which returns all the observations that are at the 3 nearest distances. I.e. if the distances, in increasing order, are .1, .1, .2, .2, .2, .3, .3, .4, etc. minindex() will return the locations of the seven observations at the three closest distances. In Sarah's original code she takes only the first 3 neighbors she finds. In the case of ties, I would take the smallest number of tied observations, so long as I get at least three. I may get more than three. In the above example, I would take the 5 closest observations. I want at least 3, so I need both observations at distance .1, and at least one more. There three observations at the next distance, .2, and since they are equivalent, I would take all of them; this gives me five. I do the calculation by looping over observations to create a vector of distances from the current observation, rather than a full distance matrix (this ran Sarah out of memory). Looping over observations is quick in Mata, but the searching minindex() does can become time-consuming when you have lots of data. I did a quick time-study on my (relatively fast) computer with 10000 observations, 20000 observations and 40000 observations. 10000 observations ran in about 30 seconds, and 20000 observations ran in about 2 minutes, and 40000 observations ran in about 8 minutes. The amount of time scales as n^2 where n is the number of observations, e.g. increasing the number of observations by a factor of 4 increased the time by a factor of 16. That means, on my relatively fast computer, I would expect 150,000 observations (not quite 4 times 40,000 observations) to run in approximately 2 hours. Not knowing how fast Sarah's computer is, I think she should test to make sure the code is doing what she wants on a small dataset, then set this computation up to run on her large dataset before she goes home at night! I'm including two do files after my signature. The first takes only the first three neighbors, ignoring equivalences. This is equivalent to Sarah's old code. The second takes all the equivalent nearest neighbors, so long as I get at least three. This gives a different result for the 8th observation in Sarah's sample dataset. --Jean Marie jlinhart@stata.com --------Begin nn_ignore_ties.do -------- /* take only the first 3 nearest neighbors, disregarding ties */ version 10 clear all input id longitude latitude nonwhite testscore 1 5 24 0 1300 2 12 20 1 1600 3 13 22 1 1000 4 14 21 1 1470 5 15 19 1 1200 6 2 27 0 1050 7 8 27 1 850 8 15 28 0 900 9 13 25 1 1140 10 8 26 0 1100 end mata: longitude = st_data(., "longitude") // get data latitude = st_data(., "latitude") nonwhite = st_data(., "nonwhite") testscore = st_data(., "testscore") nr = rows(longitude) d = J(nr,1, .) // initialize avgtest = J(nr,1, .) pernonwhite = J(nr,1, .) ii = . ww = . for (i=1; i<= nr; i++){ // loop over observations d = sqrt( (longitude:-longitude[i]):^2 + (latitude:-latitude[i]):^2) // distance d[i] = . // exclude self minindex(d, 3, ii, ww) // 3 nearest neighbors /* * ii is now a real colvector with indices of mins in order. * the rows of ii may exceed k. * ww is a Kx2 real vector K<= k with the repetetiveness of the * k mins -- first column is index of m-th min, and the 2nd * column is the number of equivalences. * Note there may be ties! Even if there are ties, this code * takes only the first 3 neighbors returned. */ j = rows(ii) // just in case < 3 neighbors! if (3 <= j) j = 3 testT3 = testscore[ii[|1\j|]] // testscores for neighbors avgtest[i] = colsum(testT3)/rows(testT3) nonwT3 = nonwhite[ii[|1\j|]] // nonwhite for neighbors pernonwhite[i] = colsum(nonwT3)/colnonmissing(nonwT3) } /*add vectors to stata data set*/ (void) st_addvar("float", "avgtestNN") st_store(., "avgtestNN", avgtest) (void) st_addvar("float", "pernonwNN") st_store(., "pernonwNN", pernonwhite) end -------- end nn_ignore_ties.do -------- -------- begin nn.do -------- /* take all equivalent nearest neighbor equivalents so that you get at least 3 neighbors */ version 10 clear all input id longitude latitude nonwhite testscore 1 5 24 0 1300 2 12 20 1 1600 3 13 22 1 1000 4 14 21 1 1470 5 15 19 1 1200 6 2 27 0 1050 7 8 27 1 850 8 15 28 0 900 9 13 25 1 1140 10 8 26 0 1100 end mata: longitude = st_data(., "longitude") // get data latitude = st_data(., "latitude") nonwhite = st_data(., "nonwhite") testscore = st_data(., "testscore") nr = rows(longitude) d = J(nr,1, .) // initialize avgtest = J(nr,1, .) pernonwhite = J(nr,1, .) ii = . ww = . for (i=1; i<= nr; i++){ // loop over observations d = sqrt( (longitude:-longitude[i]):^2 + (latitude:-latitude[i]):^2) // distance d[i] = . // exclude self minindex(d, 3, ii, ww) // 3 nearest neighbors /* * ii is now a real colvector with indices of mins in order. * the rows of ii may exceed k. * ww is a Kx2 real vector K<= k with the repetetiveness of the * k mins -- first column is index of m-th min, and the 2nd * column is the number of equivalences. * Note there may be ties! If there are ties, this code * takes at least 3. It will take the least number greater * than three so that it gets all of the tied values. */ j = 1 // first index kc = 0 // # of neighbors while(kc < 3){ kc = kc + ww[j,2] // add in equivalents j++ // increase index } // kc now has # of equivalent neighbors /* * Uncomment next 3 lines if you want information on ties. This * is very obnoxious with large amounts of data! */ // if (kc > 3) { // printf("For observation %f, more than 3 nearest neighbors\n",i) // } testT3 = testscore[ii[|1\kc|]] // testscores for neighbors avgtest[i] = colsum(testT3)/rows(testT3) nonwT3 = nonwhite[ii[|1\kc|]] // nonwhite for neighbors pernonwhite[i] = colsum(nonwT3)/colnonmissing(nonwT3) } /*add vectors to stata data set*/ (void) st_addvar("float", "avgtestNN") st_store(., "avgtestNN", avgtest) (void) st_addvar("float", "pernonwNN") st_store(., "pernonwNN", pernonwhite) end -------- end nn.do -------- * * For searches and help try: * http://www.stata.com/support/faqs/res/findit.html * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

- Prev by Date:
**st: graph plot region size control** - Next by Date:
**st: RE: graph plot region size control** - Previous by thread:
**Re: st: calculating nearest neighbors; looping back to the beginning of observations** - Next by thread:
**st: Gen & replace** - Index(es):

© Copyright 1996–2014 StataCorp LP | Terms of use | Privacy | Contact us | What's new | Site index |