[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]
Re: st: calculating nearest neighbors; looping back to the beginning of observations

From	[email protected] (Jean Marie Linhart, StataCorp LP)
To	[email protected]
Subject	Re: st: calculating nearest neighbors; looping back to the beginning of observations
Date	Mon, 05 Nov 2007 14:49:32 -0600
Sarah Cohodes (sarah <dot> cohodes <at> gmail <dot> com) asked:

> However, I programmed myself into a corner, having not thought about
> that fact that I have  150,000 observations, so I get memory issues
> (see http://www.stata.com/support/faqs/win/winmemory.html, and perhaps
> the limits of Mata itself), and can't create the necessary distance
> matrix, which is square.  I still wanted to post the code below as I
> think it is a good solution for fewer observations, and see if anyone
> had suggestions for getting around this issue in my case.  I'm going
> to return to David's suggestion, since it doesn't calculate the full
> matrix at once and see if I can get to the eventual variables that I
> need from it, and would appreciate any suggestions on that approach or
> another approach.

I'm writing primarily because Sarah's small dataset contains ties.  Sarah needs
to determine how she wants to handle ties.  David's code uses minindex() which
returns all the observations that are at the 3 nearest distances.  I.e. if the
distances, in increasing order, are .1, .1, .2, .2, .2, .3, .3, .4, etc.
minindex() will return the locations of the seven observations at the three
closest distances.

In Sarah's original code she takes only the first 3 neighbors she finds.

In the case of ties, I would take the smallest number of tied observations, so
long as I get at least three.  I may get more than three.  In the above
example, I would take the 5 closest observations.  I want at least 3, so I need
both observations at distance .1, and at least one more.  There three
observations at the next distance, .2, and since they are equivalent, I would
take all of them; this gives me five.

I do the calculation by looping over observations to create a vector of
distances from the current observation, rather than a full distance matrix
(this ran Sarah out of memory).  Looping over observations is quick in Mata,
but the searching minindex() does can become time-consuming when you have lots
of data.

I did a quick time-study on my (relatively fast) computer with 10000
observations, 20000 observations and 40000 observations.  10000 observations
ran in about 30 seconds, and 20000 observations ran in about 2 minutes, and
40000 observations ran in about 8 minutes.  The amount of time scales as n^2
where n is the number of observations, e.g. increasing the number of
observations by a factor of 4 increased the time by a factor of 16.  That
means, on my relatively fast computer, I would expect 150,000 observations (not
quite 4 times 40,000 observations) to run in approximately 2 hours.  Not
knowing how fast Sarah's computer is, I think she should test to make sure the
code is doing what she wants on a small dataset, then set this computation up
to run on her large dataset before she goes home at night!

I'm including two do files after my signature.  The first takes only the first
three neighbors, ignoring equivalences.  This is equivalent to Sarah's old
code.  The second takes all the equivalent nearest neighbors, so long as I get
at least three.  This gives a different result for the 8th observation in
Sarah's sample dataset.

--Jean Marie
[email protected]

--------Begin nn_ignore_ties.do --------

/* take only the first 3 nearest neighbors, disregarding ties */

version 10
clear all

input id longitude latitude nonwhite testscore
1 5 24 0 1300
2 12 20 1 1600
3 13 22 1 1000
4 14 21 1 1470
5 15 19 1 1200
6 2 27 0 1050
7 8 27 1 850
8 15 28 0 900
9 13 25 1 1140
10 8 26 0 1100
end

mata:

	longitude = st_data(., "longitude")	// get data
	latitude = st_data(., "latitude")
	nonwhite = st_data(., "nonwhite")
	testscore = st_data(., "testscore")

	nr = rows(longitude)
	d = J(nr,1, .) 				// initialize
	avgtest = J(nr,1, .)
	pernonwhite = J(nr,1, .)
	ii = .
	ww = .

	for (i=1; i<= nr; i++){			// loop over observations
		d = sqrt( (longitude:-longitude[i]):^2 +
			(latitude:-latitude[i]):^2)	// distance
		d[i] = . 			// exclude self
		minindex(d, 3, ii, ww) 		// 3 nearest neighbors 
	/*
	 * ii is now a real colvector with indices of mins in order.
	 * 	the rows of ii may exceed k.
	 * ww is a Kx2 real vector K<= k with the repetetiveness of the
	 * 	k mins -- first column is index of m-th min, and the 2nd
	 *	column is the number of equivalences. 
	 * Note there may be ties!  Even if there are ties, this code
	 *	takes only the first 3 neighbors returned.
	 */
		j = rows(ii) 		// just in case < 3 neighbors!
		if (3 <= j) j = 3

		testT3 = testscore[ii[|1\j|]]	// testscores for neighbors
		avgtest[i] = colsum(testT3)/rows(testT3) 
		nonwT3 = nonwhite[ii[|1\j|]]	// nonwhite for neighbors
		pernonwhite[i] = colsum(nonwT3)/colnonmissing(nonwT3)
}

	/*add vectors to stata data set*/
	(void) st_addvar("float", "avgtestNN")
	st_store(., "avgtestNN", avgtest)
	(void) st_addvar("float", "pernonwNN")
	st_store(., "pernonwNN", pernonwhite)

end
-------- end nn_ignore_ties.do -------- 

-------- begin nn.do -------- 

/* take all equivalent nearest neighbor equivalents so that you get at
least 3 neighbors */

version 10
clear all

input id longitude latitude nonwhite testscore
1 5 24 0 1300
2 12 20 1 1600
3 13 22 1 1000
4 14 21 1 1470
5 15 19 1 1200
6 2 27 0 1050
7 8 27 1 850
8 15 28 0 900
9 13 25 1 1140
10 8 26 0 1100
end

mata:

	longitude = st_data(., "longitude")	// get data
	latitude = st_data(., "latitude")
	nonwhite = st_data(., "nonwhite")
	testscore = st_data(., "testscore")

	nr = rows(longitude)
	d = J(nr,1, .) 				// initialize
	avgtest = J(nr,1, .)
	pernonwhite = J(nr,1, .)
	ii = .
	ww = .

	for (i=1; i<= nr; i++){			// loop over observations
		d = sqrt( (longitude:-longitude[i]):^2 +
			(latitude:-latitude[i]):^2)	// distance
		d[i] = . 			// exclude self
		minindex(d, 3, ii, ww) 		// 3 nearest neighbors 
	/*
	 * ii is now a real colvector with indices of mins in order.
	 * 	the rows of ii may exceed k.
	 * ww is a Kx2 real vector K<= k with the repetetiveness of the
	 * 	k mins -- first column is index of m-th min, and the 2nd
	 *	column is the number of equivalences. 
	 * Note there may be ties!  If there are ties, this code
	 *	takes at least 3.  It will take the least number greater
	 *	than three so that it gets all of the tied values.
	 */
		j = 1				// first index
		kc = 0				// # of neighbors
		while(kc < 3){		
			kc = kc + ww[j,2]	// add in equivalents
			j++			// increase index
		}
		// kc now has # of equivalent neighbors
	/*
	 * Uncomment next 3 lines if you want information on ties.  This
	 * is very obnoxious with large amounts of data! 
	 */
//		if (kc > 3) {
//			printf("For observation %f, more than 3 nearest neighbors\n",i)
//		}
		testT3 = testscore[ii[|1\kc|]]	// testscores for neighbors
		avgtest[i] = colsum(testT3)/rows(testT3) 
		nonwT3 = nonwhite[ii[|1\kc|]]	// nonwhite for neighbors
		pernonwhite[i] = colsum(nonwT3)/colnonmissing(nonwT3)
}

	/*add vectors to stata data set*/
	(void) st_addvar("float", "avgtestNN")
	st_store(., "avgtestNN", avgtest)
	(void) st_addvar("float", "pernonwNN")
	st_store(., "pernonwNN", pernonwhite)

end
-------- end nn.do --------
*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/
Prev by Date: st: graph plot region size control
Next by Date: st: RE: graph plot region size control
Previous by thread: Re: st: calculating nearest neighbors; looping back to the beginning of observations
Next by thread: st: Gen & replace
Index(es):
- Date
- Thread