Home  /  Resources & support  /  FAQs  /  Neighbors on a rectangular grid

How do I identify neighbors of points or areas on a rectangular grid in Stata?

Title   Neighbors on a rectangular grid
Author Nicholas J. Cox, Durham University, UK

Say you have spatial data for points or areas on a rectangular grid, meaning a grid with a rectangular mesh, and not necessarily one whose overall shape is rectangular. Some minor trickery with by: allows you to identify the neighbors of each point and thus be able to carry out some simple spatial processing. Even if you are not especially interested in spatial data, the answer provides good examples of the use of by:, so read on.

Let us imagine a set-up like this:

		      x coordinates increasing to right
		      
		                * *
		              * * * * * *
		          * * * * * * * * * * * * * *
		        * * * * * * * * * * * * * * * *
		      * * * * * * * * * * * * * * * * * * 
      y       	      * * * * * * * * * * * * * * * * * * * 
      coordinates     * * * * * * * * * * * * * * * * * * * 
      increasing      * * * * * * * * * * * * * * * * * * * * 
      to              * * * * * * * * * * * * * * * * * * * * * 
      top             * * * * * * * * * * * * * * * * * * * * * 
		      * * * * * * * * * * * * * * * * * * * * 
	            * * * * * * * * * * * * * * * * * * * *
	            * * * * * * * * * * * * * * * * * * *
	          * * * * * * * * * * * * * * * * * * * *
	            * * * * * * * * * * * * * * * * * * *
		      * * * * * * * * * * * * * * * * * *
		            * * * * * * * * * * * * * * 
		                      * * * * * * 
				

We are going to assume each row is indexed by the same integer y coordinate and each column is indexed by the same integer x coordinate. As the diagram should suggest, we are not assuming that the overall shape of the grid is rectangular or even some regular geometric shape. For concreteness, we are assuming a Cartesian convention about directions in which coordinates increase. However, a matrix convention in which row numbers increase downwards is easily accommodated by minor changes to what follows.

In addition, we are going to skate over the difference between gridded point data (e.g., altitude or rainfall is measured at a notional point) and gridded area data (e.g., number of people is recorded in a rectangle). Although important otherwise, it is secondary to the question of identifying neighbors for which the principles are identical. The terminology of points will include areas.

Using a geographic convention, let us imagine the grid is oriented so that north is at the top, and we can identify neighbors of each point to the north, northeast, east, southeast, south, southwest, west and northwest.

Suppose we type

 . by y (x), sort: 

After such a sort, points would be ordered first by y coordinate and then by x coordinate. Thus for any observation, the previous observation in Stata's memory is that to the immediate west, and the observation following is that to the immediate east. Given some variable z with information on what is happening at each x and y (altitude, number of people in grid rectangle, whatever),

 . by y (x), sort: generate z_w = z[_n-1] 
 . by y: generate z_e = z[_n+1] 

However, what we just said is not quite true: it is incorrect for the beginnings and ends of rows. For example, the observation before that for the beginning of a row is for the previous row and is not its neighbor to the west. Nevertheless, this code needs no extra tricks to cope with edge problems. Because the first point in any row has no neighbor to the west, z_w for that point will be filled in with missing. This is ensured because everything is done under the aegis of by:. Processing takes place within the groups of observations defined by by:, and there can be no spurious carry-overs between rows.

Reversing the roles of the coordinates, we can get north and south neighbors:

 . by x (y), sort: generate z_s = z[_n-1] 
 . by x: generate z_n = z[_n+1] 

The other four neighbors require one extra small trick. The NW and SE neighbors of any point lie on the diagonal through that point for which x + y is constant, and the NE and SW neighbors of any point lie on the other diagonal, for which x − y is constant. Given variables constructed for these diagonal sums, the rest is just a variation on the prevailing theme:

 . generate long xpy = x + y           
 . generate long xmy = x - y           
 . by xpy (x), sort: generate z_nw = z[_n-1] 
 . by xpy: generate z_se = z[_n+1] 
 . by xmy (x), sort: generate z_sw = z[_n-1] 
 . by xmy (x): generate z_ne = z[_n+1] 

That is it, really. We were a little paranoid in insisting on a long variable type, just in case the sums (in particular) were very large. Quite a lot of elementary spatial analysis is now in reach just by using the constructed variables (and without any looping over observations or special programming). For example, you can get an average of NEWS neighbors and one of NW, NE, SW, SE neighbors by

 . egen z_news = rowmean(z_n z_e z_w z_s)
 . egen z_nwneswse = rowmean(z_nw z_ne z_sw z_se)

The egen function rowmean() and its siblings are especially recommended here for doing what you would usually regard as the right thing whenever one or more variables are missing.