How do I identify neighbors of points or areas on a rectangular grid in Stata?
|
Title
|
|
Neighbors on a rectangular grid
|
|
Author
|
Nicholas J. Cox, Durham University, UK
|
|
Date
|
October 2004; minor revisions March 2007
|
Say you have spatial data for points or areas on a rectangular grid,
meaning a grid with a rectangular mesh, and not necessarily one whose
overall shape is rectangular. Some minor trickery with
by: allows you to
identify the neighbors of each point and thus be able to carry out some
simple spatial processing. Even if you are not especially interested in
spatial data, the answer provides good examples of the use of by:, so
read on.
Let us imagine a set-up like this:
x coordinates increasing to right
* *
* * * * * *
* * * * * * * * * * * * * *
* * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * *
y * * * * * * * * * * * * * * * * * * *
coordinates * * * * * * * * * * * * * * * * * * *
increasing * * * * * * * * * * * * * * * * * * * *
to * * * * * * * * * * * * * * * * * * * * *
top * * * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * * * * * *
* * * * * * * * * * * * * *
* * * * * *
We are going to assume each row is indexed by the same integer y
coordinate and each column is indexed by the same integer x
coordinate. As the diagram should suggest, we are not assuming that the
overall shape of the grid is rectangular or even some regular geometric
shape. For concreteness, we are assuming a Cartesian convention about
directions in which coordinates increase. However, a matrix convention in
which row numbers increase downwards is easily accommodated by minor changes
to what follows.
In addition, we are going to skate over the difference between gridded point
data (e.g., altitude or rainfall is measured at a notional point) and
gridded area data (e.g., number of people is recorded in a rectangle).
Although important otherwise, it is secondary to the question of
identifying neighbors for which the principles are identical. The
terminology of points will include areas.
Using a geographic convention, let us imagine the grid is oriented so
that north is at the top, and we can identify neighbors of each point to the
north, northeast, east, southeast, south, southwest, west and northwest.
Suppose we type
. by y (x), sort:
After such a sort,
points would be ordered first by y coordinate and then by x
coordinate. Thus for any observation, the previous observation in Stata's
memory is that to the immediate west, and the observation following is that
to the immediate east. Given some variable z with information on
what is happening at each x and y (altitude, number of people
in grid rectangle, whatever),
. by y (x), sort: generate z_w = z[_n-1]
. by y: generate z_e = z[_n+1]
However, what we just said is not quite true: it is incorrect for the
beginnings and ends of rows. For example, the observation before that for
the beginning of a row is for the previous row and is not its neighbor to
the west. Nevertheless, this code needs no extra tricks to cope with edge
problems. Because the first point in any row has no neighbor to the west,
z_w for that point will be filled in with missing. This is
ensured because everything is done under the aegis of by:.
Processing takes place within the groups of observations defined by
by:, and there can be no spurious carry-overs between rows.
Reversing the roles of the coordinates, we can get north and south
neighbors:
. by x (y), sort: generate z_s = z[_n-1]
. by x: generate z_n = z[_n+1]
The other four neighbors require one extra small trick. The NW and SE
neighbors of any point lie on the diagonal through that point for which x
+ y is constant, and the NE and SW neighbors of any point lie on the
other diagonal, for which x − y is constant. Given variables
constructed for these diagonal sums, the rest is just a variation on the
prevailing theme:
. generate long xpy = x + y
. generate long xmy = x - y
. by xpy (x), sort: generate z_nw = z[_n-1]
. by xpy: generate z_se = z[_n+1]
. by xmy (x), sort: generate z_sw = z[_n-1]
. by xmy (x): generate z_ne = z[_n+1]
That is it, really. We were a little paranoid in insisting on a long
variable type, just in case the sums (in particular) were very large. Quite
a lot of elementary spatial analysis is now in reach just by using the
constructed variables (and without any looping over observations or special
programming). For example, you can get an average of NEWS neighbors and one
of NW, NE, SW, SE neighbors by
. egen z_news = rowmean(z_n z_e z_w z_s)
. egen z_nwneswse = rowmean(z_nw z_ne z_sw z_se)
The egen function
rowmean() and its siblings are especially recommended here for doing
what you would usually regard as the right thing whenever one or more
variables are missing.
|