Home  /  Products  /  Features  /  Spatial autoregressive models

<-  See Stata's other features


  • Spatial lags of:

    • Dependent variable

    • Independent variables

    • Autoregressive errors

  • Endogenous covariates

  • Heteroskedastic errors

  • Cross-sectional data

  • Panel data

    • Fixed-effects models

    • Random-effects models

      • i.i.d. panel effects

      • autoregressive panel effects

  • Analyze spillover

    • Direct effects

    • Indirect effects

  • Spatial weighting matrices

    • Inverse distance

    • Contiguity—nearest neighbor

    • Custom

    • Choice of normalization: spectral, minmax, row, more

  • Import shapefiles

Neighboring towns have more influence on each other than on towns far away. The same is true of countries that are close to each other and of closely connected friends on social media.

Spatial autoregressive models are fit using datasets that contain observations on geographical areas. Observations are called spatial units and might be countries, states, counties, postal codes, or city blocks. Alternatively, they might not be geographically based at all; they could be nodes of a social network.

Datasets contain a continuous outcome variable—such as incidence of disease, output of farms, or crime rate—along with other variables to predict the outcome. For cross-sectional data, each variable has one value per spatial unit. For panel data, there are typically multiple values for different time points.

There is a manual entirely devoted to fitting SAR models, working with spatial data, and creating and managing spatial weighting matrices. The commands are called the Sp commands. See the Spatial Autoregressive Models Reference Manual.

Let's see it work

There are three steps to fitting SAR models:

  1. Getting your data ready for analysis.

  2. Creating the spatial weighting matrices your model needs.

  3. Running your SAR model.

Stata's Sp commands will work with or without shapefiles, files commonly used to define maps. They will work with other location data or even work with data without locations at all, such as social network data.

Here's an analysis from start to finish.

Step 1. Find and translate shapefile

You do not have to use shapefiles, but we will. We downloaded a standard-format shapefile, tl_2016_us_county.zip, that we found at https://catalog.data.gov/dataset/tiger-line-shapefile-2016-nation-u-s-current-county-and-equivalent-national-shapefile.

We copied the file to our current directory, then typed in Stata:

. unzipfile tl_2016_us_county.zip
(output omitted)

. spshape2dta tl_2016_us_county
(importing .shp file)
(importing .dbf file)
(creating _ID spatial-unit id)
(creating _CX coordinate)
(creating _CY coordinate)

file tl_2016_us_county_shp.dta created
file tl_2016_us_county.dta     created

spshape2dta did its magic and created two Stata datasets for us. One is a Stata-format shapefile:


The other is a Stata dataset containing the other data that were in the shapefile bundle:


spshape2dta also linked the two files.

Step 2: Prepare the data for analysis

We have our own analysis data for these counties in texas_ue.dta. We could just use them and skip the shapefile, but our data do not have the coordinates of the counties. We could not calculate distances or find neighbors. We could not do an SAR analysis. That's why we got the shapefile from the U.S. Census; it will provide all of this information.

We will merge our data with tl_2016_us_county.dta. We will first create an ID variable for merging the files. We also tell Sp that the Census provided the coordinates in latitude and longitude and that we want distances reported in miles.

. use tl_2016_us_county

. generate long fips = real(STATEFP + COUNTYFP)

. spset fips, modify replace
(output omitted)

. spset, modify coordsys(latlong, miles)

      Sp dataset: tl_2016_us_county.dta
Linked shapefile: tl_2016_us_county_shp.dta
            Data: Cross sectional
 Spatial-unit ID: _ID (equal to fips)
     Coordinates: _CY, _CX (latitude-and-longitude, miles)

. save, replace
file tl_2016_us_county.dta saved

. use texas_ue, clear

. merge 1:1 fips using tl_2016_us_county, keep(match)
(output omitted)

Our data are ready for analysis.

Step 3: Creating the spatial weighting matrices

We plan on fitting a model with spatial lags of the dependent variable, spatial lags of a covariate, and spatial autoregressive errors. Spatial lags are defined by spatial weighting matrices. We will use one matrix for the variables and another for the errors.

Sp provides many ways to create spatial weighting matrices. We will use just two of its predefined formulations:

. spmatrix create contiguity W

. spmatrix create idistance M

. spmatrix dir

Weighting matrix name N x N Type Normalization
M 254 x 254 idistance spectral
W 254 x 254 contiguity spectral

We created W to be a contiguity matrix based on nearest neighbors.

We created M to be the inverse of the distance between counties.

We let Sp perform its default normalization, which is spectral (largest eigenvalue). We could have chosen row or min–max normalization.

Sp also provides commands that let you create custom weighting matrices. You can create them from Stata data by writing Mata code or by importing them from a text file. The Mata capability is of special interest because it is so easy to use.

Step 4: Fitting the model

We have prepared our data.

We have defined the spatial weighting matrices we need.

We can now fit our model.

. spregress unemployment college, gs2sls dvarlag(W) ivarlag(W:college) errorlag(M)

  (254 observations)
  (254 observations (places) used)
  (weighting matrices define 254 places)

Estimating rho using 2SLS residuals:

initial:       GMM criterion =  .00565316
alternative:   GMM criterion =  .00235416
rescale:       GMM criterion =  .00004209
Iteration 0:   GMM criterion =  .00004209
Iteration 1:   GMM criterion =  7.292e-06
Iteration 2:   GMM criterion =  7.195e-06

Estimating rho using GS2SLS residuals:

Iteration 0:   GMM criterion =  .01457475
Iteration 1:   GMM criterion =   .0120462
Iteration 2:   GMM criterion =   .0118037
Iteration 3:   GMM criterion =  .01180212
Iteration 4:   GMM criterion =  .01180014
Iteration 5:   GMM criterion =   .0117996
Iteration 6:   GMM criterion =  .01179915
Iteration 7:   GMM criterion =  .01179899
Iteration 8:   GMM criterion =  .01179888
Iteration 9:   GMM criterion =  .01179883

Spatial autoregressive model                    Number of obs     =        254
GS2SLS estimates                                Wald chi2(3)      =      43.52
                                                Prob > chi2       =     0.0000
                                                Pseudo R2         =     0.2081

unemployment Coef. Std. Err. z P>|z| [95% Conf. Interval]
college -.067419 .0122785 -5.49 0.000 -.0914843 -.0433536
_cons 5.715733 .4246597 13.46 0.000 4.883415 6.54805
college -.0424388 .0213227 -1.99 0.047 -.0842306 -.000647
unemployment .2058481 .0961201 2.14 0.032 .0174562 .39424
e.unemploy~t 3.247298 1.369204 2.37 0.018 .5637078 5.930888
Wald test of spatial terms: chi2(3) = 9.77 Prob > chi2 = 0.0206

We used the generalized spatial two-stage least-squares (GS2SLS) estimator. The GS2SLS estimator lets us fit multiple spatial lags, potentially allowing us to better approximate the true spatial dependence. Alternatively, we could have used a maximum likelihood estimator to fit the model.

Our dependent variable is the unemployment rate (unemployment) and we think that unemployment is affected by the proportion of the adult population who hold college degrees (college).

Step 5: Interpreting the model

If you are familiar with SAR models, you know they are difficult to interpret because the coefficients are a combination of direct and indirect effects. Direct effects are the effects of the spatial unit on itself. Indirect effects are the effects spatial units have on other spatial units, also known as spillover effects.

Stata's estat impact command splits out those effects.

. estat impact

direct     :100%
indirect   :100%
total      :100%

Average impacts                                 Number of obs     =        254

dy/dx Std. Err. z P>|z| [95% Conf. Interval]
college -.0690788 .0121434 -5.69 0.000 -.0928795 -.0452781
college -.0586373 .0299383 -1.96 0.050 -.1173152 .0000407
college -.1277161 .0320147 -3.99 0.000 -.1904638 -.0649684

estat impact is essential for interpreting the results of SAR models and works after all Sp estimators. We see that a 1-percentage point increase in those holding college degrees in a county reduces unemployment by 0.07 percentage points in that same county, the direct effect. The spillover effect to neighboring counties is almost as large—a 0.06 percentage point expected reduction in unemployment.

Tell me more

Learn more about Stata's spatial autoregressive models features.

There is an entire manual dedicated to SAR, and it has friendly introductions to the subject. See the Spatial Autoregressive Models Reference Manual.