Do you know what SEM is? (If you know what SEM is, read an overview of Stata’s SEM features.)

SEM stands for structural equation modeling. SEM is

- A notation for specifying structural equation models.
- A way of thinking about structural equation models.
- Methods for estimating the parameters of structural equation models.

Stata’s **sem** implements linear structural equation
models. **gsem** provides extensions to linear SEMs that allow for
generalized-linear models and multilevel models.

For those of you unfamiliar with SEM, it is worth your time to learn about
it if you ever fit linear regressions, multivariate linear regressions,
seemingly unrelated regressions, or simultaneous systems, or if you are
interested in generalized method of moments (GMM). With the
generalizations provided by **gsem**, it is also worth your time to learn
about SEM if you ever fit models with binary, count, ordinal, or nominal
responses or if you ever fit multilevel mixed-effects models, selection
models, or endogenous treatment-effects models.

Here, we provide an introduction to linear SEM, which is based on the linear model. What it brings to the table is flexible specification—nearly anything can be allowed to be correlated or constrained to be uncorrelated—and unobserved (latent) variables which can be treated (almost) as if they were observed.

**sem** fits the first and second moments of the distribution of observed
variables (means, variances, and covariances) rather than fitting the
observed values themselves. Both maximum likelihood and GMM methods are
available; for GMM, **sem** uses a weighting matrix corresponding to what
the SEM literature calls asymptotic distribution-free (ADF) estimation.

You still think of the model in the same way as usual, but in a model like

y_{j} = β_{0} + β_{1}x_{1j} + ... + β_{k}x_{kj} + e_{j}

let’s now call *e*_{j} the error. Reserve the word
residual for the true residuals of the SEMs, which are the differences
between the observed and predicted moments.
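
The distinction can be written out in symbols. The notation here is our own assumption and is not defined elsewhere in this introduction: let z̄ and S be the sample mean vector and sample covariance matrix of the observed variables, and let μ(θ̂) and Σ(θ̂) be the mean vector and covariance matrix implied by the fitted model. The residuals in the sense just described are then

$$
r_{\mu} = \bar{z} - \mu(\hat{\theta}), \qquad R_{\Sigma} = S - \Sigma(\hat{\theta})
$$

By contrast, *e*_{j} in the equation above is an error term attached to an individual observation, not a difference between moments.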

When **sem** is used to fit models that can be fit by the other linear
estimators, one of three things holds:

- The results are the same.
- The results are asymptotically the same, by which we mean they differ in finite samples and there is no theoretical reason to prefer one set of estimated results to the other.
- The results are asymptotically the same, and the **sem** results should be better in finite samples for theoretical reasons.
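
As a rough illustration of the first case, the do-file sketch below fits one linear regression with **regress** and again with **sem**. It uses the auto dataset that ships with Stata; the choice of dataset and variables is ours, purely for illustration, and is not an example taken from this page.

```stata
* Sketch: the same linear regression fit by regress and by sem.
* auto.dta and the variables used here are illustrative assumptions.
sysuse auto, clear

regress mpg weight length
scalar b_ols = _b[weight]          // least-squares coefficient on weight

sem (mpg <- weight length)
scalar b_sem = _b[mpg:weight]      // ML coefficient on the same path

* The two point estimates should agree up to convergence tolerance.
display "regress: " scalar(b_ols) "   sem: " scalar(b_sem)
```

The point estimates should match, while the reported standard errors can differ slightly because **sem** uses maximum likelihood formulas and does not apply the degrees-of-freedom adjustment that **regress** does.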

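Individual structural equation models are usually described using path diagrams, such as

[Figure: path diagram in which the latent variable **X** has paths pointing to the observed variables **x1**, **x2**, **x3**, and **x4**]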

This diagram is composed of

- Boxes and circles with variable names written inside them.
  - Boxes contain variables that are observed in the data.
  - Circles contain variables that are unobserved, known as latent variables.
- Arrows, called paths, that connect some of the boxes and circles.
  - When a path points from one variable to another, that means the first variable affects the second.
  - More precisely, if *s*->*d*, that means to add the term β_{k}*s* to the linear equation for *d*. β_{k} is called the path coefficient.
  - Sometimes small numbers are written along the arrow connecting two variables. That means β_{k} is constrained to be the value specified.
  - When no number is written along the arrow, the corresponding coefficient is to be estimated from the data. Sometimes symbols are written along the path arrow to emphasize this, and sometimes not.
  - The same path diagram used to describe the model can be used to display the results of estimation. In that case, estimated coefficients appear along the paths.
- Not shown above are curved, double-headed paths that are used to indicate covariances where they would not be otherwise assumed. Exogenous variables are assumed to be correlated.

Thus the above figure corresponds to the equations

x1 = α_{1} + β_{1}X + e.x1

x2 = α_{2} + β_{2}X + e.x2

x3 = α_{3} + β_{3}X + e.x3

x4 = α_{4} + β_{4}X + e.x4

There’s a third way of writing this model, namely

(x1<-X) (x2<-X) (x3<-X) (x4<-X)

This is the way we could write the model if we wanted to use
**sem**’s command syntax rather than drawing the model in
**sem**’s GUI. The full command we would type would be

. sem (x1<-X) (x2<-X) (x3<-X) (x4<-X)
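
To try the command without the manual's dataset, a do-file sketch along the following lines could be used. Everything about the simulated data (the seed, the sample size, the variable name ability, the loadings, and the error standard deviations) is a hypothetical choice of ours, made only so that there is something to fit.

```stata
* Sketch only: simulated scores standing in for four tests of one ability.
* Seed, sample size, loadings, and error SDs are all assumptions.
clear
set seed 20240
set obs 1000
generate ability = rnormal(0, 1)     // plays the role of latent X
generate x1 = 10 + 1.0*ability + rnormal(0, 0.8)
generate x2 = 12 + 1.2*ability + rnormal(0, 0.8)
generate x3 =  9 + 0.9*ability + rnormal(0, 0.8)
generate x4 = 50 + 7.0*ability + rnormal(0, 5.0)
drop ability                         // sem sees only the observed scores

* X is not in the data; its capitalized name marks it as latent by default.
sem (x1<-X) (x2<-X) (x3<-X) (x4<-X)
```

Because a latent variable has no natural scale, **sem** by default anchors it by constraining one path coefficient, here the one for **x1**, to 1. Writing the first path as (x1<-X@1) states the same constraint explicitly.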

However we write this model, what is it? It is a *measurement
model*, a term loaded with meaning for some researchers. **X**
might be mathematical ability. **x1**, **x2**, **x3**, and
**x4** might be scores from tests designed to measure mathematical
ability. **x1** might be the score based on your answers to a series
of questions after reading this section.

The model we have just drawn, written in mathematical notation, or written in Stata command notation can be interpreted in other ways too. Look at this diagram:

[Figure: path diagram of the same model with the boxes rearranged and **x4** renamed **y**]

Despite appearances, this diagram is identical to the previous diagram
except that we have renamed **x4** to be **y**. The fact that we
changed a name obviously does not matter substantively. The fact that we
have rearranged the boxes in the diagram is irrelevant, too; paths connect
the same variables in the same directions. The equations for the above
diagram are the same as the previous equations with the substitution of
**y** for **x4**:

x1 = α_{1} + β_{1}X + e.x1

x2 = α_{2} + β_{2}X + e.x2

x3 = α_{3} + β_{3}X + e.x3

y = α_{4} + β_{4}X + e.y

The Stata command notation changes similarly,

(x1<-X) (x2<-X) (x3<-X) (y<-X)

Many people looking at the model written in this way might decide that it
is not a measurement model but a measurement *error* model. **y**
depends on **X**, but we do not observe **X**. We do observe
**x1**, **x2**, and **x3**, each a measurement of **X** but
with error. Our interest is in knowing **β_{4}**, the
effect of true **X** on **y**.

A few others might disagree and instead see a model for interrater agreement. Obviously we have four raters who each make a judgment, and we want to know how well the judgment process works and how well each of these raters performs.

You are now ready to return to our description of Stata’s
**sem** command, but before you do, let us show you an example
we think will appeal to you.

In our documentation, we have an example of a single-factor measurement model, which is demonstrated using the following data:

| Variable | Obs | Mean | Std. Dev. | Min | Max |
|----------|-----|----------|----------|-----|-----|
| x1 | 123 | 96.28455 | 14.16444 | 54 | 131 |
| x2 | 123 | 97.28455 | 16.14764 | 64 | 135 |
| x3 | 123 | 97.09756 | 15.10207 | 62 | 138 |
| x4 | 123 | 690.9837 | 77.50737 | 481 | 885 |

As we mentioned above, if we rename variable **x4** to be **y**, we
can reinterpret this measurement model as a measurement *error*
model. In this interpretation, **X** is the unobserved true value.
**x1**, **x2**, and **x3** are each measurements of **X**,
but with error. Meanwhile, **y** (**x4**) is really something else
entirely. Perhaps **y** is earnings, and we believe

y = α_{4} + β_{4}X + e.y

We are interested in **β_{4}**, the effect of true **X** on **y**.

If we were to go back to the data and type **regress y x1**, we would
obtain an estimate of **β_{4}**, but we would expect that
estimate to be biased toward zero because of the errors-in-variables
problem. The same applies for **regress y x2** and **regress y x3**.

| Estimate of β_{4} based on | Value |
|----------------------------|-------|
| **regress y x1** | 4.09 |
| **regress y x2** | 3.71 |
| **regress y x3** | 3.70 |
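
The direction of the bias in these regressions can be worked out from the model equations. As a sketch, assume (this is our added assumption) that the errors e.x1 and e.y are uncorrelated with **X** and with each other. The slope from **regress y x1** then converges to

$$
\frac{\mathrm{Cov}(y, x1)}{\mathrm{Var}(x1)}
  = \frac{\beta_{4}\,\beta_{1}\,\mathrm{Var}(X)}{\beta_{1}^{2}\,\mathrm{Var}(X) + \mathrm{Var}(e.x1)}
$$

When the scale of **X** is set so that β_{1} = 1, this is β_{4} multiplied by Var(X)/(Var(X) + Var(e.x1)), a factor strictly between 0 and 1 whenever **x1** is measured with error, so the regression slope is pulled toward zero.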

In the example in our manual, we fit

. sem (x1<-X) (x2<-X) (x3<-X) (y<-X)

and we obtained

| Estimate of β_{4} based on | Value |
|----------------------------|-------|
| **sem (y<-X)** | 6.89 |

That **β_{4}** might be 6.89 seems plausible because we
expect the estimate to be larger than the estimates we obtain
using the variables measured with error. In fact, we can tell you that the
6.89 estimate is quite good because we at StataCorp know the true
value of **β_{4}**.
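
The same pattern can be reproduced on simulated data, where the true coefficient is known by construction. The do-file sketch below is not the manual's example: the seed, the sample size, the variable name truth, the true slope of 3, and all variances are hypothetical values we chose for illustration.

```stata
* Sketch only: errors-in-variables on simulated data (all values assumed).
clear
set seed 4321
set obs 5000
generate truth = rnormal(100, 10)        // unobserved true value (latent X)
generate x1 = truth + rnormal(0, 10)     // three noisy measurements of truth
generate x2 = truth + rnormal(0, 12)
generate x3 = truth + rnormal(0, 11)
generate y  = 5 + 3*truth + rnormal(0, 20)   // true slope on truth is 3
drop truth

regress y x1
scalar b_reg = _b[x1]                    // slope using one noisy measurement

sem (x1<-X) (x2<-X) (x3<-X) (y<-X)
scalar b_sem = _b[y:X]                   // slope on the latent variable

display "regress y x1: " %6.3f scalar(b_reg) "   sem (y<-X): " %6.3f scalar(b_sem)
```

With these settings, the variance of the measurement error in **x1** equals the variance of the true value, so in large samples the **regress** slope should be roughly half the true slope, while the **sem** estimate should land near 3. The **sem** coefficient is on the scale of **x1** because the path to **x1** is the one anchored at 1 by default.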

Now you can return to our description of Stata's linear structural equation modeling (SEM) features. If you are interested in multilevel modeling or models with binary, count, ordinal, or nominal response variables, you will also want to see the description of the generalized SEM features.