Re: st: Mediating variables

From   "Stas Kolenikov" <>
Subject   Re: st: Mediating variables
Date   Mon, 6 Oct 2008 12:45:19 -0500

You would want to read up on econometric systems of simultaneous
equations estimated by -reg3- in Stata; as I said on SEMNET a couple
of times, it is amazing that this foundational methodology is not
taught in the standard social science quant sequences, and not covered
enough in SEM courses.

Following those econometric guidelines, we can see that the equation
for r is identified, since it does not have any endogenous variables
on the right-hand side. However, the equation for Y is
underidentified, as it fails the order condition, which says that the
number of excluded exogenous variables (here, none) should be at least
as great as the number of included endogenous variables (here, one).
You can estimate the equation for r with OLS; however, the equation
for Y is not estimable by any method unless you bring in additional
restrictions or variables. In terms of instrumental variables, your
prediction r-hat from the first-stage regression will be perfectly
collinear with x, and hence the second-stage linear regression for the
first equation will break down.
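
To see the collinearity problem concretely, here is a minimal sketch in plain Python (not Stata). The parameter values, the sample size and the helper names are all made up for illustration. It simulates the r equation, runs the first stage with x as the only available instrument, and checks that r-hat is an exact linear function of x.

```python
import random

def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def correlation(u, v):
    """Pearson correlation of two equal-length lists."""
    n = len(u)
    mu = sum(u) / n
    mv = sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

random.seed(1)
n = 500
x = [random.gauss(0, 1) for _ in range(n)]
eps2 = [random.gauss(0, 1) for _ in range(n)]
# r = a2 + d*x + epsilon2, with illustrative values a2 = 0.5, d = 0.8
r = [0.5 + 0.8 * xi + e for xi, e in zip(x, eps2)]

# First stage: the only exogenous variable available is x itself.
a2_hat, d_hat = ols_simple(x, r)
r_hat = [a2_hat + d_hat * xi for xi in x]

# r_hat is a linear function of x, so the two are perfectly correlated
# and cannot both enter the second-stage regression for Y.
corr_rhat_x = correlation(r_hat, x)
print(abs(corr_rhat_x))
```

The printed value equals 1 up to rounding error, which is why the second stage has nothing to separate b from c.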

If you specify this as a structural equation model, then you have 6
moments (three variances and three covariances), but seven parameters
(d, b, c, Var[x] which is exactly identified, and three elements in
the variance-covariance matrix of epsilons). You can again show that
the second equation is (exactly) identified -- that's an OLS, in the
end. But you cannot identify the first equation unless you impose some
additional assumptions, such as zero correlation of epsilons -- this
is what Mplus or other SEM software might be doing implicitly, but
econometric techniques insist on allowing the epsilons to be
correlated, and you seem to be interested in that, too. If you do
impose the zero-correlation restriction though, your model will be
just identified, and you won't have any degrees of freedom to test
whether the correlation is indeed zero. So I have no idea how this was
done in the earlier work on mediation you mentioned if the model is
underidentified. The only ways out are an extra restriction of that
kind or an additional exogenous variable that is excluded from the Y
equation.
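
The moment counting can be checked numerically. The sketch below (plain Python; the function names and all numbers are illustrative assumptions) computes the six covariance moments of (x, r, Y) implied by the two equations, assuming x is uncorrelated with both epsilons. It then builds a second parameter set with a different b, c and error covariance matrix that implies exactly the same moments, so no amount of data could tell the two apart.

```python
def implied_moments(b, c, d, var_x, s11, s22, s12):
    """Covariances of (x, r, Y) implied by
         Y = a1 + b*r + c*x + eps1
         r = a2 + d*x + eps2
    where x is uncorrelated with both errors, Var(eps1) = s11,
    Var(eps2) = s22 and Cov(eps1, eps2) = s12."""
    total = c + b * d  # reduced-form effect of x on Y
    return {
        "var_x": var_x,
        "cov_x_r": d * var_x,
        "var_r": d * d * var_x + s22,
        "cov_x_Y": total * var_x,
        "cov_r_Y": d * total * var_x + b * s22 + s12,
        "var_Y": total * total * var_x + b * b * s22 + 2 * b * s12 + s11,
    }

def shifted(b, c, d, var_x, s11, s22, s12, delta):
    """Move b by delta and adjust c, s11, s12 so every moment is unchanged."""
    return (b + delta,
            c - delta * d,
            d,
            var_x,
            s11 + delta * delta * s22 - 2 * delta * s12,
            s22,
            s12 - delta * s22)

# (b, c, d, Var[x], Var(eps1), Var(eps2), Cov(eps1, eps2)): seven parameters
theta_a = (0.5, 0.3, 0.8, 1.0, 1.0, 1.0, 0.4)
theta_b = shifted(*theta_a, delta=0.25)

moments_a = implied_moments(*theta_a)
moments_b = implied_moments(*theta_b)
max_gap = max(abs(moments_a[k] - moments_b[k]) for k in moments_a)

print(theta_a)
print(theta_b)
print(max_gap)
```

Note that d, Var[x] and Var(eps2) are the same in both sets (the r equation is identified), while b, c, Var(eps1) and Cov(eps1, eps2) move together along a one-dimensional family. Fixing Cov(eps1, eps2) at zero picks out a single member of that family, which is the just-identified case.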

As for the panel aspect of your data, I have not seen this done in a
panel setting, although I imagine it is known in econometrics. Again,
with covariance structure modeling and balanced panels, you can
represent your model in the "wide" format, with variables x1, r1, Y1
for the first period and so on up to xT, rT, YT for the last period,
and come up with a covariance structure model with tons of parameter
restrictions (all the parameters being the same in all periods). To be
true to your data, you would need to incorporate some panel effects u1
and u2 in the two equations that are common to all time periods, with
epsilon1 and epsilon2 being distinct in each time period. Establishing
identification of such a model will be difficult to extremely
difficult, although I imagine you can just go along the lines of the
"all observed"/simultaneous equations system, incorporating the known
restrictions. My intuition says that, if this model is identified for
large enough T, you would need at least three time periods to get
anything sensible.

It does not seem like you can get enough leverage out of -reg3- on
this occasion, as my brief look through it suggests that you cannot
specify a parametric structure for the residual covariance matrix
(which will have some sort of block/Kronecker product structure based
on the covariances of the unique errors epsilon and the panel-level
errors u). You should be able to set this up as a GLLAMM model with
three levels: level 1, the response variables; level 2, a single time
occasion; level 3, the person (or whatever your longitudinal unit is).
For GLLAMM, you would need to represent your data in long format, with
a single response variable holding all of the r's and Y's in all time
periods... and all the accompanying mess of specifying GLLAMM models.
You could probably write your own likelihood for -ml- (method d0), but
it should be easier just to figure out GLLAMM for this.
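
As a rough illustration of the block/Kronecker structure mentioned above, here is a plain-Python sketch under assumptions that go beyond the thread: the u's are independent of the epsilons, the epsilons are uncorrelated across periods, and all variances are constant over time. The names sigma_u and sigma_eps and their values are hypothetical. With the errors of one unit stacked as (equation 1, t = 1..T; equation 2, t = 1..T), the covariance matrix is kron(sigma_u, J) + kron(sigma_eps, I), where J is a T-by-T matrix of ones and I is the identity.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of lists."""
    return [[a_ij * b_kl for a_ij in a_row for b_kl in b_row]
            for a_row in a for b_row in b]

def mat_add(a, b):
    """Elementwise sum of two equally sized matrices."""
    return [[p + q for p, q in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]

T = 3  # number of time periods (illustrative)

# Hypothetical 2x2 covariance matrices of the panel effects (u1, u2)
# and of the period-specific errors (epsilon1, epsilon2).
sigma_u = [[0.5, 0.2],
           [0.2, 0.4]]
sigma_eps = [[1.0, 0.3],
             [0.3, 1.0]]

ones_T = [[1.0] * T for _ in range(T)]
eye_T = [[1.0 if s == t else 0.0 for t in range(T)] for s in range(T)]

# 2T x 2T covariance matrix of the composite errors u + epsilon
omega = mat_add(kron(sigma_u, ones_T), kron(sigma_eps, eye_T))

for row in omega:
    print(row)
```

Only six free numbers drive all 2T(2T+1)/2 distinct entries of omega, and that is the kind of parametric restriction on the residual covariance matrix that has to be imposed somewhere.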

Now, to the updates you posted: if you have another exogenous variable
q that affects r but does not affect Y, then the order condition
mentioned above is satisfied for the Y equation (one excluded
exogenous variable, q, for one included endogenous variable, r). You
would still need to check the rank condition, which here comes down to
q having a nonzero coefficient in the r equation, and it looks like
your system will be identified, then. And the other piece of good news
is that it will be estimable with -reg3- -- at least if you had i.i.d.
data; if you have panel data, then you might get more efficient
estimates by using that panel structure (assuming that panel errors u
are not correlated with anything else in your model).
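
Here is a small simulation of what the extra variable q buys you, again as a plain-Python sketch with made-up parameter values and i.i.d. data. The errors are correlated through a shared shock w, so OLS on the Y equation is biased, while the just-identified instrumental variables estimator using q recovers b. The estimator is computed by partialling x out of Y, r and q first, which for one instrument and one endogenous regressor gives the same coefficient as 2SLS. In Stata the comparable calls would be along the lines of `ivregress 2sls Y x (r = q)` or `reg3 (Y r x) (r x q)`.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample covariance (divisor n) of two equal-length lists."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def residualize(y, x):
    """Residuals from an OLS regression of y on x with an intercept."""
    slope = cov(x, y) / cov(x, x)
    intercept = mean(y) - slope * mean(x)
    return [yi - intercept - slope * xi for xi, yi in zip(x, y)]

random.seed(2008)
n = 20000
b_true, c_true, d_true = 0.5, 0.3, 0.8  # illustrative values

x = [random.gauss(0, 1) for _ in range(n)]
q = [random.gauss(0, 1) for _ in range(n)]  # affects r, excluded from Y
w = [random.gauss(0, 1) for _ in range(n)]  # shared shock: correlated errors
eps1 = [wi + random.gauss(0, 1) for wi in w]
eps2 = [wi + random.gauss(0, 1) for wi in w]

r = [0.5 + d_true * xi + 1.0 * qi + e for xi, qi, e in zip(x, q, eps2)]
Y = [1.0 + b_true * ri + c_true * xi + e for ri, xi, e in zip(r, x, eps1)]

# Partial x (and the constant) out of everything.
Y_t = residualize(Y, x)
r_t = residualize(r, x)
q_t = residualize(q, x)

b_ols = cov(r_t, Y_t) / cov(r_t, r_t)  # biased: r is correlated with eps1
b_iv = cov(q_t, Y_t) / cov(q_t, r_t)   # uses only the variation in r due to q
c_iv = cov(x, [yi - b_iv * ri for yi, ri in zip(Y, r)]) / cov(x, x)

print(b_ols, b_iv, c_iv)
```

With these settings OLS overstates b by roughly a third, while the IV estimate lands close to the 0.5 used to generate the data. If q had a zero coefficient in the r equation, the denominator of b_iv would be close to zero, which is the rank condition failing.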

Suggested references in econometrics: Davidson and MacKinnon,
Estimation and Inference in Econometrics; Wooldridge, Econometric
Analysis of Cross Section and Panel Data; Greene, Econometric
Analysis -- the latter is probably the lightest of them all, and has
the best explanation of the procedures to establish rank and order
conditions of identification. Suggested references on SEM vs
multilevel/panel models: Bauer, D, Estimating multilevel models as
SEMs (JEBS); Curran, P, Have multilevel models been SEMs all along?
(MBR). Suggested reading on GLLAMM: Stata Press books by R-H & S.

On Sun, Oct 5, 2008 at 1:17 PM, Jaime Gómez <> wrote:
> Dear Stata users
> I have a model in which the relationship between a predictor "x" and an
> outcome "y" is mediated by three factors ("r", "s" and "t"). I am only able
> to test whether one of the mediators ("r") mediates the relationship
> between "x" and "y" (I only have data on this mediating variable and I
> cannot get data on the other two). I would like to implement Baron and Kenny
> (1986)'s test for mediation. At least, this involves estimating the
> following system:
> Y=a1+b*r+c*x+epsilon1
> r=a2+d*x+epsilon2
> Given that the errors of the two equations are potentially correlated, it
> has been suggested that a 2SLS approach should be used. I have seen that
> this could be done with ivregress, provided that I can find data on at least
> one variable that affects "r" and does not affect "y". My doubts are the
> following:
> 1)      Given that I have a triangular system, do I have to use the
> traditional approach implemented by ivregress or the "modified" proposed in
> ? Are both valid?
> 2)      How do I test for the hypothesis that the errors are correlated? I
> have seen that the use of a Hausman test is suggested in the literature, but
> I do not know how to implement this in Stata (especially in the case I use
> the "modified" approach)
> 3)      Given that I have panel data, could I take advantage of the panel
> structure of my data to correct for the fact that I do not have information
> on two of the mediating variables ("s" and "t")? Is there a procedure in
> Stata for that?
> Thanks a lot
> Jaime Gómez

Stas Kolenikov, also found at
Small print: I use this email account for mailing lists only.
