Must I use all of my exogenous variables as instruments when estimating
instrumental variables regression?
Title: Two-stage least-squares regression
Author: Vince Wiggins, StataCorp
Date: July 2000; updated July 2005; minor revisions July 2011
Note: This model could also be fit with
sem, using
maximum likelihood instead of a two-step method.
You can find examples for recursive models fit with sem in
the “Structural models 2: Dependencies between endogenous
variables” section of [SEM] intro 4 — Tour of models.
Someone posed the following question:
I am estimating an equation:
Y = a + bX + cZ + dW
I then want to instrument W with Q. I know the first-stage regression is
supposed to be
W = e + fX + gZ + hQ
(i.e., use all the exogenous variables in the first stage). Actually this
is automatically done if I use the ivregress command. However, I only want
to use Q to instrument W without using X and Z in the first stage. Is there
a way I can do it in Stata? I can regress W on Q and get the predicted W,
and then use it in the second-stage regression. The standard errors will,
however, be incorrect.
ivregress will not let you do this and,
moreover, if you believe W to be endogenous
because it is part of a system, then you must include
X and Z as
instruments, or you will get biased estimates for b, c, and d.
Consider the system
Y1 = a0 + a1*Y2 + a2*X1 + a3*X2 + e1 (1)
Y2 = b0 + b1*Y1 + b2*X3 + b3*X4 + e2 (2)
Warning: Assume we are estimating structural equation (1); if X1 and X2
are exogenous, then they must be kept as instruments or your estimates
will be biased. In a general system, such exogenous variables must be used
as instruments for any endogenous variables when the instrumented value
for the endogenous variables appears in an equation in which the exogenous
variable also appears.
Consider the reduced forms of your two equations:
Y1 = e0 + e1*X1 + e2*X2 + e3*X3 + e4*X4 + u1 (1r)
Y2 = f0 + f1*X1 + f2*X2 + f3*X3 + f4*X4 + u2 (2r)
where e# and f# are combinations of the a# and b# coefficients from (1) and
(2), and u1 and u2 are linear combinations of the disturbances e1 and e2
from (1) and (2) (not to be confused with the reduced-form coefficients
e1 and e2 in (1r)).
All exogenous variables appear in each equation for an
endogenous variable. This is the nature of simultaneous systems, so
efficiency argues that all exogenous variables be included as
instruments for each endogenous variable.
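To see where the coefficients in (2r) come from, substitute (1) into (2) and solve for Y2. The following is only a sketch of that algebra; it assumes a1*b1 is not equal to 1, a condition the discussion above does not state explicitly. Here e_1 and e_2 are the disturbances of (1) and (2).

```latex
\begin{aligned}
Y_2 &= b_0 + b_1\,(a_0 + a_1 Y_2 + a_2 X_1 + a_3 X_2 + e_1) + b_2 X_3 + b_3 X_4 + e_2 \\
(1 - a_1 b_1)\,Y_2 &= (b_0 + a_0 b_1) + a_2 b_1 X_1 + a_3 b_1 X_2 + b_2 X_3 + b_3 X_4 + (b_1 e_1 + e_2) \\
f_0 &= \frac{b_0 + a_0 b_1}{1 - a_1 b_1}, \quad
f_1 = \frac{a_2 b_1}{1 - a_1 b_1}, \quad
f_2 = \frac{a_3 b_1}{1 - a_1 b_1}, \quad
f_3 = \frac{b_2}{1 - a_1 b_1}, \quad
f_4 = \frac{b_3}{1 - a_1 b_1} \\
u_2 &= \frac{b_1 e_1 + e_2}{1 - a_1 b_1}
\end{aligned}
```

Under this derivation, f1 and f2 are proportional to b1: X1 and X2 reach Y2 only through the feedback from Y1 in equation (2).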
Here is the real problem. Take (1): the reduced-form equation for
Y2, (2r), clearly shows that
Y2 is correlated with
X2 (by the coefficient
f2). If we do not
include X2 among the instruments for
Y2, then we will have failed to account for
the correlation of Y2 with
X2 in its instrumented values. Since we
did not account for this correlation, when we estimate (1) with the
instrumented values for Y2, the coefficient
a3 will be forced to account for this
correlation. This approach will lead to biased estimates of both
a1 and a3.
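The argument can be checked informally with simulated data. The sketch below is not part of the original discussion: the parameter values (a1 = b1 = 0.5, all other slopes 1, intercepts 0), the sample size, the seed, and the name y2hat_b are all hypothetical, and the exogenous variables are drawn as independent standard normals for simplicity. It generates Y2 from its reduced form, then fits (1) once with every exogenous variable as an instrument and once with a first stage that omits X1 and X2.

```stata
clear
set seed 2011
set obs 5000

* exogenous variables and disturbances (independent normals; an assumption)
generate double x1 = rnormal()
generate double x2 = rnormal()
generate double x3 = rnormal()
generate double x4 = rnormal()
generate double e1 = rnormal()
generate double e2 = rnormal()

* hypothetical true values: a1 = b1 = 0.5, a2 = a3 = b2 = b3 = 1, a0 = b0 = 0
* y2 comes from its reduced form; y1 then follows from structural equation (1)
generate double y2 = (0.5*x1 + 0.5*x2 + x3 + x4 + 0.5*e1 + e2) / (1 - 0.5*0.5)
generate double y1 = 0.5*y2 + x1 + x2 + e1

* (a) 2SLS with all exogenous variables used as instruments
ivregress 2sls y1 x1 x2 (y2 = x3 x4)

* (b) two-step fit whose first stage leaves out x1 and x2
regress y2 x3 x4
predict double y2hat_b
regress y1 y2hat_b x1 x2
```

Comparing the coefficients on x1 and x2 across (a) and (b) shows how much of the omitted correlation they absorb. Because the exogenous variables here are generated independently, the distortion should appear mainly in those two coefficients; with correlated exogenous variables it would generally spread to the coefficient on y2 as well.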
For a brief reference, see Baltagi (2002). See the whole discussion of
2SLS, particularly the paragraph after equation 11.40, on page 277.
(I have no idea why this issue is not emphasized in more books.)
Failing to include X4 affects
only efficiency and not bias.
However, there is one case where it is not necessary to include X1 and X2
as instruments for Y2. That is when the system is triangular, such that Y2
does not depend on Y1, but you believe it is weakly endogenous because the
disturbances are correlated between the equations. In this case, your
estimates are still consistent if you do what ivregress does and retain X1
and X2 as instruments; they are, however, no longer required. You could
then do what you suggested and simply regress on the predicted values of
the instrumented variable from the first stage.
If you do use this method of indirect least squares, you will have to
perform the adjustment to the covariance matrix yourself. Consider the
structural equation
y1 = y2 + x1 + e
where you have an instrument z1
and you do not think that
y2 is a function of
y1.
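The covariance adjustment mentioned above can be written out as follows. This is a sketch in notation that does not appear in the original: b0, b1, and b2 denote the second-stage estimates of the constant and of the coefficients on y2 and x1, n is the number of observations, k is the number of estimated coefficients, and V_OLS and s2_OLS are the covariance matrix and mean squared error reported by the second-stage regression on the predicted y2.

```latex
\begin{aligned}
\hat{\sigma}^2 &= \frac{1}{n - k} \sum_{i=1}^{n}
  \left( y_{1i} - \hat{b}_0 - \hat{b}_1\, y_{2i} - \hat{b}_2\, x_{1i} \right)^2 \\
\hat{V} &= \hat{V}_{\mathrm{OLS}} \cdot \frac{\hat{\sigma}^2}{s^2_{\mathrm{OLS}}}
\end{aligned}
```

The point is that the residuals in the first line use the actual y2, not its first-stage prediction, while the coefficient estimates themselves are left unchanged.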
The following example uses only z1
as an instrument for y2. Let’s begin
by creating a dataset (containing made-up data) on
y1, y2,
x1, and z1:
. sysuse auto
(1978 Automobile Data)
. rename price y1
. rename mpg y2
. rename displacement z1
. rename turn x1
Now we perform the first-stage regression and get predictions for the
instrumented variable, which we must do for each endogenous
right-hand-side variable.
. regress y2 z1
      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  1,    72) =   71.41
       Model |  1216.67534     1  1216.67534           Prob > F      =  0.0000
    Residual |  1226.78412    72  17.0386683           R-squared     =  0.4979
-------------+------------------------------           Adj R-squared =  0.4910
       Total |  2443.45946    73  33.4720474           Root MSE      =  4.1278
------------------------------------------------------------------------------
          y2 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          z1 |  -.0444536   .0052606    -8.45   0.000    -.0549405   -.0339668
       _cons |   30.06788   1.143462    26.30   0.000     27.78843    32.34733
------------------------------------------------------------------------------
. predict double y2hat
(option xb assumed; fitted values)
* perform IV regression
. regress y1 y2hat x1
      Source |       SS       df       MS              Number of obs =      74
-------------+------------------------------           F(  2,    71) =   12.41
       Model |   164538571     2  82269285.5           Prob > F      =  0.0000
    Residual |   470526825    71  6627138.38           R-squared     =  0.2591
-------------+------------------------------           Adj R-squared =  0.2382
       Total |   635065396    73  8699525.97           Root MSE      =  2574.3
------------------------------------------------------------------------------
          y1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       y2hat |  -463.4688    117.187    -3.95   0.000    -697.1329   -229.8046
          x1 |  -126.4979   108.7468    -1.16   0.249    -343.3328    90.33697
       _cons |   21051.36   6451.837     3.26   0.002     8186.762    33915.96
------------------------------------------------------------------------------
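Now we correct the variance–covariance matrix by applying the correct mean
squared error. The residuals must be computed from the actual y2 rather
than from its first-stage prediction, so we temporarily swap the two
variable names before calling predict: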
. rename y2hat y2hold
. rename y2 y2hat
. predict double res, residual
. rename y2hat y2 /* put back real y2 */
. rename y2hold y2hat
. replace res = res^2
(74 real changes made)
. summarize res
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         res |        74     7553657    1.43e+07   117.4375   1.06e+08
. scalar realmse = r(mean)*r(N)/e(df_r)    /* much ado about small sample */
. matrix bmatrix = e(b)
. matrix Vmatrix = e(V) * realmse / e(rmse)^2
. ereturn post bmatrix Vmatrix, noclear
. ereturn display
------------------------------------------------------------------------------
             |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       y2hat |  -463.4688   127.7267    -3.63   0.001    -718.1485    -208.789
          x1 |  -126.4979   118.5274    -1.07   0.289    -362.8348    109.8389
       _cons |   21051.36   7032.111     2.99   0.004      7029.73    35072.99
------------------------------------------------------------------------------
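For comparison, the same equation can be fit with ivregress, which keeps x1 among the instruments in the first stage. The sketch below simply rebuilds the example dataset and calls ivregress with its small option (small-sample statistics) and first option (display the first-stage regression); it is offered as a point of reference, not as output from the original session.

```stata
* rebuild the example dataset used above
sysuse auto, clear
rename price y1
rename mpg y2
rename displacement z1
rename turn x1

* 2SLS: y2 is instrumented by z1, and x1 enters the first stage automatically
ivregress 2sls y1 x1 (y2 = z1), small first
```

The point estimates will generally differ from the hand-computed ones above, because the first stage here regresses y2 on both z1 and x1 rather than on z1 alone.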
Reference
- Baltagi, B. H. 2002. Econometrics. New York: Springer.