Re: st: RE: Testing for instrument relevance and overidentification when the endogeneous variable is used in interaction terms

From   Jason Wichert <>
Subject   Re: st: RE: Testing for instrument relevance and overidentification when the endogeneous variable is used in interaction terms
Date   Tue, 11 Jun 2013 10:53:40 +0200

Apologies in advance for the long post. But since both Mark and I were
about to lose track of the issue, and since I have put a bunch of
additional, more or less profound thoughts into it, here’s a rather
extensive synthesis of the matter. Maybe it will attract some
additional input from other readers. At the very least, it might help
other PhD students with similar issues.

To summarize:

In my model, the dependent variable “y” is influenced by two exogenous
variables (ex1, ex2). ex1 and ex2 are distinct constructs, so at the
very least interactions between the two can be ruled out. However, the
effects of both ex1 and ex2 on y are moderated by my endogenous
variable “en”, each in a different fashion. For some further empirical
background:

There’s an inverted u-shaped association between ex1 and y, and a
negative association between ex2 and y. The effect of “en” on y seems
negative, if significant at all. Taking interactions into account, the
relationship (listing only significant coefficients) looks somewhat
like the following according to preliminary OLS and 2SLS results:

y = 0.363 ex1 – 0.032 (ex1)^2 – 0.183 ex2 – 0.007 ex1_en + 0.080 ex2_en
    + 0.001 (ex1)^2_en – 0.005 (ex2)^2_en + controls


where
 y = dependent variable
 ex1 = exogenous variable 1
 ex2 = exogenous variable 2
 (ex1)^2 = exogenous variable 1 squared
 (ex2)^2 = exogenous variable 2 squared
 en = endogenous variable
 ex1_en = interaction of ex1 and en
 ex2_en = interaction of ex2 and en
 (ex1)^2_en = interaction of exogenous variable 1 squared and en
 (ex2)^2_en = interaction of exogenous variable 2 squared and en
 controls = a bunch of control variables, including industry and year
   (or industry-year) fixed effects

Thus, the endogenous variable seems to have a negative moderating
effect on the relationship between ex1 and y, which is attenuated at
higher levels of ex1. Furthermore, the endogenous variable seems to
have an overall positive moderating effect on the relationship between
ex2 and y, which is attenuated at higher levels of ex2. Apart from
control variables and instruments (the strength and exogeneity of
which are up for debate), the endogenous variable is highly influenced
by both ex1 and ex2, predominantly by ex1. Hence, while I can’t rule
out omitted variable bias, at the very least my analyses seem to be
subject to simultaneity bias, calling for instrumental variable approaches.

The basic problem is the multiple non-linear endogenous interaction
terms, i.e. ex1_en, ex2_en, (ex1)^2_en, and (ex2)^2_en. Simply
predicting “en” as “enhat” (to denote the first stage predicted value of
the endogenous variable) in a first stage regression, plugging the
fitted values into the second stage equation, and generating
interaction terms of enhat with the exogenous variables and
their squared values is out of the question; this would be a classical
case of what Wooldridge refers to as a “forbidden regression”. Since the
interaction terms are nonlinear functions of an endogenous variable,
they need to be predicted/instrumented as well. Hence, for up to five
endogenous variables (i.e., “en” as well as the two interactions each
with ex1 and ex2), we need a total of at least five instruments.
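
Since names such as (ex1)^2_en are shorthand and not legal Stata
variable names, here is a minimal sketch of how the five endogenous terms
could be set up. The names ex1sq, ex2sq, ex1sq_en and ex2sq_en are
hypothetical and only used for illustration; y, ex1, ex2 and en are
assumed to be in memory already.

```stata
* Hypothetical variable names; ex1, ex2 and en are assumed to exist
generate ex1sq    = ex1^2
generate ex2sq    = ex2^2

* The four interaction terms that, together with en, are endogenous
generate ex1_en   = ex1*en
generate ex2_en   = ex2*en
generate ex1sq_en = ex1sq*en
generate ex2sq_en = ex2sq*en
```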

In the following, I try to summarize the different approaches
discussed here on statalist and found in the literature, as well as
some caveats and pitfalls which we already discussed, I have
encountered or thought of.

a) let’s call it the “ignorance is bliss” approach: I found a comment
stating endogeneity bias to be attenuated when the endogenous variable
is included in interaction terms with continuous exogenous variables.
The paper cited by the author of this comment is an
unpublished working paper I couldn’t get access to so far, and absent
any proof or examination, my intuition has a hard time taking this
statement at face value. In my case, where ex1 has a largely positive
effect on both the dependent variable and the endogenous
variable, the significantly negative interaction between the two seems
to indicate a lesser problem of endogeneity. In contrast, endogeneity
very much does seem to be an issue for ex2, which has a negative
effect on the dependent variable, whereas the interaction of ex2 and
the endogenous variable has a positive effect. In particular,
I’m afraid this positive interaction effect might be caused by the
positive association between ex1 and “en”. If anyone has any further
evidence or intuition concerning the “endogeneity is less of a problem
when the endogenous variable is interacted with continuous exogenous
variables” statement, please chime in.

b)  the standard approach: denoting a potential instrument as “z”, the
main approach discussed here on statalist for interactions of
endogenous variables is instrumenting both the endogenous variable as
such and its interaction term. As additional instrument(s),
interaction terms of the exogenous variable and the instrument are
used:

ivreg2 y ex (en en_ex = z ex_z)

In my case, this would translate to the following stata command,
leaving control variables aside and using just 1 instrument:

[1] ivreg2 y ex1 ex2 (en ex1_en ex2_en (ex1)^2_en (ex2)^2_en = z ex1_z
ex2_z (ex1)^2_z (ex2)^2_z)

In the case of say two instruments, this already expands to

[2] ivreg2 y ex1 ex2 (en ex1_en ex2_en (ex1)^2_en (ex2)^2_en = z1
ex1_z1 ex2_z1 (ex1)^2_z1 (ex2)^2_z1 z2 ex1_z2 ex2_z2 (ex1)^2_z2
(ex2)^2_z2)

To start my analyses, I’m planning to gradually extend the
presented model, starting from incorporating the endogenous variable
alone, such as

[3] ivreg2 y ex1 ex2 (en = z1 z2) – let’s dub this the “base case”

to the models incorporating the interactions with the linear and then
also the squared values of ex1 and ex2.
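
A minimal sketch of this stepwise extension, restricted to the linear
interactions. It assumes two instruments z1 and z2 and that ex1_en and
ex2_en already exist as ex1*en and ex2*en; the names ex1_z1, ex2_z1,
ex1_z2 and ex2_z2 are hypothetical. ivreg2 is user-written and needs to
be installed from SSC first.

```stata
* ivreg2 is user-written: ssc install ivreg2
* Hypothetical names: products of the exogenous variables and each instrument
foreach z of varlist z1 z2 {
    generate ex1_`z' = ex1*`z'
    generate ex2_`z' = ex2*`z'
}

* [3] base case: only en is treated as endogenous
ivreg2 y ex1 ex2 (en = z1 z2), first

* Linear interactions added: 3 endogenous regressors, 6 excluded instruments
* (assumes ex1_en and ex2_en were generated as ex1*en and ex2*en)
ivreg2 y ex1 ex2 (en ex1_en ex2_en = z1 z2 ex1_z1 ex2_z1 ex1_z2 ex2_z2), first
```

The squared terms would be added in the same way, which doubles the
number of interacted instruments again.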

When including solely the endogenous variable without any interaction
terms, i.e. running [3], all test statistics are more than fine;
instruments seem strong (not just by the F-stat, but also by the
significance levels of the respective instruments in the first stage
regression), exogenous, and not redundant (although I didn’t look into
the latter too much). In particular, the plan is to present
F-statistics, the Sargan/Hansen statistic (when more than one
instrument is being used), as well as an endogeneity test
(Durbin/Wu/Hausman) for “en”.

In the “extended cases” [1] and [2], there are a total of five
endogenous variables and first stage regressions. So simply reporting
individual first stage F-statistics is out of the question. Apart from
the endogeneity tests as well as the Sargan/Hansen statistic (in the
case of overidentification), Cragg/Donald F-statistics for weak
identification and – depending on those results – Kleibergen/Paap
F-statistics for underidentification come to mind. If no signs of
underidentification show up, Anderson/Rubin statistics can be left
aside as well. However, Angrist/Pischke statistics for the individual
endogenous regressors will be reported.

Some brief notes and thoughts on those statistics:

Cragg/Donald F-stat:
 - tests the null of weak identification, so we want to reject
 - equivalent to the regular first stage F-stat in the case of just
one endogenous variable
 - refers to all endogenous regressors together, doesn’t indicate
which one is weakly identified. Failure to reject might be due to all
endogenous variables or just one.
 - Stock/Yogo (2005) provide critical values for relative bias and size
distortion for up to 3 endogenous regressors and 30 instruments. Since those
values all seem to be in a certain ballpark, being a bit “hand-wavey”
as Mark termed it will probably be considered appropriate.
 - only a valid statistic if errors are neither heteroskedastic nor
serially correlated; not an issue in my case

Kleibergen/Paap F-Stat:
 - tests the null of underidentification. If we already reject weak
identification according to C/D, K/P is unnecessary.
 - again, with a single endogenous variable, simply the regular first
stage F-stat
 - doesn’t necessarily rely on i.i.d. assumption, but not an issue here anyway
 - also, refers to all endogenous variables together. So failure to
reject doesn’t tell me which regressor(s) are underidentified.
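
For reference, a minimal sketch of how these statistics could be pulled
from ivreg2 after estimation, shown for the base case. The stored-result
names below are as I recall them from the ivreg2 help file and should be
treated as assumptions; "ereturn list" shows what is actually available
after a given run. Note that ivreg2 itself labels the underidentification
test as an LM statistic, and reports the Cragg/Donald Wald F and (with
robust or cluster) the Kleibergen/Paap rk Wald F under weak
identification.

```stata
* Assumes y, ex1, ex2, en, z1, z2 exist; ivreg2 installed from SSC
ivreg2 y ex1 ex2 (en = z1 z2), ffirst endog(en)

* Stored-result names as recalled from the ivreg2 help file (check ereturn list)
display "Cragg-Donald Wald F:       " e(cdf)
display "Underidentification stat.: " e(idstat) "  p = " e(idp)
display "Sargan/Hansen J:           " e(j) "  p = " e(jp)
display "Endogeneity test of en:    " e(estat) "  p = " e(estatp)

* With robust (or cluster), the Kleibergen-Paap rk Wald F is reported as well
ivreg2 y ex1 ex2 (en = z1 z2), ffirst robust
display "Kleibergen-Paap rk Wald F: " e(rkf)
```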

In my case, these F-stats pretty much implode when going from just one
endogenous variable to the extended sets following approaches [1] or [2].
I’ve come across some other empirical papers using a similar or the same
procedure, and the ones that actually do report F-statistics (C/D or
K/P) on the extended cases also show sharp declines in the F-values,
indicating weak or underidentification of some form. I’m just
wondering how come.

The way I understand the null of the C/D F-stat, it tests whether the
instruments are _jointly_ only weakly correlated to the endogenous
regressors. Could failure to reject (as indicated by F-stats around 8)
be due to the little correlation between say ex1_z to ex2_en,
considering ex1 and ex2 are completely distinct constructs, and
there’s a couple of these cases? At the same time, (ex2)^2_z does
little to explain ex2_en which ex2_z doesn’t already do, so some of my
instruments barely differ using the main approach [1]/[2]. Completely
dropping everything related to either ex1 or ex2 didn’t really help
any, either, and my research has to incorporate both ex1 and ex2. Does
anyone have any intuition on how the C/D F-stat *should* be expected
to behave going from [3] to [2]? Again, other papers with similar
approaches I found all document sharp declines in the C/D F-stat. But
with my limited understanding of the matter, I’d rather not base my
reasoning on my own econometric intuition and simple reference to
other papers that simply took the statistics at face value.

The Sargan/Hansen statistic of overidentification tests whether
_any_ of the instruments fail the orthogonality criterion. In the base
case I can confidently fail (interesting wording) to reject the null;
in the extended cases, the null is suddenly rejected. I know ex1 and
ex2 are highly correlated with the dependent variable, and so, by
construction, are the newly generated instruments such as ex1_z. Does
this seem plausible as the cause of the sudden rejection of the null?
Comparing results of different sets of instruments, as suggested by
Mark, has not helped me so far.

The Angrist/Pischke statistic for weak and underidentification of
individual endogenous regressors also provides mixed evidence.
Extending the model to incorporate en, ex1_en, ex2_en as endogenous
terms, instrumented accordingly with a set of (interacted) instruments
found valid for “en”, actually showed “en” to be identified the worst,
as indicated by a weakid AP F-stat of around 2.5, with weakid AP
F-stats of 5.7 and 19.6 for ex1_en and ex2_en respectively. The AP
underid F-stats were quite high, with p-values between 0 and 0.01.
According to the fully extended case, i.e. [2] but with z1, z2, and
z3, the AP F-stats for underidentification exhibit p-values between 0
and 0.001, whereas the F-stats for weak identification range between 3
and 8.

Naively assuming the instruments do their job quite well (after all,
they did fairly well in explaining “en” as sole endogenous variable)
and the poor results of the aforementioned statistics could be
explained, would any other statistics come to mind that are void of
those problems (if there are “problems” in the calculations of those
statistics at all)? In
particular I’d think about a comparison of the partial R² to Shea’s
partial R², or Shea’s partial R² as such.

c) generating instruments of the interacted exogenous variables with
fitted values of the endogenous variable: another approach we already
discussed generates fitted values of the endogenous variable first,
let’s call them “enhat”. After this preliminary step, in the main 2SLS
procedure, instruments are constructed by interacting this fitted
value with ex1, ex2, (ex1)^2, and (ex2)^2. So the Stata command is

ivreg2 y ex1 ex2 (ex1)^2 (ex2)^2 (en ex1_en ex2_en (ex1)^2_en
(ex2)^2_en = enhat enhat_ex1 enhat_ex2 enhat_(ex1)^2 enhat_(ex2)^2 )
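
A minimal sketch of this procedure, restricted to the linear
interactions. The names ex1sq, ex2sq, enhat_ex1 and enhat_ex2 are
hypothetical, and ex1_en and ex2_en are assumed to exist as ex1*en and
ex2*en. Controls are left aside, and the preliminary regression uses
two instruments z1 and z2 as an assumption.

```stata
* Preliminary step: fitted values of en from exogenous regressors and instruments
* (ex1sq and ex2sq are assumed to hold ex1^2 and ex2^2)
regress en ex1 ex2 ex1sq ex2sq z1 z2
predict double enhat, xb

* Generated instruments: enhat and its products with the exogenous variables
generate enhat_ex1 = enhat*ex1
generate enhat_ex2 = enhat*ex2

* enhat and its interactions enter as excluded instruments, not as regressors
ivreg2 y ex1 ex2 ex1sq ex2sq (en ex1_en ex2_en = enhat enhat_ex1 enhat_ex2), ffirst
```

With three endogenous terms and three generated instruments the model is
exactly identified, so no Sargan/Hansen statistic is available here.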

If I’m not mistaken, this procedure is also recommended by Wooldridge,
and I’ve seen it applied in two empirical papers. One only showed some
A/P F-stats for an extended case with multiple endogenous interaction
terms, the other paper showed C/D and K/P F-stats that were greatly
reduced from the base case to their version of extended case.

d) A variation of c): since a preliminary regression already predicted
fitted values for the endogenous variable, would it be viable to
include those in the second stage directly instead of using them as an
instrument? At the same time, I’d think of excluding the endogenous
variable from the set of instrumented variables. In essence, instead of

ivreg2 y ex1 ex2 (en ex1_en ex2_en = enhat enhat_ex1 enhat_ex2)

I would run

ivreg2 y ex1 ex2 enhat (ex1_en ex2_en = enhat_ex1 enhat_ex2)

So enhat would be an included instrument instead of an excluded
instrument. Therefore, it would still contribute to instrumenting the
endogenous interaction terms. And in my humble/naïve opinion, it would
be sufficiently predicted already, since it’s the fitted value.

Are there any upsides or downsides to this approach in contrast to c) ?

e) Using separate sets of instruments to predict separate endogenous
interaction terms: in one of the papers by an influential author I
found, the following approach seems to have been employed. They start
with getting fitted values of the endogenous variable, enhat. In a
second step, they run multiple first stage regressions to instrument
the endogenous interaction terms. In particular, they run

ex1_en = enhat_ex1 + controls + original uninteracted instruments

ex2_en = enhat_ex2 + controls + original uninteracted instruments

Thus, they use selective instruments for each endogenous interaction
term, e.g. not using enhat_ex2 to predict ex1_en. This seems to pretty
much resemble the recommendation from Wooldridge according to approach
“c)” as well, albeit using not each generated instrument for each
endogenous interaction term, but only the corresponding ones. I had
already thought about a similar procedure somewhere along approach
“b)”, since I’m afraid some of my instruments in that case are either
redundant or too much/too little correlated to other endogenous
variables or the dependent variable for the aforementioned test
statistics to show the results they do. Therefore, this might
attenuate my problem of too many (in some cases unnecessary)
instruments.

On a side note, I do not know whether they include “enhat” as such in
those first stage regressions of the interaction terms. If this
approach was valid, would enhat need to be included as instrument for
the interaction terms, or would the interactions such as enhat_ex2 for
ex2_en suffice?

Unfortunately, the authors do not show any multivariate statistics,
but only statistics for the prediction of the single endogenous
regressor en/enhat.

f) control function approach: Wooldridge suggests predicting (by all
the exogenous regressors and the to-be-excluded instruments) the
endogenous variable in a first step and then including the residual
from this regression in the main equations in which the endogenous
variable is used. He explains (taken from some lecture slides) “If we
believe y2 [the endogenous explanatory variable] has a linear RF with
additive normal error independent of z [the exogenous variables], the
addition of v2_hat [the first stage residual] solves the endogeneity
problem regardless of how y2 appears.” Unfortunately, even in his books I
didn’t find too many explanations on this procedure, or at least not
too many explanations I could make sense of. For the base case
[3], the results of the control function approach are equivalent to 2SLS.
Toying around with my extended case (but only non-quadratic
interactions), the results between “c)” and this control function
approach clearly differ when only the predicted residual is included.
Unless I made some stupid mistake, coefficients of the endogenous
terms are rather similar when also including interactions of ex1/ex2
and the predicted residual, although standard errors still differ quite
a bit. I’d greatly appreciate if anyone could chime in and provide some
insights on how the control function approach is supposed to be
incorporated with multiple endogenous (interaction) variables, i.e.
whether and how interactions of the residual need to be incorporated
as well.
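
To make the comparison concrete, here is a minimal sketch of the two
variants I toyed around with. The names v2hat, v2hat_ex1 and v2hat_ex2
are hypothetical, and ex1_en and ex2_en are assumed to exist as ex1*en
and ex2*en. Whether the residual interactions belong in the equation is
exactly the open question, so the last regression is one possible
variant and not a claim about what Wooldridge prescribes.

```stata
* First step: reduced form of en on all exogenous regressors and instruments
regress en ex1 ex2 z1 z2
predict double v2hat, residuals

* Base case: the coefficient on en should match 2SLS from [3]; the OLS
* standard errors ignore that v2hat is itself estimated
regress y ex1 ex2 en v2hat

* Extended case (linear interactions only), one possible variant:
* additionally interact the residual with ex1 and ex2
generate v2hat_ex1 = v2hat*ex1
generate v2hat_ex2 = v2hat*ex2
regress y ex1 ex2 en ex1_en ex2_en v2hat v2hat_ex1 v2hat_ex2
```

Because v2hat is a generated regressor, the second-step standard errors
would need adjusting, for example by bootstrapping both steps together.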

 Any feedback - no matter how extensive - is highly appreciated,
