
For two-stage least-squares (2SLS/IV/ivregress) estimates, why is the \(R^2\) statistic not printed in some cases?

For two-stage least-squares (2SLS/IV/ivregress) estimates, why is the model sum of squares sometimes negative?

For three-stage least-squares (3SLS/reg3) estimates, why are the \(R^2\) and model sum of squares sometimes negative?

Title   Negative and missing \(R^2\) for 2SLS/IV
Authors William Sribney, Vince Wiggins, and David Drukker, StataCorp

Background

Two-stage least-squares (2SLS) estimates, or instrumental-variables (IV) estimates, are obtained in Stata using the ivregress command.

ivregress sometimes reports no \(R^2\) and returns a negative value for the model sum of squares in e(mss).

Three-stage least-squares (3SLS) estimates are obtained using reg3. reg3 sometimes reports a negative \(R^2\) and model sum of squares. The discussion below focuses on 2SLS/IV; the issues for 3SLS are the same.

The short answer

Missing \(R^2\) values, negative \(R^2\) values, and negative model sums of squares are all symptoms of the same issue.

Stata's ivregress command suppresses the printing of an \(R^2\) on 2SLS/IV if the \(R^2\) is negative, which is to say, if the model sum of squares is negative.

Whether a negative \(R^2\) should be reported or simply suppressed is a matter of taste. At any rate, the \(R^2\) really has no statistical meaning in the context of 2SLS/IV.

If it makes you feel better, you can compute the \(R^2\) yourself from the returned results (see the An example section of this FAQ).
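For example, ivregress saves the model and residual sums of squares in e(mss) and e(rss), so one way to compute the (possibly negative) \(R^2\) yourself after estimation is

. display "R-squared: " e(mss)/(e(mss) + e(rss))

which uses the fact that TSS = MSS + RSS (see the formulas in the long answer below).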

For 2SLS, the endogenous right-hand-side variables are replaced by instruments when the parameters are estimated. However, because our goal is to estimate the structural model, the actual values of the endogenous variables, not their instruments, are used to determine the model sum of squares (MSS). The model's residuals are therefore computed over a set of regressors different from the one used to fit the model. As a result, a constant-only model of the dependent variable is not nested within the 2SLS model, even though the two-stage model estimates an intercept, and the residual sum of squares (RSS) is no longer constrained to be smaller than the total sum of squares (TSS). When RSS exceeds TSS, the MSS and the \(R^2\) are negative.

The long answer—how can an \(R^2\) be negative?

The formula for \(R^2\) is

\(R^2\) = MSS/TSS

where

MSS = model sum of squares = TSS − RSS and
TSS = total sum of squares = sum of \((y-\bar{y})^2\) and
RSS = residual (error) sum of squares = sum of \((y-Xb)^2\)

For your model, MSS is negative, so \(R^2\) would be negative.

MSS is negative because RSS is greater than TSS. RSS is greater than TSS because \(\bar{y}\) is a better predictor of \(y\) (in the sum-of-squares sense) than \(Xb\)!

How can \(Xb\) be worse than \(\bar{y}\), especially when the model includes the constant term? At first glance, this seems impossible. But it is possible with the 2SLS/IV model.

Here are the background essentials:

Let \(Z\) be the matrix of instruments (say, \(z1\), \(z2\), \(z3\), \(z4\)).

Let \(X\) be the matrix of regressors (say, \(y2\), \(y3\), \(z3\), \(z4\), where \(y2\) and \(y3\) are endogenous and \(z3\) and \(z4\) are exogenous).

Let \(y\) be the endogenous variable of interest. That is, we want to estimate \(b\), where

\(y = Xb + error\)

Let \(P = Z (Z'Z)^{-1} Z'\) be the projection matrix into the space spanned by \(Z\).

2SLS/IV gives point estimates

\(b = ((PX)' PX)^{-1} (PX)' y\)

The coefficients are simply those from an ordinary regression but with the predictors in the columns of \(PX\) (the projection of \(X\) into \(Z\) space).
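To make this concrete, here is a minimal Mata sketch showing that \(b = ((PX)'PX)^{-1}(PX)'y\) reproduces the ivregress point estimates. It borrows the auto.dta example that appears later in this FAQ, so price plays the role of \(y\), mpg is the endogenous regressor, and foreign and headroom (plus the constant) form \(Z\); the matrix names are ours.

sysuse auto, clear
quietly ivregress 2sls price (mpg = foreign) headroom
matrix list e(b)                  // ivregress point estimates, for comparison

mata:
    y  = st_data(., "price")
    X  = st_data(., ("mpg", "headroom")), J(st_nobs(), 1, 1)     // regressors plus constant
    Z  = st_data(., ("foreign", "headroom")), J(st_nobs(), 1, 1) // instruments plus constant
    PX = Z*invsym(Z'Z)*Z'X        // projection of X into the space spanned by Z
    b  = invsym(PX'PX)*PX'y       // 2SLS point estimates
    b'
end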

Let's assume you have two endogenous right-hand-side variables (\(y1\) and \(y2\)), two exogenous variables (\(x1\) and \(x2\)), and two instruments not in the structural equation (\(z1\) and \(z2\)). This makes your structural equation

\(y = (Y)B1 + (X)B2 + e\)

or

\(y = b1*y1 + b2*y2 + b3*x1 + b4*x2 + e\)

(where \(B1\) and \(B2\) are components of the vector of coefficients—\(b\)). If you run the following,

. regress y1 x1 x2 z1 z2
. predict yhat1
. regress y2 x1 x2 z1 z2
. predict yhat2
. regress y yhat1 yhat2 x1 x2

you will get exactly the coefficients of the 2SLS/IV model (although the reported standard errors will not be the correct 2SLS standard errors):

. ivregress 2sls y (y1 y2 = z1 z2) x1 x2

Now if we computed residuals after

. regress y yhat1 yhat2 x1 x2

the residuals would be

\(r = y - (PX)b\)

The sum of squares of these residuals could never exceed the total sum of squares, because these are ordinary least-squares residuals from a regression that includes a constant.

But these are not the right residuals for 2SLS/IV. Because we are fitting a structural model, we are interested in the residuals using the actual values of the endogenous variables.

The correct 2SLS residuals are

\(e = y - Xb\)

Here there is no guarantee that the sum of these squared residuals is less than the total sum of squares, because these residuals do not come from a model that nests a constant-only model of \(y\).

An example

Let's take a simple, and admittedly silly, example from our favorite dataset—auto.dta.

. sysuse auto, clear 
(1978 automobile data)

. ivregress 2sls price (mpg = foreign) headroom
 

Instrumental-variables 2SLS regression            Number of obs   =         74
                                                  Wald chi2(2)    =       1.15
                                                  Prob > chi2     =     0.5619
                                                  Root MSE        =     3363.6

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         mpg |   154.4941   239.2968     0.65   0.519     -314.519    623.5072
    headroom |   836.4137   821.6528     1.02   0.309    -773.9962    2446.824
       _cons |     371.36   7268.765     0.05   0.959    -13875.16    14617.88
------------------------------------------------------------------------------
Endogenous: mpg
Exogenous:  headroom foreign

There is your negative model sum of squares (−202135715), returned by ivregress in e(mss). The model sum of squares is simply the improvement the full model provides over the sum of squares about the mean. In this example, the sum of squared residuals from the model predictions is 837201111, whereas the sum of squared deviations about the mean of price is 635065396. Computing the model sum of squares as

. display "MSS: " %15.0f 635065396 -  837201111
MSS:      -202135715

we can see that our model actually performs worse than the mean of price. Why didn't our constant keep this from happening? The coefficients are estimated using an instrument for mpg. Thus, the constant need not provide an intercept that minimizes the sum of squared residuals when the actual values of the endogenous variables are used.

Just to be sure, let's perform the sum-of-squares computations by hand.

To get the sum of squared residuals for our model, type

. predict double errs, residuals

. generate double errs2 = errs*errs

. summarize errs2

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       errs2 |         74    1.13e+07    2.01e+07    3017.34   9.57e+07

. display "ESS: " %15.0f r(sum)
ESS:       837201111

which agrees exactly with the returned results from ivregress.

. display "ESS: " %15.0f e(rss)
 ESS:       837201111

To get the total sum of squared residuals about the mean of price, type

. summarize price

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       price |         74    6165.257    2949.496       3291      15906

. generate double pbarErr2 = (price - r(mean))^2

. summarize pbarErr2

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
    pbarErr2 |         74     8581965    1.69e+07    .065924   9.49e+07

. display "TSS: " %15.0f r(sum)
TSS:       635065396

So, our “hand” computations also give a model sum of squares of −202135715 and agree with the value returned by ivregress.
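To see the contrast with the "projected" residuals \(r = y - (PX)b\) discussed above, here is a short sketch that continues the same example; the variable names mpghat, e_proj, and e_struct are ours. Because the projected residuals are ordinary least-squares residuals from a regression that includes a constant, their sum of squares cannot exceed TSS, whereas the structural residuals, which use the actual values of mpg, produce the larger sum of squares computed above.

// First stage: project the endogenous regressor onto the instruments
quietly regress mpg foreign headroom
predict double mpghat, xb

// 2SLS fit of the structural equation
quietly ivregress 2sls price (mpg = foreign) headroom

// Structural residuals, e = y - Xb, use the actual values of mpg
predict double e_struct, residuals

// "Projected" residuals, r = y - (PX)b, replace mpg with its first-stage fit
generate double e_proj = price - (_b[mpg]*mpghat + _b[headroom]*headroom + _b[_cons])

// Compare the two residual sums of squares with TSS = 635065396
generate double e_struct2 = e_struct^2
generate double e_proj2   = e_proj^2
quietly summarize e_struct2
display "structural RSS: " %15.0f r(sum)
quietly summarize e_proj2
display "projected RSS:  " %15.0f r(sum)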

Is a negative \(R^2\) a problem?

What does it mean when RSS is greater than TSS? Does this mean our parameter estimates are no good? Not really. You can easily develop simulations where the parameter estimates from two-stage models are quite good, while the MSS is negative. Remember why we fit two-stage models. We are interested in the parameters of the structural equation—the elasticity of demand, the marginal propensity to consume, etc. If our two-stage model produces estimates of these parameters with acceptable standard errors, we should be happy—regardless of MSS or \(R^2\). If we were interested strictly in projections of the dependent variable, we should probably consider the reduced form of the model.

Another way of stating this point is that there are models where the distribution of 2SLS estimates of the parameters will be well approximated by its theoretical distribution but where the \(R^2\) computed from some samples will be negative. There are several ways of illustrating this point. Perhaps the most accessible is via simulation.

We simulate data from the model

(1) \(\quad y = 1 - 0.1*x + e1 + e2\)

(2) \(\quad x = w + c1 + 0.5*e1\)

(3) \(\quad z = 1.5*c1 + e3\)

where \(e1\), \(e2\), \(e3\), \(w\), and \(c1\) are all independent standard normal random variables. The \(c1\) term in (2) and (3) provides the correlation between \(x\) and \(z\). The \(e1\) term in (1) and (2) is the source of the correlation between \(x\) and the error term \((e1 + e2)\) for \(y\). The coefficient of \(-0.1\) on \(x\) is the parameter we are trying to estimate. We estimate this parameter by 2SLS using ivregress with y as the dependent variable, x as the endogenous variable, and z as the instrument for x. For each simulated sample, we construct y, x, and z from independent draws of \(e1\), \(e2\), \(e3\), \(w\), and \(c1\) and equations (1)–(3). Then we use

. ivregress 2sls y (x = z)

to estimate the coefficient \(-0.1\). For each simulated sample, we record the following statistics:

b1         estimate of the coefficient on \(x\) (true value \(-0.1\))
p          p-value of the test of the null hypothesis that b1 \(= -0.1\)
reject     1 if \(p\lt.05\) and 0 otherwise
r2         computed \(R^2\) (missing if mss \(\lt0\))
mss        value of the model sum of squares
rho_x1e    correlation between the endogenous variable \(x\) and the error \(e = e1 + e2\)
rho_x1z1   correlation between the endogenous variable \(x\) and the instrument \(z\)
fsf        first-stage F statistic
p_fsf      p-value of the first-stage F statistic

The Stata code for drawing 2,000 simulations of this model, estimating the coefficient \(−0.1\), computing the statistics of interest, and finally, summarizing the results, is saved in the file negr2.do. Each simulated sample contains 1,000 observations, so the results should not be attributed to a small sample size.
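For readers who want to experiment, here is a minimal sketch of a simulation program along the same lines. It is not the contents of negr2.do itself: the program name negr2sim and the seed are ours, and the correlation and first-stage statistics recorded by negr2.do are omitted for brevity.

program define negr2sim, rclass
    drop _all
    set obs 1000
    // Independent standard normal draws
    generate double e1 = rnormal()
    generate double e2 = rnormal()
    generate double e3 = rnormal()
    generate double w  = rnormal()
    generate double c1 = rnormal()
    // Equations (1)-(3)
    generate double x = w + c1 + 0.5*e1
    generate double z = 1.5*c1 + e3
    generate double y = 1 - 0.1*x + e1 + e2
    // 2SLS with z instrumenting x
    quietly ivregress 2sls y (x = z)
    return scalar b1 = _b[x]
    quietly test x = -0.1
    return scalar p      = r(p)
    return scalar reject = (r(p) < .05)
    return scalar mss    = e(mss)
    return scalar r2     = cond(e(mss) < 0, ., e(mss)/(e(mss) + e(rss)))
end

simulate b1=r(b1) p=r(p) reject=r(reject) mss=r(mss) r2=r(r2), ///
    reps(2000) seed(12345) nodots: negr2sim
summarize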

Here are the results we obtained with the summarize command:

. summarize

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          b1 |       2000   -.0981982    .0541345  -.2771809   .0765793
           p |       2000    .4945649    .2884685   .0002706   .9995125
      reject |       2000       .0485     .214874          0          1
          r2 |         64    .0068443    .0063426    .000051   .0264567
         mss |       2000    -78.4407    49.08486  -273.4773   47.94914
-------------+---------------------------------------------------------
     rho_x1e |       2000     .235859    .0300348   .1194255   .3460462
    rho_x1z1 |       2000    .5556971    .0216154   .4764362   .6183904
         fsf |       2000     448.584    50.32493   293.0595   617.9501
       p_fsf |       2000    2.62e-34    7.49e-33          0   3.29e-31

The results for rho_x1e, rho_x1z1, fsf, and p_fsf indicate that the correlation between the endogenous variable and the error term and the correlation between the endogenous variable and its instrument are reasonable and that there is no weak-instrument problem. The results for b1, p, and reject indicate that the mean estimate of the coefficient on \(x\) is very close to its true value of \(−0.1\) and that there is no size distortion in the test of the hypothesis that the coefficient on \(x\) equals \(−0.1\). In short, the distribution of the estimates b1 is very well approximated by its theoretical asymptotic distribution. Together, these results imply that the 2SLS estimator is performing according to theory in these simulations.

There are only 64 observations on r2 because mss \(\lt 0\) in 1,936 of the 2,000 samples, as the following count confirms:

. count if mss < 0
1,936

Thus, the results illustrate that there is at least one model for which the distribution of the 2SLS estimates of the parameters is very well approximated by its asymptotic distribution but for which the \(R^2\) computed in most of the individual samples is negative. To obtain more models that produce the same qualitative results, simply change the coefficient \(-0.1\) by a small amount. As one would expect, increasing the magnitude of this coefficient reduces the fraction of the simulated samples that produce a negative \(R^2\).