Title | Do-it-yourself R-squared
Author | Nicholas J. Cox, Durham University, UK
Date | September 2003
Users often request an R-squared value when a regression-like command in Stata appears not to supply one.
This FAQ looks at the question generally and discursively. There is a practical kernel explaining something that you can usually do and that is often of some help. Nevertheless, the FAQ is no substitute for the technicalities that may be crucial for particular models.
If Stata refuses to give you an R-squared, there may be a good explanation other than that the developers never got around to implementing it. Perhaps the R-squared does not seem to be a good measure for this model, on some technical grounds. You have to consult the literature or an expert to take this further, unless you are an expert, in which case you probably disagree with the other experts.
There is usually something you can do for yourself: calculate the correlation between the observed response and the predicted response and then square it. Here is the general idea illustrated:
. sysuse auto, clear
. regress weight length
. predict weightp if e(sample)
. corr weight weightp if e(sample)
. di r(rho)^2
Try it and see. Naturally, in this example, you get an R-squared from regress anyway, so you need not do this. But you can check that both routes give the same result, 0.8949 to 4 decimal places.
You can also use the correlation coefficient itself, which here we will call R.
Two crucial details to note:
This way of doing things opens up some other elementary possibilities, which become obvious when pointed out but are often overlooked. You can now get a basic graph of observed versus predicted responses, such as
. twoway scatter weight weightp || function y = x, ra(weightp) clpat(dash)
Sometimes this graph makes it clearer why you got a surprising value of R-squared. Similarly, you could calculate residuals and plot against the predicted responses. Such graphs can always be drawn, whatever the complexities of the model, and they can be useful.
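The residual plot just mentioned can be sketched as follows. This is a minimal illustration that reuses the regress example; the variable name weightres is an assumption made for illustration. The residual is computed by hand as observed minus predicted, which should carry over to any command for which predict can return values on the scale of the response.

```stata
* Refit the example model and get predictions on the scale of the response
sysuse auto, clear
regress weight length
predict double weightp if e(sample)

* Residual = observed minus predicted (weightres is an illustrative name)
generate double weightres = weight - weightp if e(sample)

* Residuals against predicted values, with a reference line at zero
twoway scatter weightres weightp, yline(0)
```

Structure in this plot, such as curvature or a fan shape, points to ways in which the model might be improved.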
It may be worth reminding ourselves of some positive virtues of R-squared (or R). In particular, Zheng and Agresti (2000) discuss the correlation between the response and the fitted response as a general measure of predictive power for generalized linear models (GLMs). Some of their arguments carry over to other classes of models. This measure has the advantage of referring to the original scale of measurement, of applying to all types of GLMs, and of being familiar to many users of statistics. Preferably, it should be used as a comparative measure for different models applied to the same dataset, given that restrictions on values of the response may imply limitations on its value (e.g., Cox and Wermuth 1992).
For an arbitrary GLM, this correlation is invariant under a location-scale transformation. It is the positive square root of the average proportion of variance explained by the predictors. However, again for an arbitrary GLM, it need not equal the positive square root of other definitions of R-squared (as will be discussed in a moment); and it need not be monotone increasing in the complexity of the predictors, although in practice that is common. The correlation is necessarily sensitive to outliers.
For many models, especially those with categorical responses, there are frequently several different supposed approximations or analogues to R-squared. Often they are labeled “pseudo”. Beware that they typically do not agree, even roughly. You need to look at the literature in your field and to realize that software and papers may often be unclear about precisely what was calculated. Long and Freese (2003, 91–94) and Hardin and Hilbe (2001, 45–49) are excellent sources of guidance on the animals in the zoo.
Thus, if you calculate the squared correlation between observed and predicted responses after logit, you will find that it is not what logit reports as pseudo–R-squared (the formula for pseudo–R-squared is documented in [R] maximize).
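A minimal sketch of that comparison, assuming the auto dataset and an arbitrary choice of predictors; the names phat and r2corr are illustrative only. The pr option asks predict for probabilities, so that the predictions are on the same scale as the observed 0/1 response.

```stata
* Fit a logit model (the predictors are chosen only for illustration)
sysuse auto, clear
logit foreign weight mpg

* Predicted probabilities for the estimation sample
predict double phat if e(sample), pr

* Squared correlation between observed and predicted responses
corr foreign phat if e(sample)
scalar r2corr = r(rho)^2

* Compare with the pseudo-R-squared that logit stores in e(r2_p)
display "squared correlation  = " r2corr
display "pseudo-R-squared     = " e(r2_p)
```

The two numbers answer different questions, so there is no reason to expect them to agree.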
Even if you now have an R-squared, it is only a single figure of merit. Resist the temptation to use it as a weapon or as a comforter. Your R-squared may be high because your model codifies tautology or truism. Predicting today's temperature from yesterday's temperature would get you a high R-squared and might be a practical model for some purposes, but it is not a contribution to science at this time. Alternatively, your R-squared may be low, but no indictment of your model, if the field is refractory and your dataset is problematic. As the R-squared from least-squares regression never decreases as you add covariates (predictors), a high R-squared may go with a model that on scientific or statistical grounds has too many covariates.
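The point about adding covariates can be checked directly for least-squares regression. The following is a sketch under stated assumptions: the extra predictors are chosen arbitrarily from the auto dataset, have no missing values (so both fits use the same observations), and the scalar names r2_small and r2_big are illustrative.

```stata
sysuse auto, clear

* Smaller model: R-squared is stored by regress in e(r2)
regress weight length
scalar r2_small = e(r2)

* Larger model: same response, same observations, more covariates
regress weight length mpg headroom trunk
scalar r2_big = e(r2)

* R-squared for the larger model cannot be lower than for the smaller one
display "R-squared: " r2_small " -> " r2_big
```

An increase here says nothing about whether the extra covariates belong in the model.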
There is likely to be a great deal of information about the limitations of the model, with implications for how it can be improved, in the detailed estimation results and residuals you can usually get from Stata, including graphical as well as numeric output. There is almost no such information in an R-squared.
Even if you now have an R-squared, it is at best a descriptive measure. It considers only the information on which it is based, no more and no less, and says nothing about the structure of the data in any sense (e.g., dependence or cluster structure). If you attempt to make inferences based on R-squared, or on R, they may be highly fragile, unless somehow they respect the character of the model. This applies to bootstrap and jackknife work as well.