This FAQ is based on material that appeared on Statalist.
How can I get an R-squared value when a Stata command does not supply one?
Title: Do-it-yourself R-squared
Author: Nicholas J. Cox, Durham University, UK
Date: September 2003
1. The problem
Users often request an R-squared value when a regression-like command
in Stata appears not to supply one.
2. Warning: caveat lector
This FAQ looks at the question generally and discursively. There is a
practical kernel explaining something that you can usually do and that is
often of some help. Nevertheless, the FAQ is no substitute for the
technicalities that may be crucial for particular models.
3. Why is R-squared not supplied?
If Stata refuses to give you an R-squared, there may be a good
explanation other than that the developers never got around to implementing
it. Perhaps the R-squared does not seem to be a good measure for this
model, on some technical grounds. You have to consult the literature or an
expert to take this further, unless you are an expert, in which case you
probably disagree with the other experts.
4. What you can usually do
There is usually something you can do for yourself: calculate the
correlation between the observed response and the predicted response and
then square it. Here is the general idea illustrated:
. sysuse auto, clear
. regress weight length
. predict weightp if e(sample)
. corr weight weightp if e(sample)
. di r(rho)^2
Try it and see. In this example, regress reports an R-squared anyway, so you
need not do this. For that very reason, though, you can check that the two
agree: both results are 0.8949 to 4 decimal places.
You can also use the correlation coefficient itself, which here we will call R.
Two crucial details to note:
- The predicted response must be on the same scale as the response, up to a
linear transformation.
- Use if e(sample) to make sure everything is done for the
estimation sample only. (Here, the second if e(sample) is
redundant, given the first, but it does no harm, especially if it reminds
you which observations are being used.)
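As a sketch of the same recipe after a command that reports no R-squared, consider a gamma GLM with a log link fitted to the auto data. The model and the names pricehat, pricexb, r2_mu, and r2_xb are illustrative assumptions, not part of the original FAQ. The point is the first detail above: after glm, predict with the mu option returns fitted means on the scale of the response, whereas xb returns the linear predictor, which here is on the log scale and so is not a linear transformation of the response scale.

```stata
* Sketch only: the model and all new names are illustrative assumptions
sysuse auto, clear
glm price weight length, family(gamma) link(log)

* Fitted means on the scale of the response (mu is the default after glm)
predict double pricehat if e(sample), mu
corr price pricehat if e(sample)
scalar r2_mu = r(rho)^2
di "squared correlation, response scale:   " r2_mu

* Linear predictor: on the log scale here, so generally a different answer
predict double pricexb if e(sample), xb
corr price pricexb if e(sample)
scalar r2_xb = r(rho)^2
di "squared correlation, linear predictor: " r2_xb
```

The first figure is the one this FAQ has in mind; the second is shown only to make the contrast visible.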
This way of doing things opens up some other elementary possibilities, which
become obvious when pointed out but are often overlooked. You can now get a
basic graph of observed versus predicted responses, such as
. twoway scatter weight weightp || function y = x, ra(weightp) clpat(dash)
Sometimes this graph makes it clearer why you got a surprising value of
R-squared. Similarly, you could calculate residuals and plot them against
the predicted responses. Such graphs can always be drawn, whatever the
complexities of the model, and they can be useful.
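A minimal sketch of the residual plot just mentioned, continuing the regress example. The variable name resid is an assumption made for illustration; the residual is computed by hand as observed minus predicted, so the same two lines would serve after any command whose predictions are on the scale of the response.

```stata
* Sketch only: weightp and resid are illustrative variable names
sysuse auto, clear
regress weight length
predict double weightp if e(sample)

* Residual = observed minus predicted, for the estimation sample only
generate double resid = weight - weightp if e(sample)

* Residuals versus predicted, with a reference line at zero
twoway scatter resid weightp, yline(0)
```

Computing residuals by hand avoids relying on a residuals option of predict, which not every estimation command offers.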
5. Positive virtues
It may be worth reminding ourselves of some positive virtues of
R-squared (or R). In particular, Zheng and Agresti (2000)
discuss the correlation between the response and the fitted response as a
general measure of predictive power for generalized linear models (GLMs).
Some of their arguments carry over to other classes of models. This
measure has the advantage of referring to the original scale of measurement,
of applying to all types of GLMs, and of being familiar to many users of
statistics. Preferably, it should be used as a comparative measure for
different models applied to the same dataset, given that restrictions on
values of the response may imply limitations on its value (e.g., Cox and
Wermuth 1992).
For an arbitrary GLM, this correlation is invariant under a location-scale
transformation. It is the positive square root of the average proportion of
variance explained by the predictors. However, again for an arbitrary GLM,
it need not equal the positive square root of other definitions of
R-squared (as will be discussed in a moment); and it need not be
monotone increasing in the complexity of the predictors, although in
practice that is common. The correlation is necessarily sensitive to
outliers.
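The invariance under location-scale transformation can be checked directly. The sketch below, again using the regress example, re-expresses the observed weights in kilograms and applies the same rescaling plus an arbitrary shift of 100 to the predictions; the shift, and the names weight_kg, weightp_alt, R_lb, and R_alt, are assumptions made purely for illustration.

```stata
* Sketch only: the shift of 100 and all new names are illustrative assumptions
sysuse auto, clear
regress weight length
predict double weightp if e(sample)
corr weight weightp if e(sample)
scalar R_lb = r(rho)

* Rescale observed (lb to kg) and shift and rescale predicted
generate double weight_kg   = 0.45359237 * weight
generate double weightp_alt = 100 + 0.45359237 * weightp
corr weight_kg weightp_alt if e(sample)
scalar R_alt = r(rho)

* The two correlations should agree up to rounding error
di R_lb " " R_alt
```

The agreement depends on both scale factors being positive; a negative factor on one side would flip the sign of the correlation.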
6. Beware varieties of (pseudo) R-squared
For many models, especially those with categorical responses, there are
frequently several different supposed approximations or analogues to
R-squared. Often they are labeled “pseudo”. Beware that
they typically do not agree, even roughly. You need to look at the
literature in your field and to realize that software and papers may often
be unclear about precisely what was calculated. Long and Freese (2003,
91–94) and Hardin and Hilbe (2001, 45–49) are excellent sources
of guidance on the animals in the zoo.
Thus, if you do this after logit, you will find that the squared correlation
between observed and predicted is not what logit reports as
pseudo–R-squared (the formula for pseudo–R-squared is documented in
[R] maximize).
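The comparison can be sketched as follows. The choice of predictors and the names phat, r2_corr, and r2_pseudo are assumptions made for illustration; e(r2_p) is the stored result in which logit leaves its pseudo-R-squared.

```stata
* Sketch only: the model and all new names are illustrative assumptions
sysuse auto, clear
logit foreign weight mpg

* Predicted probabilities are on the scale of the 0/1 response
predict double phat if e(sample), pr
corr foreign phat if e(sample)
scalar r2_corr   = r(rho)^2
scalar r2_pseudo = e(r2_p)

di "squared correlation:         " r2_corr
di "pseudo-R-squared from logit: " r2_pseudo
```

The two figures answer different questions, so neither should be read as a check on the other.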
7. A single figure of merit only
Even if you now have an R-squared, it is only a single figure of
merit. Resist the temptation to use it as a weapon or as a comforter. Your
R-squared may be high because your model codifies tautology or
truism. Predicting today's temperature from yesterday's temperature would
get you a high R-squared and might be a practical model for some
purposes, but it is not a contribution to science at this time.
Alternatively, your R-squared may be low, yet no indictment of your
model, if the field is refractory and your dataset is problematic. Because
R-squared in linear regression never decreases as you add covariates
(predictors), a high R-squared may go with a model that on scientific or
statistical grounds has too many covariates.
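A small sketch of the last point for linear regression: adding a covariate that is pure noise cannot lower R-squared. The variable noise, the seed, and the scalar names are assumptions made for illustration.

```stata
* Sketch only: noise, the seed, and the scalar names are illustrative assumptions
sysuse auto, clear
set seed 2003
generate double noise = rnormal()

* Baseline model
regress weight length
scalar r2_small = e(r2)

* Same model plus a covariate unrelated to the response by construction
regress weight length noise
scalar r2_big = e(r2)

di "R-squared without noise: " r2_small
di "R-squared with noise:    " r2_big
```

Any increase here reflects nothing but chance association in the sample.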
There is likely to be a great deal of information about the limitations of
the model, with implications for how it can be improved, in the detailed
estimation results and residuals you can usually get from Stata, including
graphical as well as numeric output. There is almost no such information in
an R-squared.
8. A descriptive measure only
Even if you now have an R-squared, it is at best a descriptive
measure. It considers only the information on which it is based, no more and
no less, and says nothing about the structure of the data in any sense
(e.g., dependence or cluster structure). If you attempt to make inferences
based on R-squared, or on R, they may be highly fragile,
unless somehow they respect the character of the model. This applies to
bootstrap and jackknife work as well.
References
- Cox, D. R. and N. Wermuth. 1992. A comment on the coefficient of
determination for binary responses. American Statistician 46: 1–4.
- Hardin, J. and J. Hilbe. 2001. Generalized Linear Models and
Extensions. College Station, TX: Stata Press.
- Long, J. S. and J. Freese. 2003. Regression Models for Categorical
Dependent Variables Using Stata, Revised Edition. College Station, TX:
Stata Press.
- Zheng, B. and A. Agresti. 2000. Summarizing the predictive power of a
generalized linear model. Statistics in Medicine 19: 1771–1781.