Richard Williams wrote:
> At 10:34 AM 2/20/2007, Ulrich Kohler wrote:
>>However, as an aside: I do not find the arguments for the adjusted R2 very
>>convincing. It is sometimes said that you have to be punished for
>> including additional variables in a model. But why? Because the R2
>> increases? Why do I need to be punished for this? It is just a simple
>> fact that I can explain more variance with an additional variable.
>> Punishment and especially the
>
> I don't think "punishment" is the original rationale for adjusted
> R^2, although that is often cited as one of its benefits. Rather,
> R^2 is biased upwards, especially in small samples. Adjusted R^2
> corrects for that.
>
> McClendon discusses this in "Multiple Regression and Causal
> Analysis", 1994, pp. 81-82.
>
> Basically he says that sampling error will always cause R^2 to be
> greater than zero, i.e. even if no variable has an effect R^2 will be
> positive in a sample. When there are no effects, across multiple
> samples you will see estimated coefficients sometimes positive,
> sometimes negative, but either way you are going to get a non-zero
> positive R^2. Further, when there are many Xs for a given sample
> size, there is more opportunity for R^2 to increase by chance.
>
> So, adjusted R^2 wasn't primarily designed to "punish" you for
> mindlessly including extraneous variables (although it has that
> effect), it was just meant to correct for the inherent upward bias in
> regular R^2.

Thank you, Richard, for this clarification. I wasn't aware of this. Obviously
my critique was overstated ("metaphysics"). The reason for my furor is that I
have seen so many students who build models simply by adding variables that
increase the adjusted R2, believing that they end up with a model that holds
only "important" variables. I think this is a misunderstanding of what models
are about.
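
The upward bias Richard describes is easy to see in a small simulation. The
following is only a rough sketch in Python with NumPy, not something taken
from McClendon: the sample size (n=30), the number of regressors (k=5), the
number of replications and the function names are all my own assumptions.
y is generated independently of every X, so the true R2 is zero.

```python
import numpy as np

def r2_and_adjusted(y, X):
    """Fit OLS with an intercept and return (R2, adjusted R2)."""
    n, k = X.shape
    Z = np.column_stack([np.ones(n), X])          # add the constant
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)  # least-squares fit
    resid = y - Z @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    # Usual adjustment: 1 - (1 - R2) * (n - 1) / (n - k - 1)
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    return r2, adj

def simulate_null(n=30, k=5, reps=2000, seed=0):
    """Draw y independently of X (no effects at all) and collect fit measures."""
    rng = np.random.default_rng(seed)
    results = np.array([
        r2_and_adjusted(rng.normal(size=n), rng.normal(size=(n, k)))
        for _ in range(reps)
    ])
    return results  # column 0: R2, column 1: adjusted R2

results = simulate_null()
mean_r2, mean_adj = results.mean(axis=0)
print(f"mean R2 = {mean_r2:.3f}, mean adjusted R2 = {mean_adj:.3f}")
```

Under these assumptions every single sample has a positive R2, and its average
should be close to k/(n-1), while the adjusted R2 should average close to
zero. Note that this only shows the bias correction; it says nothing about
whether any particular variable belongs in the model.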

I am interested in the "causal" effect of a "key causal variable". Therefore I
must not include an independent variable in my model that is itself caused
by the key causal variable, but I must include all variables in the model
that are causes of the "key causal variable". However, sometimes I include a
specific variable that depends on the key causal variable in a second model.
In this case I look at the change in the key causal variable's effect, hoping
to learn something about the mechanisms through which the key causal variable
affects the dependent variable.
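
To make the two-model strategy concrete, here is a minimal sketch with
simulated data, again in Python with NumPy. Everything in it is hypothetical:
the variable names (c as a cause of the key variable x, m as a variable that
depends on x), the coefficient values, and the additional assumption that c
also affects y directly.

```python
import numpy as np

def ols_coefs(y, *cols):
    """OLS with an intercept; returns [constant, b1, b2, ...]."""
    Z = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta

rng = np.random.default_rng(1)
n = 20000

c = rng.normal(size=n)                  # a cause of the key causal variable
x = 0.8 * c + rng.normal(size=n)        # key causal variable
m = 0.5 * x + rng.normal(size=n)        # depends on x (a possible mechanism)
y = 1.0 * x + 2.0 * m + 1.5 * c + rng.normal(size=n)

# First model: control for the cause of x, but not for what x itself causes.
b_total = ols_coefs(y, x, c)[1]         # total effect, here 1.0 + 2.0 * 0.5

# Second model: add the variable that depends on x.
b_direct = ols_coefs(y, x, c, m)[1]     # what is left after holding m fixed

# For comparison: leaving out c mixes its influence into the effect of x.
b_naive = ols_coefs(y, x)[1]

print(f"model 1: {b_total:.2f}  model 2: {b_direct:.2f}  without c: {b_naive:.2f}")
```

With these made-up values the coefficient of x should be roughly 2 in the
first model and roughly 1 in the second; the drop is the part of the effect
that runs through m. None of these decisions used a fit statistic; they follow
from the assumed causal order alone.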

All this happens without reference to the model fit. The whole process is
controlled by *hypotheses* about the causal order between variables. Building
models by looking at the model fit shifts the attention away to a different
sort of reasoning. It is the sort of reasoning that leads directly to
automatic model building strategies ("stepwise" and the like), and
somehow the arguments for the adjusted R2 lean just a little in
this direction.

My reasoning here is very much based on estimating the size of a "causal
effect". I am fully aware of the problems of estimating causal effects with
regression models (and related techniques). But despite all these critiques,
I think that this framework has no alternative as a guiding idea for the
model building strategy.

I hope I didn't dwell too much on the obvious.

Many regards
Uli
--
Ulrich Kohler
kohler@wzb.eu
030/25491-361