
Notice: On March 31, it was announced that Statalist is moving from an email list to a forum. The old list will shut down at the end of May, and its replacement, statalist.org is already up and running.



RE: Re: st: RE: regression r(103): too many variables


From   Paul Higgins <pahiggins@LRCA.com>
To   "'statalist@hsphsun2.harvard.edu'" <statalist@hsphsun2.harvard.edu>
Subject   RE: Re: st: RE: regression r(103): too many variables
Date   Fri, 26 Feb 2010 12:41:50 -0600

I'm not going to continue this colloquy much longer.  But I thought I'd reply in a general way, not just to Steve but to others who commented to me (some privately) that they too had encountered other, similar "rules" about how much data is necessary before one can properly fit a given regression, how correlated the data can or cannot be, and so on.

Rules of thumb often do have some value, which isn't surprising considering that they distill the practical experiences of a lot of people doing similar types of things over time.  So I don't mean to denigrate them.  That said, we need to recognize that regression analysis isn't magic: there is no high priesthood (sorry guys!) and only a very few hard-and-fast rules about what is the "right" or the "wrong" way to proceed.  I'm going to paraphrase the late, great Arthur S. Goldberger here, who as usual had some very sensible things to say on this subject:

The conditional expectation function (CEF) is the key feature of a multivariate population for any analyst wanting to study relationships among groups of variables.  The CEF describes how the average value of one variable varies with values of the other variables in the population.  Another key feature of a multivariate population is the linear projection, or best linear predictor (BLP): it provides the best linear approximation to the CEF.  Alternative regression models arise according to the sampling scheme used to get sample draws from the population.

Regardless of whether a regression specification is "right" or "wrong", least-squares regression will typically estimate something useful about the population in question, namely the BLP.  Instead of emphasizing the bias, inconsistency, or inefficiency of least-squares, one can consider whether or not the population feature that it -does- consistently estimate is an interesting one.
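
To make the BLP point concrete, here is a minimal sketch in plain Python (not Stata).  Everything in it is an assumption chosen for illustration: the population (x uniform on 0, 1, 2, 3), the nonlinear CEF E[y|x] = x^2, the noise, the seed, and the helper names.  A straight-line specification is "wrong" for this population, yet least squares on a large sample should land close to the population BLP, which here is -1 + 3x.

```python
import random


def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    slope = cov_xy / var_x
    return mean_y - slope * mean_x, slope


def cef(x):
    """Hypothetical nonlinear conditional expectation function E[y|x]."""
    return x * x


# Hypothetical population: x takes each value in `support` with equal probability.
support = [0, 1, 2, 3]

# Population BLP: with equal weights, the least-squares line through the
# points (x, E[y|x]) uses exactly the population moments.
blp_intercept, blp_slope = least_squares_line(support, [cef(x) for x in support])

# A large random sample: y = E[y|x] + noise (noise level and seed are arbitrary).
rng = random.Random(12345)
sample_x = [rng.choice(support) for _ in range(20000)]
sample_y = [cef(x) + rng.gauss(0.0, 1.0) for x in sample_x]
ols_intercept, ols_slope = least_squares_line(sample_x, sample_y)

print("population BLP:", blp_intercept, blp_slope)
print("sample OLS:    ", ols_intercept, ols_slope)
```

The fitted line does not recover the CEF (no straight line can), but it does track the best linear approximation to it, which is the sense in which least squares "estimates something useful."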

And that is about all I will say about that.

Cheers, 

Paul

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of sjsamuels@gmail.com
Sent: Thursday, February 25, 2010 3:14 PM
To: statalist@hsphsun2.harvard.edu
Subject: Re: Re: st: RE: regression r(103): too many variables

I'm not expert in this area, but I think that the issue is not only
degrees of freedom per se, but the amount of information per
predictor.  See references in Frank Harrell, Regression modeling
strategies: with applications to linear models, logistic regression
and survival analysis. New York: Springer; 2001, p. 61, and Green SB.
How many subjects does it take to do a regression analysis? Multivar
Behav Res 1991; 26: 499-510.

The formulas that illustrate your point about degrees of freedom
assume that the standard deviations of the error terms are identical
and that the error terms are uncorrelated. Violations of the first
assumption can be addressed with the "robust" option of -regress-.  I
think that with sequential data, violation of the second assumption
should be of special concern.  See: "The Problem Of Unsuspected Serial
Correlations, Or The Violation Of The Independence Assumption", p.
387. F. R. Hampel and E. M. Ronchetti and P. J. Rousseeuw and W. A.
Stahel (1986) Robust Statistics: The Approach Based on Influence
Functions, Wiley, NY

Good luck!

Steve


On Wed, Feb 24, 2010 at 5:42 PM, Paul Higgins <pahiggins@lrca.com> wrote:
> Steve, I have to disagree with you about your "rule of thumb."
>
> One nice thing about regression analysis is that it generates its own diagnostic statistics that indicate whether or not a model was estimated using "too few observations."  The error degrees of freedom (EDF), which is just a fancy name for the number of observations minus the number of estimated parameters in a model, is used to standardize most of the statistics we use to assess our models.  I will happily stipulate that the fewer the degrees of freedom, the harder it becomes to make meaningful inferences, ceteris paribus.  But to my knowledge there is no general rule of the sort you stated.
>
> To make my point more specific, consider the standard error of the regression: SER = sqrt(e'e/EDF).  Its square figures into, for example, the estimated variance-covariance matrix of the least-squares vector: Est.Var[b] = SER^2 * inv(X'X).  Since they are the square roots of the values found along the main diagonal of that matrix, the standard errors of the individual coefficients, and thus the associated t statistics, are functions of the SER, too.  (So, everything else equal, as EDF falls, so too do the model's t statistics.)  Similarly, EDF also finds its way into the F statistics used for making inferences involving linear combinations of parameters.  (So, the lower is EDF, the smaller the F statistics will be, everything else equal.)  And so on.
>
> This argument does not apply to all diagnostic statistics (the unadjusted R-squared comes to mind).  But it is true for most of them.
>
> Paul
>
> P.S.: One of the regressions I ran using code of the form I shared with this list had EDF equal to 13800 - 2500 = 11300 (in round numbers): the ratio obs/coeffs was roughly 5.5.  And my t statistics and F statistics punished me for it to an extent.  But as long as I proceed with a full understanding of all of the above, there is no obvious reason -not- to perform the analysis, assuming I have theoretical reasons for specifying the model in this way.  Saying so simply acknowledges the applied statistician's dilemma: to make the most of limited resources.
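
The degrees-of-freedom argument in the quoted message can be sketched numerically.  This is a rough illustration in plain Python, not Stata output: the observation and parameter counts (13800 and 2500) come from the P.S. above, while the coefficient, the residual sum of squares, the diagonal entry of inv(X'X), the 100-parameter comparison model, and the function names are all made-up assumptions.  Holding those inputs fixed stands in for "everything else equal"; in a real regression, adding regressors would change them as well.

```python
import math


def error_variance(ssr, n_obs, n_params):
    """Estimated error variance e'e / EDF, where EDF = n_obs - n_params."""
    edf = n_obs - n_params
    if edf <= 0:
        raise ValueError("need more observations than estimated parameters")
    return ssr / edf


def t_statistic(coef, ssr, n_obs, n_params, xtx_inv_jj):
    """t statistic for one coefficient, given its diagonal entry of inv(X'X)."""
    std_err = math.sqrt(error_variance(ssr, n_obs, n_params) * xtx_inv_jj)
    return coef / std_err


N_OBS = 13800          # from the P.S. (round numbers)
N_PARAMS_FULL = 2500   # from the P.S. (round numbers)
N_PARAMS_LEAN = 100    # hypothetical smaller model, for comparison only

# Hypothetical values, held fixed across both models ("everything else equal").
COEF = 0.5
SSR = 13800.0
XTX_INV_JJ = 0.001

t_lean = t_statistic(COEF, SSR, N_OBS, N_PARAMS_LEAN, XTX_INV_JJ)
t_full = t_statistic(COEF, SSR, N_OBS, N_PARAMS_FULL, XTX_INV_JJ)

print("obs per coefficient:", N_OBS / N_PARAMS_FULL)
print("t with 100 parameters: ", t_lean)
print("t with 2500 parameters:", t_full)
```

Under these assumptions the t statistic shrinks in proportion to the square root of EDF, so going from 13700 to 11300 degrees of freedom costs roughly 9 percent: a penalty, but not one that by itself rules out the analysis.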
>
> -----Original Message-----
> From: owner-statalist@hsphsun2.harvard.edu [mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of sjsamuels@gmail.com
> Sent: Wednesday, February 24, 2010 3:43 PM
> To: statalist@hsphsun2.harvard.edu
> Subject: [SUSPECT] Re: st: RE: regression r(103): too many variables
> Importance: Low
>
> Now that you've figured out what caused the error message, perhaps you
> should reconsider your proposed analysis.  You have too few
> observations to fit 2500 predictors.  The rule of thumb, I believe, is
> that the ratio of observations to coefficients should be greater than
> 10:1.
>
> Steve
>
> On Wed, Feb 24, 2010 at 8:01 AM, Paul Higgins <pahiggins@lrca.com> wrote:
>> Hi all,
>>
>> Thanks for all of your suggestions: they were a big help.  My code contained an error that is probably a classic newbie misstep: misusing hyphens when making lists of variables.  The rhs of my regression contained thousands of interactions between sets of dummy variables (96 dummies representing quarter-hour time increments interacted with 22 date values of special import for the problem I was investigating, yielding a total of 2112 altogether just for that one pair of variables).  To construct these, I used code of the following form:
>>
. . . <snip> . . .


