Re: Re: st: RE: regression r(103): too many variables

From	[email protected]
To	[email protected]
Subject	Re: Re: st: RE: regression r(103): too many variables
Date	Thu, 25 Feb 2010 16:13:49 -0500

I'm not an expert in this area, but I think that the issue is not only
degrees of freedom per se, but the amount of information per
predictor.  See the references in Frank Harrell, Regression Modeling
Strategies: With Applications to Linear Models, Logistic Regression,
and Survival Analysis. New York: Springer; 2001, p. 61, and Green SB.
How many subjects does it take to do a regression analysis? Multivar
Behav Res 1991; 26: 499–510.


The formulas that illustrate your point about degrees of freedom
assume that the standard deviations of the error terms are identical
and that the error terms are uncorrelated. Violations of the first
assumption can be addressed with the "robust" option of -regress-.  I
think that with sequential data, violation of the second assumption
should be of special concern.  See "The Problem of Unsuspected Serial
Correlations, or the Violation of the Independence Assumption", p.
387, in F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and W. A.
Stahel (1986), Robust Statistics: The Approach Based on Influence
Functions, Wiley, New York.
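
A minimal sketch of the two checks mentioned above, run on simulated data. The variable names, the AR(1) coefficient of 0.7, and the lag choices are illustrative assumptions, not values from this thread; -newey- is shown only as one possible response to serial correlation.

```stata
* Simulated series with AR(1) errors (all names and values are assumptions)
clear
set seed 2010
set obs 500
gen t = _n
tsset t
gen x = rnormal()
gen u = rnormal()
* replace works through the observations in order, so u[_n-1] is already updated
replace u = 0.7*u[_n-1] + rnormal() in 2/l
gen y = 1 + 0.5*x + u

* Unequal error variances: heteroskedasticity-robust standard errors
regress y x, vce(robust)

* Correlated errors: Breusch-Godfrey test for serial correlation after plain OLS
regress y x
estat bgodfrey, lags(1)

* One option if serial correlation is present: Newey-West standard errors
* (the lag length of 4 is an arbitrary choice for this sketch)
newey y x, lag(4)
```

Comparing the standard errors from the plain -regress- with those from -newey- gives a rough sense of how much the independence assumption matters for a given data set.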

Good luck!

Steve


On Wed, Feb 24, 2010 at 5:42 PM, Paul Higgins <[email protected]> wrote:
> Steve, I have to disagree with you about your "rule of thumb."
>
> One nice thing about regression analysis is that it generates its own diagnostic statistics that indicate whether a model was estimated using "too few observations."  The error degrees of freedom (EDF), which is just a fancy name for the number of observations minus the number of estimated parameters in a model, is used to standardize most of the statistics we use to assess our models.  I will happily stipulate that the fewer the degrees of freedom, the harder it becomes to make meaningful inferences, ceteris paribus.  But to my knowledge there is no general rule of the sort you stated.
>
> To make my point more specific, consider the standard error of the regression: SER = sqrt(e'e/EDF).  The SER figures into, for example, the estimated variance-covariance matrix of the least-squares vector: Est.Var[b] = SER^2 * inv(X'X).  Since they are the square roots of the values found along the main diagonal of that matrix, the standard errors of the individual coefficients, and thus the associated t statistics, are functions of the SER, too.  (So, everything else equal, as EDF falls, the SER rises and the model's t statistics fall.)  Similarly, EDF also finds its way into the F statistics used for making inferences involving linear combinations of parameters.  (So, the lower the EDF, the smaller the F statistics will be, everything else equal.)  And so on.
>
> This argument does not apply to all diagnostic statistics (the unadjusted R-squared comes to mind).  But it is true for most of them.
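
The relationship between EDF, the error variance, and the coefficient standard errors can be checked directly from the results that -regress- stores. This is a minimal sketch on simulated data; the variable names, sample size, and coefficients are assumptions made for illustration.

```stata
* Simulated data (names and sizes are illustrative assumptions)
clear
set seed 12345
set obs 200
gen x1 = rnormal()
gen x2 = rnormal()
gen y  = 1 + 0.5*x1 - 0.25*x2 + rnormal()

regress y x1 x2

* Error degrees of freedom = observations - estimated parameters
display "EDF = " e(df_r)

* e'e/EDF is the estimated error variance; regress stores e'e as e(rss)
* and the standard error of the regression as e(rmse)
scalar s2 = e(rss)/e(df_r)
display "e'e/EDF = " s2 "   e(rmse)^2 = " e(rmse)^2

* Coefficient standard errors are the square roots of the diagonal of
* the estimated variance-covariance matrix, stored as e(V)
matrix V = e(V)
display "se(x1) from e(V) = " sqrt(V[1,1]) "   _se[x1] = " _se[x1]
```

Rerunning the sketch with more regressors and the same number of observations lowers e(df_r), which is the mechanism described above.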
>
> Paul
>
> P.S.: One of the regressions I ran using code of the form I shared with this list had EDF equal to 13800 - 2500 = 11300 (in round numbers): the ratio obs/coeffs was roughly 5.5.  And my t statistics and F statistics punished me for it to an extent.  But as long as I proceed with a full understanding of all of the above, there is no obvious reason -not- to perform the analysis, assuming I have theoretical reasons for specifying the model in this way.  Saying so simply acknowledges the applied statistician's dilemma: to make the most of limited resources.
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of [email protected]
> Sent: Wednesday, February 24, 2010 3:43 PM
> To: [email protected]
> Subject: [SUSPECT] Re: st: RE: regression r(103): too many variables
> Importance: Low
>
> Now that you've figured out what caused the error message, perhaps you
> should reconsider your proposed analysis.  You have too few
> observations to fit 2500 predictors.  The rule of thumb, I believe, is
> that the ratio of observations to coefficients should be greater than
> 10:1.
>
> Steve
>
> On Wed, Feb 24, 2010 at 8:01 AM, Paul Higgins <[email protected]> wrote:
>> Hi all,
>>
>> Thanks for all of your suggestions: they were a big help.  My code contained an error that is probably a classic newbie misstep: misusing hyphens when making lists of variables.  The rhs of my regression contained thousands of interactions between sets of dummy variables (96 dummies representing quarter-hour time increments interacted with 22 date values of special import for the problem I was investigating, yielding a total of 2112 altogether just for that one pair of variables).  To construct these, I used code of the following form:
>>
>> /*****************************/
>> /* generate separate dummies */
>> /* for each event date       */
>> /*****************************/
>>
>> #delimit ;
>> local eventdates "mdy(1,13,2009)  mdy(2,20,2009)  mdy(3,27,2009)
>>                  mdy(4,10,2009)  mdy(4,17,2009)  mdy(5,18,2009)
>>                  mdy(5,23,2009)  mdy(5,24,2009)  mdy(6,30,2009)
>>                  mdy(7,1,2009)   mdy(7,9,2009)   mdy(8,14,2009)
>>                  mdy(8,15,2009)  mdy(9,16,2009)  mdy(9,18,2009)
>>                  mdy(9,19,2009)  mdy(10,3,2009)  mdy(11,2,2009)
>>                  mdy(11,3,2009)  mdy(12,7,2009)  mdy(12,8,2009)
>>                  mdy(12,9,2009)";
>> #delimit cr
>> local c = 1
>> foreach x of local eventdates {
>>      gen byte dum_`c' = (dt==`x')
>>      local c = `c' + 1
>>      }
>>
>> /************************************/
>> /* interact each event date dummy w/*/
>> /* each quarter-hour interval dummy */
>> /************************************/
>>
>> forvalues x = 1/96 {
>>        forvalues y = 1/22 {
>>                gen byte dum_`y'_int_`x' = dum_`y'*int_`x'
>>                }
>>        }
>>
>> Due to the order I used to nest the two loops, the variables weren't created in the same sequence as that assumed by my hyphenated lists in my regress statement.  I am a recent arrival in Stata-world (having been born in SAS-land, and having emigrated here via several other intermediate stops along the way), and in most other stats programs I've worked with, a single hyphen in a list of this type (i.e., dum_1_int_1-dum_1_int_96) would be expanded out in logical sequential fashion (i.e., dum_1_int_1 dum_1_int_2 ...).  But Stata expanded it out in the physical order in which the variables appeared in the data set (i.e., dum_1_int_1 dum_2_int_1 ...).  Thus, my regressions contained far more than 2500 rhs variables -- mostly redundant ones!  Once I replaced the hyphenated lists in the regress statement with wild-card versions (e.g., dum_1_int_*), all was well.
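
The difference between the two kinds of variable list can be reproduced on a toy data set. This is a minimal sketch assuming 3 event-date dummies and 4 interval dummies (hypothetical sizes chosen to keep the output short), created with the same loop nesting as in the code above.

```stata
* Toy version of the nested loops: interval index outside, date index inside
clear
set obs 10
forvalues x = 1/4 {
    forvalues y = 1/3 {
        gen byte dum_`y'_int_`x' = 0
    }
}

* A hyphenated range follows the order of variables in the data set,
* so it picks up every variable stored between the two endpoints
unab hyphen : dum_1_int_1-dum_1_int_4
local n_hyphen : word count `hyphen'
display "hyphenated range: `n_hyphen' variables"

* A wildcard matches on variable names only
unab wild : dum_1_int_*
local n_wild : word count `wild'
display "wildcard:         `n_wild' variables"
```

Counting the expanded list with -unab- before calling -regress- is a quick way to confirm how many variables a list really contains. Swapping the loop nesting (date index outside, interval index inside) would also store the variables in the order the hyphenated range assumes.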
>>
>> Thanks again for your assistance.
>>
>> Paul H.
>>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of Martin Weiss
>> Sent: Wednesday, February 24, 2010 1:59 AM
>> To: [email protected]
>> Subject: AW: st: RE: regression r(103): too many variables
>>
>>
>> Andi may want to use
>>
>>
>> *************
>> des, short
>> *************
>>
>> to prevent clutter on his screen.
>>
>>
>> HTH
>> Martin
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of
>> [email protected]
>> Sent: Wednesday, 24 February 2010 06:13
>> To: [email protected]
>> Subject: Re: st: RE: regression r(103): too many variables
>>
>> Verify that you actually have 2500 variables, possibly by running
>> -des- on the variable list.
>>
>> Steve
>> --- Paul Higgins
>>> I am trying to use regress to run a linear regression.  The
>>> specification has a lot of rhs variables (around 2500), the
>>> majority of which are binary (0/1) variables.  <snip> I am
>>> getting r(103), "Too many variables specified".
>>
>>
>> On Tue, Feb 23, 2010 at 1:08 PM, Martin Weiss <[email protected]> wrote:
>>>
>>>
>>> This runs w/o a hitch in Stata 10.1 MP. Takes something like 2 minutes:
>>>
>>> *******
>>> clear*
>>> set mem 500m
>>> set obs 13700
>>>
>>> foreach var of newlist var1-var2500{
>>>                gen byte `var'=runiform()<.3
>>> }
>>>
>>> gen y=rnormal()
>>> reg y var1-var2500
>>> *******
>>>
>>>
>>> HTH
>>> Martin
>>>
>>>
>>> -----Original Message-----
>>> From: [email protected]
>>> [mailto:[email protected]] On Behalf Of Paul Higgins
>>> Sent: Tuesday, 23 February 2010 21:28
>>> To: '[email protected]'
>>> Subject: st: regression r(103): too many variables
>>>
>>> Hi all,
>>>
>>> I am trying to use regress to run a linear regression.  The specification
>>> has a lot of rhs variables (around 2500), the majority of which are binary
>>> (0/1) variables.  The data set contains about 13700 observations.  At the
>>> top of the .do file I set mem to 5 gigabytes, maxvar to 10000 and matsize to
>>> 10000.  I'm using Stata/SE 10.1 for Windows, under Windows XP Professional
>>> x64 edition version 5.2, on a machine that has 8 gigabytes of physical
>>> memory on-board.  I am getting r(103), "Too many variables specified".  I've
>>> poked around the documentation, and I can see no mention of any internal
>>> limits to the regress command regarding number of variables.  Thus, I have
>>> assumed that only the general limits for Stata SE apply: maximum of 32767
>>> variables, maximum matsize of 11000.  But I appear to be wrong.
>>>
>>> Suggestions, please?
>>>
>>> PaulH
>>>
>>> *
>>> *   For searches and help try:
>>> *   http://www.stata.com/help.cgi?search
>>> *   http://www.stata.com/support/statalist/faq
>>> *   http://www.ats.ucla.edu/stat/stata/
>>>
>>
>>
>>
>> --
>> Steven Samuels
>> [email protected]
>> 18 Cantine's Island
>> Saugerties NY 12477
>> USA
>> 845-246-0774
>>
>
>
>
> --
> Steven Samuels
> [email protected]
> 18 Cantine's Island
> Saugerties NY 12477
> USA
> 845-246-0774
>



-- 
Steven Samuels
[email protected]
18 Cantine's Island
Saugerties NY 12477
USA
845-246-0774
