Statalist



Re: st: Find all subsets of variables


From   Nick Cox <n.j.cox@stata.com>
To   statalist@hsphsun2.harvard.edu, icanette@stata.com
Subject   Re: st: Find all subsets of variables
Date   Thu, 25 Sep 2008 10:38:56 -0500

I agree with Alan and also with Tony, and disagree with Scott.
How is that possible, when Scott supports Tony?

Others in this thread have kindly recommended my -allpossible- and my -selectvars- from SSC. No one recommended my -tuples- from SSC, a lower-level tool that I prefer, although these programs do not tackle precisely the same problem.

The original stimulus for -allpossible- was the thesis problem of a Ph.D. student of mine, who was looking at the predictability of a ground-measured response from 6 LANDSAT spectral bands. Neighbouring bands not surprisingly are often highly correlated, and exploring the question thoroughly could be done by looking at all 2^6 - 1 subsets of predictors. That is 63, and manageable with the right tools.
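As a quick check on that arithmetic, here is a minimal Python sketch (not Stata, and not part of -allpossible-) that counts the non-empty subsets of 6 predictors by size. The function name subset_counts is hypothetical, chosen only for illustration.

```python
from math import comb

def subset_counts(n_predictors):
    """Return {size: number of subsets of that size} for sizes 1..n_predictors."""
    return {k: comb(n_predictors, k) for k in range(1, n_predictors + 1)}

counts = subset_counts(6)      # 6 predictors, as in the LANDSAT example
total = sum(counts.values())   # the sizes together give 2^6 - 1 subsets

print(counts)
print(total)
```

The per-size counts sum to 2^6 - 1 = 63, the figure quoted above.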

This limit of 6 predictors in my problem explains the limit built into -allpossible-, a program written for one project only.

In making the program public as something others might find useful too, I was very queasy given (1) the combinatorial explosion of possibilities and (2) the predilection of many to hope or believe that the best model can or should be found automatically. Although I dislike stepwise modelling for all the standard reasons, it seems to me that looking at all the possible models can be a reasonable thing to do in some problems.

The help file for -allpossible- carries this "Warning: This hot drink is hot" caveat:

"Naturally, this command does not purport to replace the detailed scrutiny of individual models or to offer an unproblematic way of finding "best" models. Its main use may lie in demonstrating that several models exist within many projects possessing roughly equal merit as measured by omnibus statistics."

When others asked similar questions I revisited the issue with -selectvars- and -tuples-.

Nick
n.j.cox@durham.ac.uk

Feiveson, Alan H. wrote:

One situation where you might want to consider all subsets (possibly of
a given size) is where you are trying to approximate a deterministic
function with as few terms as is "reasonable". In this case, there is no
"true" model or statistical inference to be made. For example, I may
have a table of values of predictors and a function of these predictors
obtained by some proprietary software and I am just trying to find a
cheap approximation to the function using a linear combination of a
small number of the predictors (or transformations of the predictors).

SR Millis wrote:

I agree.  I can't imagine why anyone would want to use all-subsets.
Bayesian model averaging may be another alternative worth considering.

Lachenbruch, Peter wrote:

I think the same problem exists - you get a billion-line output (with 50 vars and subset size of 10). I think SAS had something like this, but displayed only the 'best' one.
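To put a number on that, the following Python sketch (an illustration only, with variable names of my own choosing) counts the subsets of size 10 drawn from 50 variables, and all non-empty subsets of 50 variables.

```python
from math import comb

n_vars = 50        # number of candidate variables
subset_size = 10   # subset size mentioned above

n_size_10 = comb(n_vars, subset_size)  # subsets of exactly 10 variables
n_all = 2 ** n_vars - 1                # every non-empty subset of 50 variables

print(f"{n_size_10:,}")
print(f"{n_all:,}")
```

The first count is a little over 10 billion, so "a billion-line output" is, if anything, an understatement.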

This suggests to me a) know a lot about your data before doing this; b) look for small subsets; or c) use some sort of stepwise (and penalized) procedure (AIC or BIC or Mallows' Cp).

We're talking the art of statistical analysis now.

junin wrote:

I want to find all subsets of a given set of variables for model testing. As an example:

A set of variables var1 var2 var3 var4 should give me:
var1 var2 var3 var4
var1 var2 var3
var1 var2 var4
var1 var3 var4
var1 var4
var1
var2 var3 var4

and so forth.

I would like to test all possible model configurations. Is there a command in Stata that would be convenient to use?
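
For readers who want to see the enumeration itself, here is a minimal sketch in Python rather than Stata. Within Stata, the SSC programs named earlier in this thread are the tools to use; the function all_subsets below is hypothetical and only illustrates the idea of listing every non-empty subset, largest first.

```python
from itertools import combinations

def all_subsets(varnames):
    """Return every non-empty subset of varnames, largest subsets first."""
    subsets = []
    for size in range(len(varnames), 0, -1):
        # combinations() keeps the variables in their original order
        subsets.extend(combinations(varnames, size))
    return subsets

for subset in all_subsets(["var1", "var2", "var3", "var4"]):
    print(" ".join(subset))
```

For four variables this prints 2^4 - 1 = 15 lines, one candidate variable list per line.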