The Stata listserver

Re: st: ...stepwise regression...


From   Ronán Conroy <rconroy@rcsi.ie>
To   "statalist hsphsun2.harvard.edu" <statalist@hsphsun2.harvard.edu>
Subject   Re: st: ...stepwise regression...
Date   Wed, 14 Apr 2004 12:02:34 +0100

On 13/04/2004 10:42, fabiopericolini@fastwebnet.it wrote:

> I have read the FAQ on the Stata site. I share their (Harrell & Conroy)
> view on this point, but I have to build an explanatory model of a dependent
> variable and I have more than 500 independent variables. I have selected the
> most correlated variables, but I need a procedure to select the most
> explanatory variables.
> If I don't apply stepwise (sw) linear regression, I don't know how to select
> the independent variables. I have to do it.

If you have 500 predictor variables you are probably on a fishing
expedition. If so, you must think of some useful strategies for reducing
these to a shorter list of candidates.

Two tips. The first is that the 500 variables may represent a smaller number
of underlying dimensions. Try principal components or factor analysis to see
if you can identify dimensions that underlie your predictors, and use these
dimensions to create scales that combine predictor variables that seem to
carry similar information. This approach is useful because it makes you
think about the conceptual dimensions that underlie your choice of the 500
predictor variables, and combine only those predictors that it seems logical
and appropriate to combine.
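
A minimal sketch of this in Stata, assuming the predictors are stored in
order as x1-x500 (hypothetical names) and using current command names
(-screeplot- was -greigen- in older releases). The three items combined into
a scale at the end are invented for illustration; in practice the grouping
comes from reading the loadings.

```stata
* Principal components: how many dimensions carry most of the variance?
pca x1-x500
screeplot

* Principal-component factors with eigenvalue above 1, then a varimax
* rotation to make the loadings easier to read
factor x1-x500, pcf mineigen(1)
rotate, varimax

* Hypothetical: suppose x3, x17 and x42 load on the same factor.
* -alpha- reports scale reliability and saves the combined scale.
alpha x3 x17 x42, item generate(scale1)
```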

One interesting wrinkle on this method is to try -cluster- after doing your
factoring, to see if there are identifiable clusters of cases, suggesting
that the data are not smoothly distributed on the predictor variables, but
fall into constellations. Genetics people are fond of this approach, which
can be automated to a certain extent.
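
As a rough sketch of that wrinkle, again assuming predictors named x1-x500
(hypothetical), one could save factor scores and cluster the cases on them.
The number of factors, the number of clusters and the random seed below are
arbitrary choices for illustration, not recommendations.

```stata
* Keep a handful of factors and save their scores (the 5 is arbitrary)
factor x1-x500, pcf factors(5)
rotate, varimax
predict f1 f2 f3 f4 f5

* k-means on the factor scores; k(4) and the seed are illustrative choices
cluster kmeans f1 f2 f3 f4 f5, k(4) name(km4) generate(grp) start(krandom(12345))

* How big are the clusters, and how do they differ on the factors?
tabulate grp
tabstat f1 f2 f3 f4 f5, by(grp) statistics(mean)
```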

The second strategy is to look by brute force at every one of the 500
univariate regressions. Somebody has to, and I am reluctant to leave the job
to a machine which cannot read scatterplots. I like to have a look at splines
as well, to make sure the relationship isn't some odd shape. I would certainly
have a look at -rreg- if the number of cases is small, to see if influential
values are 'driving' the regression solution.
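
Something along these lines would do it, assuming an outcome called y and
predictors x1-x500 (all hypothetical names, with x17 standing in for
whichever predictor is under inspection). It is a sketch of the workflow
rather than a selection procedure: the loop only lists the results, and the
looking is still up to you.

```stata
* One regression per predictor, with slope and R-squared listed side by side
foreach v of varlist x1-x500 {
    quietly regress y `v'
    display "`v'" _col(15) %9.4f _b[`v'] _col(30) %6.3f e(r2)
}

* For any one predictor (x17 is a hypothetical example), look at the plot
twoway (scatter y x17) (lowess y x17)

* Restricted cubic spline with 4 knots: is the relationship an odd shape?
mkspline x17s = x17, cubic nknots(4)
regress y x17s1 x17s2 x17s3

* Robust regression: compare with -regress- to see whether a few
* influential cases are driving the fit
regress y x17
rreg y x17
```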

But there is no substitute for groping around and getting a feel for the
data. If there is something hidden in your 500 predictor variables, you are
the only person qualified to find it, because you understand the research
question that motivated the data collection. No statistical routine can do
that.  

Ronan M Conroy (rconroy@rcsi.ie)
Lecturer in Biostatistics
Royal College of Surgeons
Dublin 2, Ireland
+353 1 402 2431 (fax 2764)

--------------------
Just say no to drug reps
http://www.nofreelunch.org/
