From: jaweria seth <[email protected]>
To: [email protected]
Subject: Re: st: Looping over variables in more than one group
Date: Wed, 7 Mar 2012 12:08:36 -0600

Thanks guys,

I'm not sure where to go from here. I've tried many methods with this
regression model. Here's what I am trying to accomplish:

I have over 80 variables available to me, and as they are financial
metrics, most are highly and significantly correlated with one another.
I am trying to build a multivariate linear regression model that best
predicts my Y variable (profit, returns, etc.). Thinking about this
intuitively, the majority of the variables should work, but my issue
is: how do I choose which variables to include?

Any help would be appreciated.

Thanks,
j.seth

On Wed, Mar 7, 2012 at 11:02 AM, Joerg Luedicke <[email protected]> wrote:
> This is data mining and if you are interested in hypothesis testing
> your p-values will be of no use. To give just a simple example, based
> on your question: imagine you have 5 variables and each is supposed to
> measure the same thing, say x. Now you run 5 regressions for each of
> those variables and find that only one of them is "significant". Would
> you then conclude that x has a "significant" effect on y?
>
> Remember that the goal of statistical modeling is to provide useful
> information that cannot be obtained otherwise (or only with much
> higher costs). The goal is not to find p-values below some arbitrary
> threshold.
>
> J.
>
> On Wed, Mar 7, 2012 at 8:31 AM, jaweria seth <[email protected]> wrote:
>> Thanks J,
>> You are correct. In theory, I expect a 'production/size' variable to
>> significantly affect my dependent variable, however, I wanted to let
>> the regression spit out which of the variables in that category are
>> most significant (since they are somewhat similar). In that case, I am
>> looking to the t-statistics of the independent variables in the model.
>> Is that not correct?
>>
>>
>>
>> On Wed, Mar 7, 2012 at 10:00 AM, Joerg Luedicke
>> <[email protected]> wrote:
>>> You should probably rather think about what covariates make the most
>>> sense to include with respect to your theory and research question.
>>> Digging up variables to cook up good looking p-values and then
>>> interpreting these p-values in the usual way is a questionable
>>> endeavor, to say the least. However, if you are rather interested in
>>> something like a prediction model, and not in hypothesis testing, you
>>> could just use straight data mining techniques right away, for example
>>> boosted regression (-findit boost-).
>>>
>>> J.
>>>
>>> On Wed, Mar 7, 2012 at 7:12 AM, jaweria seth <[email protected]> wrote:
>>>> Thanks Nick,
>>>> I understand this would result in a large number of models.
>>>> However, I wouldn't be combining variables of the same category/group,
>>>> as this would bring up the issue of multicollinearity.
>>>> For example, I know for sure I need to add one variable each from
>>>> groups 1 and 2. Group 1 contains variables that measure the
>>>> size/production of a business, and I am wondering which of those
>>>> variables would be most significant in a multivariate model. I am
>>>> looking at t-stats in the regression output: if even one of the
>>>> variables included is not significant at the 10% level, that model
>>>> gets dropped (and as I'm running the regressions manually, I find
>>>> that the majority of the combos are not significant).
>>>>
>>>> Does this make sense? If so, how can I implement it?
>>>> The way I am doing it right now: holding one variable from group 2
>>>> constant and looping through group 1/size variables to find
>>>> significance. However, this gets tricky when I try to include a third
>>>> variable.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> On Wed, Mar 7, 2012 at 2:34 AM, Nick Cox <[email protected]> wrote:
>>>>> Before you even think of how to implement this, do the combinatorics
>>>>> of how many models this implies.
>>>>>
>>>>> So, for example,
>>>>>
>>>>> . di 30^4
>>>>> 810000
>>>>>
>>>>> . di 5^4
>>>>> 625
>>>>>
>>>>> Then bump up those numbers adding in the null choices, i.e. no
>>>>> variable from each group, as well.
>>>>>
>>>>> So you would need not only to do the looping but to ponder what it
>>>>> implies in terms of gathering results from thousands of models,
>>>>> finding the "best", whatever that means, including the implications
>>>>> for how you think about the resulting P-values, etc.
>>>>>
>>>>> Nick
>>>>>
>>>>> On Tue, Mar 6, 2012 at 10:01 PM, jaweria seth <[email protected]> wrote:
>>>>>
>>>>>> I would like to run regressions with up to 4 different variables. My
>>>>>> variables are separated into 4 groups with 5-30 variables in each
>>>>>> group. I would like to run regression combos of different variables to
>>>>>> find the best model:
>>>>>> How do I regress my y variable on 1 variable from group 1 and 1 from
>>>>>> group 2 and loop through different combos of each?
>>>>>> for ex:
>>>>>> regress Yvariable Group1 Group2
>>>>>>
>>>>>> Then I would like to add a variable from group 3, and so on..
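
The mechanics asked about at the bottom of the thread (one variable from group 1, one from group 2, every combination) can be sketched with nested -foreach- loops. This is only a minimal illustration for a do-file, not anyone's recommended procedure: it uses Stata's bundled auto data, the group contents are placeholder variables, and printing adjusted R-squared is an assumption made here for illustration rather than the t-statistic rule described in the thread.

```stata
* Hypothetical example on the auto data shipped with Stata.
* The two "groups" are placeholders, not the poster's variables.
sysuse auto, clear

local group1 weight length displacement
local group2 mpg gear_ratio turn

* Count the models before fitting any of them
local n1 : word count `group1'
local n2 : word count `group2'
local nmodels = `n1' * `n2'
display "models to fit: `nmodels'"

* One variable from each group per regression
foreach x1 of local group1 {
    foreach x2 of local group2 {
        quietly regress price `x1' `x2'
        display "`x1' `x2': adj. R2 = " %6.4f e(r2_a)
    }
}
```

Adding a third group means one more local macro and one more nested -foreach-, which multiplies the model count by the size of that group.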

