Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.

# Re: st: Looping over variables in more than one group

From: Joerg Luedicke
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Looping over variables in more than one group
Date: Wed, 7 Mar 2012 09:02:15 -0800

```
This is data mining, and if you are interested in hypothesis testing,
your p-values will be of no use. To give just a simple example, based
on your question: imagine you have 5 variables and each is supposed to
measure the same thing, say x. Now you run 5 regressions, one for each
of those variables, and find that only one of them is "significant". Would
you then conclude that x has a "significant" effect on y?

Remember that the goal of statistical modeling is to provide useful
information that cannot be obtained otherwise (or only with much
higher costs). The goal is not to find p-values below some arbitrary
threshold.

J.

On Wed, Mar 7, 2012 at 8:31 AM, jaweria seth <jaweriaseth@gmail.com> wrote:
> Thanks J,
> You are correct. In theory, I expect a 'production/size' variable to
> significantly affect my dependent variable; however, I wanted to let
> the regression spit out which of the variables in that category are
> most significant (since they are somewhat similar). In that case, I am
> looking at the t-statistics of the independent variables in the model.
> Is that not correct?
>
>
>
> On Wed, Mar 7, 2012 at 10:00 AM, Joerg Luedicke
> <joerg.luedicke@gmail.com> wrote:
>> You should probably rather think about what covariates make the most
>> sense to include with respect to your theory and research question.
>> Digging up variables to cook up good looking p-values and then
>> interpreting these p-values in the usual way is a questionable
>> endeavor, to say the least. However, if you are rather interested in
>> something like a prediction model, and not in hypothesis testing, you
>> could just use straight data mining techniques right away, for example
>> boosted regression (-findit boost-).
>>
>> J.
>>
>> On Wed, Mar 7, 2012 at 7:12 AM, jaweria seth <jaweriaseth@gmail.com> wrote:
>>> Thanks Nick,
>>> I understand this would result in a large number of models.
>>> However, I wouldn't be combining variables of the same category/group,
>>> as this would bring up the issue of multicollinearity.
>>> For example, I know for sure I need to add one variable each from
>>> groups 1 and 2. Group 1 contains variables that measure the
>>> size/production of a business, and I am wondering which of those
>>> variables would be most significant in a multivariate model. I am
>>> looking at t-stats in the regression output: if even one of the
>>> variables included is not significant at the 10% level, that model gets
>>> dropped (and as I'm running the regressions manually, I find that the
>>> majority of the combos are not significant).
>>>
>>> Does this make sense? If so, how can I implement it?
>>> The way I am doing it right now: holding one variable from group 2
>>> constant and looping through the group 1 (size) variables to find
>>> significance. However, this gets tricky when I try to include a third
>>> variable.
>>>
>>>
>>> Thanks,
>>>
>>> On Wed, Mar 7, 2012 at 2:34 AM, Nick Cox <njcoxstata@gmail.com> wrote:
>>>> Before you even think of how to implement this, do the combinatorics
>>>> of how many models this implies.
>>>>
>>>> So, for example,
>>>>
>>>> . di 30^4
>>>> 810000
>>>>
>>>> . di 5^4
>>>> 625
>>>>
>>>> Then bump up those numbers adding in the null choices, i.e. no
>>>> variable from each group, as well.
>>>>
>>>> So you would need not only to do the looping but to ponder what it
>>>> implies in terms of gathering results from thousands of models,
>>>> finding the "best", whatever that means, including the implications
>>>> for how you think about the resulting P-values, etc.
>>>>
>>>> Nick
>>>>
>>>> On Tue, Mar 6, 2012 at 10:01 PM, jaweria seth <jaweriaseth@gmail.com> wrote:
>>>>
>>>>> I would like to run regressions with up to 4 different variables. My
>>>>> variables are separated into 4 groups with 5-30 variables in each
>>>>> group. I would like to run regression combos of different variables to
>>>>> find the best model:
>>>>> How do I regress my y variable on 1 variable from group 1 and 1 from
>>>>> group 2 and loop through different combos of each?
>>>>> For example:
>>>>> regress Yvariable Group1 Group2
>>>>>
>>>>> Then I would like to add a variable from group 3, and so on.
>>>> *
>>>> *   For searches and help try:
>>>> *   http://www.stata.com/help.cgi?search
>>>> *   http://www.stata.com/support/statalist/faq
>>>> *   http://www.ats.ucla.edu/stat/stata/
>
>
>
> --
> Jaweria Seth
>

```
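For readers who only want the mechanics asked about in the original question, nested -foreach- loops over two local macros are one way to fit every pairing of one variable from each group. The following is a minimal sketch, not code posted by anyone in the thread: the auto dataset, the outcome `price`, and the membership of `group1` and `group2` are stand-ins chosen purely for illustration. As Joerg and Nick point out, p-values collected from a search like this cannot be interpreted in the usual way, so the sketch records R-squared only.

```stata
* Sketch only: the data set, outcome, and group memberships are assumptions
sysuse auto, clear

local group1 weight length displacement    // hypothetical "size" group
local group2 mpg gear_ratio                // hypothetical second group

* Do the combinatorics first, as Nick suggests
local n1 : word count `group1'
local n2 : word count `group2'
display "combinations: " `n1' * `n2'
display "with null choices: " (`n1' + 1) * (`n2' + 1)

* One regression per pairing of a group 1 and a group 2 variable
local nmodels = 0
foreach v1 of local group1 {
    foreach v2 of local group2 {
        quietly regress price `v1' `v2'
        local ++nmodels
        display "`v1' `v2'" _col(30) "R-sq = " %5.3f e(r2)
    }
}
display "models fitted: `nmodels'"
```

Adding a third group means a third nested -foreach-, and the model count multiplies by the size of that group, which is exactly the growth Nick's `di 30^4` illustrates.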