
# st: RE: FW: help on variable selection problem

From: "Sarah Edgington"; Subject: st: RE: FW: help on variable selection problem; Date: Fri, 10 Jun 2011 13:02:49 -0700

```
This doesn't seem to me to be a problem from the standpoint of analysis,
just interpretation.  A large sample size means that the estimates of
coefficients are more precise than they would be with a small sample.  No
matter what your sample size, though, statistical significance isn't
equivalent to substantive significance.  My recommendation would be to
specify a model that makes sense theoretically and then look at the results.
What's "important" for discussion of the results will depend somewhat on the
research question, but discussing what kind of effect the variables of
interest have seems to me to be what matters, regardless of sample size.
Statistical significance doesn't tell you whether an effect size is large
enough to be interesting.  It tells you whether a coefficient is estimated
precisely enough to be reasonably sure it isn't zero.  A precisely estimated
small effect is still a small effect.

-Sarah

-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of Lachenbruch,
Peter
Sent: Friday, June 10, 2011 12:40 PM
To: 'statalist@hsphsun2.harvard.edu'
Subject: st: FW: help on variable selection problem

This is not especially a Stata question, but it is driven by an analysis
issue...

A student is trying to analyze data from a national survey (no weights
needed).  She has 26 variables plus 10 years of data.  There are about
1,000,000 observations.  With this many observations, everything is
significantly different from 0.  She's using mlogit (predicting medical care
expenses), so she'd like to cut down the number of 'important' predictors.
I have thought of several options: backward stepwise  (not available with
mlogit); look at effect size and insist it be larger than 0.05 - again not
available since there are four categories of the response variable; use a
Bonferroni inequality on the coefficients and insist on a low p-value to
begin with - e.g. try for a size of 0.01 adjusting for 25 tests, so p must
be less than 0.0004.  The issue seems to be the huge sample size pushing
everything to significance.
Does anybody have any ideas?

Tony

Peter A. Lachenbruch
Department of Public Health
Oregon State University
Corvallis, OR 97330
Phone: 541-737-3832
FAX: 541-737-4001

*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

```
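The two numerical points in this thread can be checked with a little arithmetic. The sketch below is only an illustration, not the student's analysis: it swaps the mlogit model for a simple two-group comparison of means, and the effect size (0.02 standard deviations), the group sizes, and the function names are all assumptions made up for the example. It computes the Bonferroni threshold mentioned in the question (0.01 spread over 25 tests) and shows how the p-value for one fixed, small effect shrinks as the sample grows.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def z_for_mean_difference(diff, sd, n_per_group):
    """z statistic for a difference in means between two equal-sized groups
    that share a known standard deviation (a simplifying assumption)."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    return diff / standard_error


# Bonferroni threshold from the question: 0.01 / 25 = 0.0004.
alpha, n_tests = 0.01, 25
threshold = alpha / n_tests

# Hypothetical effect: a difference of 0.02 standard deviations.
effect, sd = 0.02, 1.0

for n in (1_000, 10_000, 100_000, 500_000):
    z = z_for_mean_difference(effect, sd, n)
    p = two_sided_p(z)
    print(f"n per group = {n:>7}  z = {z:6.2f}  p = {p:.2e}  "
          f"below Bonferroni threshold: {p < threshold}")
```

Under these assumptions the effect is the same in every row; only the precision changes. At the largest group sizes even the Bonferroni-adjusted threshold is passed, which is consistent with the reply's point that a stricter p-value cutoff does not by itself say whether an effect is large enough to matter.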