Re: st: A layman question on model building

From	"JVerkuilen (Gmail)" <[email protected]>
To	[email protected]
Subject	Re: st: A layman question on model building
Date	Thu, 7 Mar 2013 10:31:52 -0500

On Thu, Mar 7, 2013 at 2:07 AM, James Bernard <[email protected]> wrote:
> Hi all,
>
> I have a question which may sound too basic, but I wonder if anyone could help:
>
> We often add control variables that turn out to be insignificant. Does
> that mean that I can remove that variable from my model without being
> concerned with omitted variable bias?

This is far from a basic question in my view. I'll take a look at John
Antonakis' article as it looks interesting.

What I tell students in class is something I guess I'd call
semi-confirmatory model building. I've not really written this up so
I'd love to hear feedback from the list. This isn't really a strict
set of steps, but more of an organizational principle based on making
conscious judgements during analysis based on explicitly formulated
goals and minimizing post hoc adjustments. A good bit of this is out
there of course. Mostly I'm trying to write it down. You could call it
cognitive behavior therapy for statistical modeling, I guess, because
it's very consistent with how CBT works.

A. Choose the purpose of your modeling exercise. Decide what you
intend to focus on. For instance, what are the three key points you'd
like to be able to make when you are done? Unfortunately the
statistics literature is highly confusing on this point: many
techniques that are really about pure prediction get a lot of
attention even though that's often not what students want to do.
There's also a lot of misleading causal language floating around.
Avoid mission creep as much as you can by having a mission.

B. Choose your variables a priori as much as possible and, insofar as
you can, set aside a randomly selected one-third holdout sample. (The
same applies to other modeling decisions, such as model type, but I'm
largely assuming linear regression here.)

C. Partition the variables into two sets: (1) Controls are the ones
that "should" be in the model but which you don't really care about.
(2) Substantive variables are the ones you do care about. You may have
a rank ordering within these sets. This exercise may be difficult, but
it's worth doing because it elicits your preference ordering.

D. Do an initial examination of the data for outliers and other gross
problems using graphical tools such as QQ plots, scatterplots, loess
lines, density estimates, etc.

E. Fit models with Controls only and Controls+Substantive variables.
Compare these using global measures such as R^2 change and any
local, more focused approaches that are relevant to your question.

F. Run diagnostics and refine the model. Look for outliers, needed
transformations, variables that should be added, etc. Minimize
arbitrary changes such as dropping non-significant control variables
just to save DF. If you need to make decisions such as dropping a
block of controls, plan that in advance and use an appropriate test
for that, such as an F test. Misspecification tests make sense at this
point as well.

G. Check the model on the holdout sample. (In some cases a different
validation may be necessary or a real experimental replication may be
possible, but the holdout sample is something you can control.)

H. Communicate your walkaway points.
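
For concreteness, here is a minimal sketch of steps B, E, F and G for
the linear regression case. Everything in it is an assumption made for
illustration: the data are simulated, the variable names (c1 and c2 as
controls, s1 as the substantive variable) are hypothetical, and the
small least-squares solver only keeps the example self-contained. In
Stata the same comparison would normally be done with -regress-
followed by -test-.

```python
import random

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X)b = X'y,
    solved by Gauss-Jordan elimination with partial pivoting."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def rss(X, y, b):
    """Residual sum of squares for coefficients b."""
    return sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2
               for r, yi in zip(X, y))

def r2(X, y, b):
    """R^2 of coefficients b on data (X, y); usable out of sample too."""
    mean = sum(y) / len(y)
    return 1 - rss(X, y, b) / sum((v - mean) ** 2 for v in y)

# Simulated data (assumption): c2 has no true effect, s1 does.
random.seed(2013)
n = 300
rows = []
for _ in range(n):
    c1, c2, s1 = (random.gauss(0, 1) for _ in range(3))
    y = 1 + 0.5 * c1 + 0.8 * s1 + random.gauss(0, 1)
    rows.append((c1, c2, s1, y))

# Step B: set aside a random one-third holdout before any fitting.
random.shuffle(rows)
holdout, train = rows[: n // 3], rows[n // 3:]

def design(data, with_substantive):
    """Design matrix: constant, controls, and optionally s1."""
    return [[1.0, c1, c2] + ([s1] if with_substantive else [])
            for c1, c2, s1, _ in data]

# Step E: Controls only versus Controls+Substantive, on training data.
y_train = [r[3] for r in train]
X_ctrl, X_full = design(train, False), design(train, True)
b_ctrl, b_full = ols(X_ctrl, y_train), ols(X_full, y_train)
r2_ctrl, r2_full = r2(X_ctrl, y_train, b_ctrl), r2(X_full, y_train, b_full)

# Step F: F test for the pre-planned block (here the single variable s1).
q = len(b_full) - len(b_ctrl)           # number of restrictions tested
df_resid = len(train) - len(b_full)     # residual DF of the full model
rss_ctrl, rss_full = rss(X_ctrl, y_train, b_ctrl), rss(X_full, y_train, b_full)
f_stat = ((rss_ctrl - rss_full) / q) / (rss_full / df_resid)

# Step G: apply the training coefficients, unchanged, to the holdout.
y_hold = [r[3] for r in holdout]
r2_holdout = r2(design(holdout, True), y_hold, b_full)

print(f"R^2 controls only: {r2_ctrl:.3f}")
print(f"R^2 full model:    {r2_full:.3f}  (change {r2_full - r2_ctrl:.3f})")
print(f"F({q}, {df_resid}) for the substantive block: {f_stat:.2f}")
print(f"Holdout R^2:       {r2_holdout:.3f}")
```

Note that c2 stays in both models whether or not it is significant. The
only thing tested is the block that was planned in advance, and the
holdout is touched once, at the end.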