

From: "JVerkuilen (Gmail)" <jvverkuilen@gmail.com>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: A layman question on model building
Date: Thu, 7 Mar 2013 10:31:52 -0500

On Thu, Mar 7, 2013 at 2:07 AM, James Bernard <jamesstatalist@gmail.com> wrote:
> Hi all,
>
> I have a question which may sound too basic, but I wonder if anyone could help:
>
> We often add control variables that turn out to be insignificant. Does
> that mean that I can remove that variable from my model without being
> concerned with omitted variable bias?

This is far from a basic question in my view. I'll take a look at John Antonakis' article as it looks interesting.

What I tell students in class is something I guess I'd call semi-confirmatory model building. I've not really written this up, so I'd love to hear feedback from the list. This isn't a strict set of steps so much as an organizational principle: make conscious judgements during analysis based on explicitly formulated goals, and minimize post hoc adjustments. A good bit of this is out there, of course. Mostly I'm trying to write it down. You could call it cognitive behavior therapy for statistical modeling, I guess, because it's very consistent with how CBT works.

A. Choose the purpose of your modeling exercise. Decide what you intend to focus on. For instance, what are the three key points you'd like to be able to make when you are done? Unfortunately the statistics literature is highly confusing on this point, so many techniques that are really about pure prediction get a lot of attention even though that's often not what students want to do. There's also a lot of causal language floating around that is misleading. Avoid mission creep as much as you can by having a mission.

B. Choose your variables a priori as much as possible and, insofar as you can, keep a 1/3 randomly selected holdout sample. (This would also be true for other model decisions, such as model type, but I'm largely assuming linear regression here.)

C. Partition them into two sets:

(1) Controls are the ones that "should" be in the model but which you don't really care about.

(2) Substantive variables are the ones you do care about.

You may have a rank ordering within these sets, and this exercise may be difficult, but it's important to do it because it elicits your preference ordering.

D. Do an initial examination of the data for outliers and other gross problems using graphical tools such as QQ plots, scatterplots, loess lines, density estimates, etc.

E. Fit models with Controls only and with Controls+Substantive variables. Compare these using global measures such as R^2 change and any local, more focused approaches that are relevant to your question.

F. Run diagnostics and refine the model. Look for outliers, consider transformations, adding an additional variable, etc. Minimize arbitrary changes such as dropping non-significant control variables just to save DF. If you need to make decisions such as dropping a block of controls, plan that in advance and use an appropriate test for it, such as an F test. Misspecification tests make sense at this point as well.

G. Check the model on the holdout sample. (In some cases a different validation may be necessary or a real experimental replication may be possible, but the holdout sample is something you can control.)

H. Communicate your walkaway points.
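As a rough illustration of steps B, E, F and G, here is a minimal sketch in plain Python rather than Stata. Everything in it is an assumption made for the example and not part of the original post: the data are simulated, the variable names (`c1`, `c2` as controls, `s1` as the substantive variable) and the helpers `ols_fit`, `rss`, `r_squared` and `design` are hypothetical, and the tiny least-squares solver only stands in for whatever regression routine you actually use.

```python
import random

def ols_fit(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def rss(X, y, beta):
    """Residual sum of squares for coefficients beta."""
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

def r_squared(X, y, beta):
    """1 - RSS/TSS, with TSS taken around the mean of the y passed in."""
    ybar = sum(y) / len(y)
    tss = sum((yi - ybar) ** 2 for yi in y)
    return 1 - rss(X, y, beta) / tss

# Simulated data (assumption): c2 is a control with no true effect.
rng = random.Random(2013)
n = 300
rows = []
for _ in range(n):
    c1, c2, s1 = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
    y = 1.0 + 0.5 * c1 + 2.0 * s1 + rng.gauss(0, 1)
    rows.append((c1, c2, s1, y))

# Step B: set aside a randomly selected third as the holdout sample.
rng.shuffle(rows)
n_hold = n // 3
holdout, train = rows[:n_hold], rows[n_hold:]

def design(data, use_substantive):
    """Design matrix: intercept + controls, optionally + substantive variable."""
    X = [[1.0, c1, c2] + ([s1] if use_substantive else [])
         for c1, c2, s1, _ in data]
    return X, [r[3] for r in data]

# Step E: Controls only vs. Controls+Substantive, compared by R^2 change.
X_ctrl, y_train = design(train, False)
X_full, _ = design(train, True)
b_ctrl = ols_fit(X_ctrl, y_train)
b_full = ols_fit(X_full, y_train)
r2_change = r_squared(X_full, y_train, b_full) - r_squared(X_ctrl, y_train, b_ctrl)

# Step F: F statistic for the block of added variables (nested models).
q = len(b_full) - len(b_ctrl)            # number of variables in the block
df_resid = len(y_train) - len(b_full)    # residual df of the larger model
rss_ctrl, rss_full = rss(X_ctrl, y_train, b_ctrl), rss(X_full, y_train, b_full)
f_stat = ((rss_ctrl - rss_full) / q) / (rss_full / df_resid)

# Step G: apply the training coefficients, unchanged, to the holdout sample.
X_hold, y_hold = design(holdout, True)
r2_holdout = r_squared(X_hold, y_hold, b_full)

print(f"R^2 change: {r2_change:.3f}")
print(f"F({q}, {df_resid}) = {f_stat:.1f}")
print(f"Holdout R^2: {r2_holdout:.3f}")
```

The same block F statistic could be used for a pre-planned decision about a block of controls by swapping which columns are in the smaller model. The sketch stops at the F statistic because the Python standard library has no F distribution; the p-value would come from your statistics package. Note that the insignificant-looking control `c2` stays in both models throughout, in line with the advice against dropping controls one at a time.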

**References**:

- **st: A layman question on model building**
  - *From:* James Bernard <jamesstatalist@gmail.com>
