Stata The Stata listserver

st: RE: Significance vs prediction


From   "Naji Nassar (MIReS)" <naji.nassar@mires.fr>
To   <statalist@hsphsun2.harvard.edu>
Subject   st: RE: Significance vs prediction
Date   Wed, 10 Mar 2004 13:35:29 +0100

David,

I usually weigh the gain from a better RMSE against the cost of the
extra variables (not counting any cost for cognition).

Ideas
- Test both models on data that haven't been used for model estimation ->
compare RMSE
- Robust regression: excluding some extreme values (can they be explained?)
- Absolute deviations rather than squares
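
The first idea can be sketched in a few lines. This is only an illustration in plain Python, not Stata code: the data are synthetic stand-ins (107 rows, 8 predictors, chosen to mirror the sample size in the question), and the helper names `fit_ols`, `predict`, `rmse` and `holdout_rmse` are made up for this example. Each model is fitted on the first 80 rows and scored on the remaining 27.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y, cols):
    """Least-squares coefficients (intercept first) using only the predictors in cols."""
    X = [[1.0] + [r[c] for c in cols] for r in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(beta, rows, cols):
    """Fitted values for the given rows, using the same predictor columns as the fit."""
    return [beta[0] + sum(b * r[c] for b, c in zip(beta[1:], cols)) for r in rows]


def rmse(y, yhat):
    """Root mean squared prediction error (divides by n, no degrees-of-freedom correction)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))


def holdout_rmse(rows, y, cols, n_train):
    """Fit on the first n_train rows, then score on the rows that were held out."""
    beta = fit_ols(rows[:n_train], y[:n_train], cols)
    return rmse(y[n_train:], predict(beta, rows[n_train:], cols))


# Synthetic stand-in data (an assumption, not the data from the question).
rng = random.Random(0)
rows = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(107)]
y = [1.0 + 2.0 * r[0] - 1.5 * r[1] + 0.8 * r[2] + 0.5 * r[3] + 0.4 * r[4]
     + 0.1 * r[5] + rng.gauss(0, 0.9) for r in rows]

full = holdout_rmse(rows, y, list(range(8)), 80)
small = holdout_rmse(rows, y, list(range(5)), 80)
print(f"hold-out RMSE, 8 predictors: {full:.3f}")
print(f"hold-out RMSE, 5 predictors: {small:.3f}")
```

With a sample this small a single split is noisy, so repeating the split (or cross-validating) before preferring one model would be prudent.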

Best
Naji
-----Original Message-----
From: owner-statalist@hsphsun2.harvard.edu
[mailto:owner-statalist@hsphsun2.harvard.edu] On Behalf Of David Vaughan
Sent: Wednesday, 10 March 2004 08:11
To: statalist@hsphsun2.harvard.edu
Subject: st: Significance vs prediction


I know this is pretty simple but the answer is not obvious in my old
texts and in business I have no expert colleague to whom to turn.
My purpose is to construct a model which will be used for best-possible
prediction from new input data.
I constructed a regression model, based on historical understanding of
the domain, using eight predictors and obtained the following data
about the model:
F(8,98) = 35.15
Adj R-squared = 0.7205
RMSE = 0.90373

I noted that three of the predictors had P>|t| around 0.2-0.24.
Eliminating those gave me model results:
F(5,101) = 54.49
Adj R2 = 0.7295
RMSE = 0.91067
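
A quick arithmetic check on the two sets of summary statistics may help when comparing them. The sketch below is plain Python, not Stata, and assumes both models include an intercept and were fitted to the same 107 observations (implied by the F degrees of freedom). It uses the standard identities R-squared = kF / (kF + df) and adjusted R-squared = 1 - (1 - R-squared)(n - 1) / df; the function names are made up for this example.

```python
def r2_from_f(f, k, df_resid):
    """R-squared implied by the overall F statistic with k and df_resid degrees of freedom."""
    return k * f / (k * f + df_resid)


def adj_r2(r2, k, df_resid):
    """Adjusted R-squared for a model with an intercept and k predictors."""
    n = k + df_resid + 1
    return 1 - (1 - r2) * (n - 1) / df_resid


def implied_outcome_variance(rmse, adj):
    """Sample variance of the outcome implied by RMSE and adjusted R-squared."""
    return rmse ** 2 / (1 - adj)


# (label, F, k, residual df, RMSE) as reported in the message.
models = [
    ("8 predictors", 35.15, 8, 98, 0.90373),
    ("5 predictors", 54.49, 5, 101, 0.91067),
]

for label, f, k, df, rmse in models:
    r2 = r2_from_f(f, k, df)
    adj = adj_r2(r2, k, df)
    var_y = implied_outcome_variance(rmse, adj)
    print(f"{label}: R2={r2:.4f}  adj R2={adj:.4f}  implied var(y)={var_y:.3f}")
```

Under those assumptions the larger model gives an adjusted R-squared of about 0.7205, matching the reported value, while the smaller model gives about 0.716. The reported 0.7295 matches the smaller model's unadjusted R-squared instead. If so, the smaller model has both the higher RMSE and the lower adjusted R-squared, and the two measures do not actually disagree.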

So significance has gone up but so has error. I assume that the larger
model over-fits the data and, if I were arguing about causality,
would prefer the more compact model. Yet, it seems that the larger
model just does a slightly better job of prediction. How do I think
about this? Generally, where do I stop in a predictive problem (there
are other inputs available)? Should I care that much about a minor RMSE
difference or just do a "judgement" check on error differences on new
data? I also did a decent (N=1000) bootstrap on the larger model and
confidence intervals around all the predictors appeared reasonable for
our purpose. Either of the above models serves better than our previous
approach although it seems (opinion) that the larger model does better
at the extremes.
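
For readers unfamiliar with the bootstrap mentioned above, here is a minimal sketch of a percentile interval with 1000 resamples. It is plain Python rather than Stata, uses a single predictor for brevity, and runs on synthetic data; the resampling scheme (whole observations, percentile limits) is an assumption, since the message does not say which variant was used. The names `slope` and `bootstrap_ci` are made up for this example.

```python
import random


def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs (model with an intercept)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def bootstrap_ci(xs, ys, reps=1000, level=0.95, seed=1):
    """Percentile bootstrap interval for the slope, resampling (x, y) pairs with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    draws = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        draws.append(slope([xs[i] for i in idx], [ys[i] for i in idx]))
    draws.sort()
    tail = (1 - level) / 2
    lo = draws[int(tail * reps)]
    hi = draws[int((1 - tail) * reps) - 1]
    return lo, hi


# Synthetic stand-in data (an assumption, not the data from the message).
rng = random.Random(42)
xs = [rng.gauss(0, 1) for _ in range(107)]
ys = [0.5 + 1.2 * x + rng.gauss(0, 0.9) for x in xs]

lo, hi = bootstrap_ci(xs, ys)
print(f"slope estimate: {slope(xs, ys):.3f}")
print(f"95% percentile interval: [{lo:.3f}, {hi:.3f}]")
```

An interval like this describes uncertainty in a coefficient, not out-of-sample prediction error, so it complements a check on new data rather than replacing it.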

Talking to myself, I wonder if I just need more data for analysis
(painful process) but is there a statistical approach to focussing on
that extreme-edge issue? Perhaps I should be looking for another
inflection point in the model - we have already found one at the other
end, which I omitted from the above for brevity. If so, how does one
find it other than by trial?
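
One systematic version of "trial" is to profile the residual sum of squares over a grid of candidate breakpoints and keep the candidate that fits best. The sketch below is plain Python, not Stata, and assumes a single predictor with one change of slope (a hinge term max(0, x - c)); the data, the grid and the names `sse_with_break` and `best_break` are illustrative assumptions only.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def sse_with_break(xs, ys, c):
    """Residual sum of squares for y = b0 + b1*x + b2*max(0, x - c), fitted by least squares."""
    X = [[1.0, x, max(0.0, x - c)] for x in xs]
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(X, ys)) for i in range(3)]
    beta = solve(xtx, xty)
    return sum((y - sum(b * v for b, v in zip(beta, r))) ** 2 for r, y in zip(X, ys))


def best_break(xs, ys, candidates):
    """Candidate breakpoint with the smallest residual sum of squares."""
    return min(candidates, key=lambda c: sse_with_break(xs, ys, c))


# Synthetic data with a true change of slope at x = 6 (an assumption for illustration).
rng = random.Random(3)
xs = [i * 0.5 for i in range(21)]
ys = [1.0 + 0.5 * x + 2.0 * max(0.0, x - 6.0) + rng.gauss(0, 0.3) for x in xs]

candidates = [2.0 + 0.5 * i for i in range(13)]  # 2.0, 2.5, ..., 8.0
print("estimated breakpoint:", best_break(xs, ys, candidates))
```

Candidates are kept away from the ends of the data so that each segment has several observations; how far to stay from the ends is a judgment call.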

Any advice or reading directions welcome.

thanks
David

*
*   For searches and help try:
*   http://www.stata.com/support/faqs/res/findit.html
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/




