Notice: On April 23, 2014, Statalist moved from an email list to a forum, based at statalist.org.


From: "Marnix Zoutenbier" <Marnix.Zoutenbier@cqm.nl>
To: statalist@hsphsun2.harvard.edu
Subject: Re: st: Predict in version 11
Date: Fri, 10 Dec 2010 11:40:51 +0100

Dear Jeff, Nick, Neil,

** Short reply:
Thank you very much for your help with respect to -predict- after -anova-
when values of x in the testset are outside the domain of the trainingset.
I understand the way Stata 11 works and why this was chosen to be different
from Stata 10.

** Some extra background for those who are interested
In our project we were dealing with a trainingset of 500k observations and
a testset of 50k observations from which the measurements were hidden from
us. Our model consisted of many different categorical regressors, some of
them with 10-20 categories, and it also included 2-, 3-, and 4-factor
interactions. We assumed, based on our experience with Stata 10, that
combinations of the regressors in the testset that were not in the
trainingset were predicted with a missing value. The feedback we obtained
in terms of overall RMSE in the testset was much worse than we expected
based on the trainingset results. The reason is now clear to us: -predict-
predicts the base value if the combination of regressors is not estimated
in the trainingset. We did not realize that, and it increased the RMSE in
the testset considerably.

I am very happy that we found the reason and are able to fix it. Thank you
very much for your help in this process,

Best regards,
Marnix

______________________
Drs. Marnix Zoutenbier MTD CIRM
Senior Consultant
T: +31 (0)40 750 23 25
F: +31 (0)40 750 16 99
E: zoutenbier@cqm.nl
CQM B.V.
PO Box 414, 5600 AK Eindhoven, The Netherlands
Vonderweg 16, 5616 RM Eindhoven, The Netherlands
KvK 17076484
I: www.cqm.nl

From: jpitblado@stata.com (Jeff Pitblado, StataCorp LP)
To: statalist@hsphsun2.harvard.edu
Date: 08-12-2010 20:11
Subject: Re: st: Predict in version 11
Sent by: owner-statalist@hsphsun2.harvard.edu

Marnix Zoutenbier <Marnix.Zoutenbier@cqm.nl> is using -predict- after
-anova- and noticed that Stata 11 will now produce a non-missing value in
out-of-sample observations where a factor variable takes on values not
observed within the estimation sample:

> I see a difference in the way predict works between Stata10 and 11.
>
> Consider the following example
>   x1   testset    y
>    1         1   12
>    2         1   13
>    3         1   14
>    4         2    .
>
> And the commands
>   anova y x1 if testset==1
>   predict yhat
>
> The following is the result in version 11
>   x1   testset    y   yhat
>    1         1   12     12
>    2         1   13     13
>    3         1   14     14
>    4         2    .     12
>
> While in version 10 the following dataset results
>   x1   testset    y   yhat
>    1         1   12     12
>    2         1   13     13
>    3         1   14     14
>    4         2    .      .
>
> I prefer the version 10 way-of-working, because it gives me the opportunity
> to identify observations that are in the testset (testset==2) and not in
> the trainingset (testset==1).
>
> Is it possible to obtain the same result in version 11 as in version 10,
> other than switching with the version command before and after predict?
>
> Thank you for your consideration,

Short reply:

Except under version control, as noted above by Marnix, there is no option
of -predict- to get it to behave like it did in Stata 10.

As with out-of-sample predictions involving continuous predictors, Stata 11
relies on the data analyst to judge which predictions are meaningful or
even valid. Both Neil Shephard <nshephard@gmail.com> and Nick Cox
<n.j.cox@durham.ac.uk> point out that -predict- allows -if- and -in-
restrictions, giving the data analyst the control to identify the
observations for which to compute predictions.
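
The -if- restriction mentioned in the short reply can be sketched as
follows. This is a minimal sketch, not taken from the original posts: it
assumes a single factor 'x1', reuses Marnix's four-observation example, and
introduces the hypothetical helper variables 'insample' and 'seen'. It
marks every level of 'x1' that occurs in the estimation sample and computes
predictions only for those levels, so a level that appears only in the
testset (here x1==4) is left missing, much like the Stata 10 result.

***** BEGIN:
* Marnix's example data
clear
input x1 testset y
1 1 12
2 1 13
3 1 14
4 2 .
end

anova y x1 if testset==1

* insample and seen are hypothetical helper variables
generate byte insample = e(sample)
bysort x1: egen byte seen = max(insample)

* predict only where the level of x1 was part of the fit
predict yhat if seen
list x1 testset y yhat
***** END:

With several factors, listing all of them after -bysort- would flag
combinations that were never observed together in the trainingset; whether
that is the right rule for a particular model with interactions is left to
the analyst.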

Longer reply:

Prior to Stata 11, -anova- and -manova- were the only estimation commands
that possessed logic to handle categorical variables, but even they had
some limitations we intended to address with the new factor variables
notation. For example, controlling the base level and level restrictions
were not allowed with -anova- and -manova- without generating modified
copies of the factor variables. The new factor variables notation also
replaced and expanded on the features of the -xi- prefix, which produced
indicator variables for categorical variables and some two-way
interactions.

One of our goals for the new factor variables notation was to get all of
Stata's official estimation commands to support categorical variables and
their interactions consistently. Thus -anova- and -manova- were updated to
possess the same features of their linear models counterparts, -regress-
and -mvreg-.

The new factor variables notation allows you to specify which levels to
include in a model fit. Using Marnix's data, let's fit an ANOVA model
where we only care about the effect of x1=1 compared to all the other
levels. In Stata 11 we simply type

***** BEGIN:
. anova y 1.x1

                Number of obs =       3     R-squared     =  0.7500
                Root MSE      = .707107     Adj R-squared =  0.5000

       Source |  Partial SS    df       MS           F     Prob > F
   -----------+----------------------------------------------------
        Model |         1.5     1         1.5       3.00     0.3333
              |
           x1 |         1.5     1         1.5       3.00     0.3333
              |
     Residual |          .5     1          .5
   -----------+----------------------------------------------------
        Total |           2     2           1

. mat li e(b)

e(b)[1,2]
            1.
           x1  _cons
y1       -1.5   13.5
***** END:

We see that -anova- used all observations where 'x1' and 'y' were not
missing, fitting an intercept '_cons' and a coefficient on '1.x1'. '1.x1'
is factor variables notation for an implied variable that indicates when
'x1' is equal to 1.

Here are the linear predictions:

***** BEGIN:
. predict yhat1 if e(sample)
(option xb assumed; fitted values)
(1 missing value generated)

. list

     +---------------------------+
     | x1   testset    y   yhat1 |
     |---------------------------|
  1. |  1         1   12      12 |
  2. |  2         1   13    13.5 |
  3. |  3         1   14    13.5 |
  4. |  4         2    .       . |
     +---------------------------+
***** END:

Notice that -predict- treated levels 2 and 3 the same, so we get their
average response back as the linear prediction. This is in accordance with
a linear regression model with a single indicator variable that identifies
when 'x1' is equal to 1.

Here are the commands to reproduce the above using -regress-, but without
factor variables notation:

***** BEGIN:
. gen x1is1 = x1==1

. regress y x1is1

      Source |       SS       df       MS              Number of obs =       3
-------------+------------------------------           F(  1,     1) =    3.00
       Model |         1.5     1         1.5           Prob > F      =  0.3333
    Residual |          .5     1          .5           R-squared     =  0.7500
-------------+------------------------------           Adj R-squared =  0.5000
       Total |           2     2           1           Root MSE      =  .70711

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       x1is1 |       -1.5   .8660254    -1.73   0.333     -12.5039    9.503896
       _cons |       13.5         .5    27.00   0.024     7.146898     19.8531
------------------------------------------------------------------------------

. predict ryhat1 if e(sample)
(option xb assumed; fitted values)
(1 missing value generated)

. list

     +--------------------------------------------+
     | x1   testset    y   yhat1   x1is1   ryhat1 |
     |--------------------------------------------|
  1. |  1         1   12      12       1       12 |
  2. |  2         1   13    13.5       0     13.5 |
  3. |  3         1   14    13.5       0     13.5 |
  4. |  4         2    .       .       0        . |
     +--------------------------------------------+
***** END:

Since we did not use factor variables notation, we can reproduce the result
in Stata 10 or Stata 11; we can even use -anova- instead of -regress-.
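
As a hedged illustration of that last remark, the sketch below (not part of
the original reply) fits the same indicator model with -anova- and checks
that its in-sample predictions agree with those from -regress-. It rebuilds
Marnix's data so that it runs on its own; 'ayhat1' is a hypothetical
variable name, and the tolerance passed to -assert- is an arbitrary choice.

***** BEGIN:
clear
input x1 testset y
1 1 12
2 1 13
3 1 14
4 2 .
end

gen x1is1 = x1==1

regress y x1is1
predict ryhat1 if e(sample)

* -anova- treats x1is1 as a two-level categorical variable
anova y x1is1
predict ayhat1 if e(sample)

* stops with an error if the two sets of predictions differ
assert reldif(ryhat1, ayhat1) < 1e-6 if e(sample)
***** END:

Because the indicator is built by hand, no factor variables notation is
involved, so the same commands should be usable under either version.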

--Jeff                   --Ken
jpitblado@stata.com      khigbee@stata.com
*
*   For searches and help try:
*   http://www.stata.com/help.cgi?search
*   http://www.stata.com/support/statalist/faq
*   http://www.ats.ucla.edu/stat/stata/

