RE: st: Predict in version 11

From	Nick Cox <[email protected]>
To	"'[email protected]'" <[email protected]>
Subject	RE: st: Predict in version 11
Date	Wed, 8 Dec 2010 16:11:11 +0000

OK. Looking at your example again, I think I can see your point. Why does -predict- think it can go ahead when x1 == 4? The key point is that -x1- is treated as categorical, so there's no information on that category in the data used to fit. The use of a baseline category instead, if that is what is happening, may be a fair default, but statistically it seems arbitrary.

(I was tacitly thinking in regression terms and finding no difficulty in the idea of a prediction for values of the predictors that don't occur elsewhere. But this is ANOVA with a categorical predictor.)

Nick 

From: Marnix Zoutenbier

Dear Nick,

Thank you for your response. However, your solution is not what I mean. I
want to predict for both testset==1 and testset==2, but I want Stata to
predict a missing value when x1=4 in testset==2, because x1=4
does not appear in testset==1.

However, in version 11 Stata also predicts in testset==2 for values of x1
that do not appear in testset==1 (the training set). Stata uses the constant
to predict, which, I think, is very confusing in large datasets. In version 10,
Stata predicts a missing value in those cases, which is, in my opinion, the
proper way to proceed.


From: Nick Cox

The solution appears to be just a twist away from that already given.

... if testset == 1

Otherwise put, -predict- allows -if- (and -in-), so just specify whatever
restrictions you want.

From: Marnix Zoutenbier

Neil's reply is correct. However, it shows that I did not formulate my
problem accurately, because it is not a solution that works for me.

Let me extend the example with one extra observation to make myself
clearer:
x1   testset   y
1    1         12
2    1         13
3    1         14
4    2         .
3    2         .

So the last observation has the same value of x1 as the third
observation. The test set (testset==2) consists of 2 observations. Of
these, the observation with x1=3 can be predicted from the training set
(testset==1), but the observation with x1=4 cannot be predicted because
x1=4 is not in the training set.

First, in version 11:
version 11
anova y x1 if testset==1
predict yhat

This gives the following result in version 11:
x1   testset   y    yhat
1    1         12   12
2    1         13   13
3    1         14   14
4    2         .    12
3    2         .    14

Now, in version 10:
version 10
anova y x1 if testset==1
predict yhat

This gives the following result:
x1   testset   y    yhat
1    1         12   12
2    1         13   13
3    1         14   14
4    2         .    .
3    2         .    14

This problem is not fixed by the 'e(sample)' suggestion, because I do
want to predict in the test set (outside e(sample)). However, I only want
predictions for values of x1 that are used in the training set (testset==1).
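
One way to get this behaviour in version 11 without switching -version- might be to predict everywhere and then set the prediction to missing for any level of x1 that was not used in the fit. This is only a minimal sketch, not an official fix; the variable names -used- and -intrain- are hypothetical and do not come from the thread.

```stata
* Fit on the training set only
anova y x1 if testset == 1

* Flag the observations actually used in the fit (must follow -anova- directly)
generate byte used = e(sample)

* Predict for all observations, in and out of sample
predict yhat

* intrain is 1 for every observation whose level of x1 occurs in the fit sample
egen byte intrain = max(used), by(x1)

* Blank out predictions for levels of x1 absent from the training set
replace yhat = . if !intrain
```

With the five-observation example above, this should leave yhat missing for the observation with x1=4 and keep the prediction of 14 for the test observation with x1=3.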

From: Neil Shephard

On Wed, Dec 8, 2010 at 9:58 AM, Marnix Zoutenbier wrote:

> I see a difference in the way predict works between Stata10 and 11.
>
> Consider the following example
> x1      testset         y
> 1       1       12
> 2       1       13
> 3       1       14
> 4       2       .
>
> And the commands
> anova y x1 if testset==1
> predict yhat
>
> The following is the result in version 11
> x1      testset         y       yhat
> 1       1       12      12
> 2       1       13      13
> 3       1       14      14
> 4       2       .       12
>
> While in version 10 the following dataset results
> x1      testset         y       yhat
> 1       1       12      12
> 2       1       13      13
> 3       1       14      14
> 4       2       .       .
>
> I prefer the version 10 way of working, because it gives me the opportunity
> to identify observations that are in the testset (testset==2) and not in
> the trainingset (testset==1).
>
> Is it possible to obtain the same result in version 11 as in version 10,
> other than switching with the version command before and after predict?


Yes, see the -man predict- page
(http://www.stata.com/help.cgi?predict), items 6 and 7 in the
Description section near the top...

    predict can be used to make in-sample or out-of-sample predictions:

        6.  predict calculates the requested statistic for all possible
            observations, whether they were used in fitting the model or
            not.  predict does this for the standard options 1 through 3
            and generally does this for estimator-specific options 4.

        7.  predict newvar if e(sample), ...  restricts the prediction to
            the estimation subsample.


So in your above example under Stata 11 you should use...

predict yhat if(e(sample))
