# st: prediction in binary choice model

 From Alexander Cavallo To statalist@hsphsun2.harvard.edu Subject st: prediction in binary choice model Date Thu, 18 Aug 2005 14:07:17 -0500

```
Dear StataList,

I have a question about prediction in the case of binary choice models.

Suppose I estimate a probit model:
Ystar(i,j) = X(i,j)*BetaX  + W(i,j)*BetaW + Z(j)*Gamma + u(i,j)
Y(i,j) = 1 if Ystar(i,j)>0 and 0 otherwise
where
i indexes persons
j indexes countries
X(i,j) and W(i,j) are characteristics of persons
Z(j) are characteristics of countries
u(i,j) are the error terms

I am interested in the simulated effects of changes in X(i,j) and W(i,j)
on the expected number of individuals with Yhat(i,j) = 1.

In particular, suppose W(i,j) is unchanged but in the simulation, X(i,j) =
X(i,j) + W(i,j).

There are two ways to do the calculation.

Method 1.  Sum of predicted probabilities
Predict the new probabilities [Psim(i,j)] after changing X(i,j) as
indicated.  Sum up the probabilities within country.

Method 2.  Sum of predicted count
Predict the new probabilities after changing X(i,j).  Then set a new
indicator variable to 1 if the simulated probability reaches the threshold
for country j, and count the 1s within country.
Ysim(i,j) = 1 if Psim(i,j) >= Cutoff(j)
The literature suggests that the threshold level is arbitrary [see
Greene's textbook "Econometric Analysis" for a discussion on prediction in
binary choice].  Suppose I use the naive threshold of 0.50.

I find very different results using the two methods.  Here is a stylized
example.

Assume that there are 1000 observations and that baseline predicted
probabilities are uniform on the interval [0.20, 0.70] and that the
threshold for prediction is 0.50.  In this case there are 400 observations
with predicted outcome 1.  The sum of predicted probabilities in the
baseline case is 1000 times the mean probability, i.e. 1000 times the
integral from 0.20 to 0.70 of x*(1/0.50)*dx, which is 450.

Suppose that the change causes an increase in each predicted probability
of 10 percentage points.  Then the new count of obs above the threshold is
600, and the sum of predicted probabilities is 550.

Here are the results of the exercise:

                Count of obs         Sum of predicted
                with P(i,j)>0.50     probabilities
Baseline        400                  450
Simulated       600                  550
Delta           +200                 +100
% Delta         +50%                 +22%

My questions are:
1.  What is the right way to form aggregate level predictions for baseline
and simulated data?
2.  I could change the threshold Cutoff(j) so that the count of obs with
P(i,j)>Cutoff(j) matches the sample count of 1s.  Is there a literature on
the optimal threshold for prediction?

--Alex Cavallo

Managing Consultant
Navigant Consulting, Inc.

```
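The stylized example in the post can be checked numerically. The following is a minimal Python sketch, not the poster's code: it assumes an evenly spaced grid of 1000 values as a stand-in for "uniform on [0.20, 0.70]", and the function names (`uniform_grid`, `sum_of_probabilities`, `count_above_cutoff`, `matched_cutoff`) are hypothetical. It computes both aggregates for the baseline and shifted probabilities, and also illustrates the idea in question 2 of picking a cutoff so that the predicted count equals a given number of 1s.

```python
# Stylized example: 1000 predicted probabilities spread evenly over [0.20, 0.70].
# The evenly spaced grid is an assumption standing in for "uniform".

def uniform_grid(lo, hi, n):
    """Return n evenly spaced midpoints covering the interval [lo, hi]."""
    width = hi - lo
    return [lo + width * (k + 0.5) / n for k in range(n)]

def sum_of_probabilities(probs):
    """Method 1: the expected number of 1s is the sum of predicted probabilities."""
    return sum(probs)

def count_above_cutoff(probs, cutoff=0.50):
    """Method 2: classify each observation, then count the predicted 1s."""
    return sum(1 for p in probs if p >= cutoff)

def matched_cutoff(probs, n_ones):
    """Cutoff at which the predicted count equals n_ones (assuming no ties)."""
    if n_ones <= 0:
        return float("inf")  # no observation can reach an infinite cutoff
    ranked = sorted(probs, reverse=True)
    return ranked[min(n_ones, len(ranked)) - 1]

baseline = uniform_grid(0.20, 0.70, 1000)
simulated = [p + 0.10 for p in baseline]  # every probability rises by 10 points

for label, probs in (("baseline", baseline), ("simulated", simulated)):
    print(label, count_above_cutoff(probs), round(sum_of_probabilities(probs), 2))

# Hypothetical sample count of 1s, taken here as the rounded baseline sum.
n_ones = round(sum_of_probabilities(baseline))
cutoff = matched_cutoff(baseline, n_ones)
print("matched cutoff", round(cutoff, 4), count_above_cutoff(baseline, cutoff))
```

Because the sum of probabilities is just the number of observations times the mean probability, a uniform shift of 10 points moves it by 100 per 1000 observations regardless of the cutoff, while the thresholded count depends entirely on how much probability mass sits near the cutoff. That is why the two methods can give very different percentage changes.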