The Stata listserver

st: prediction in binary choice model

From   Alexander Cavallo
Subject   st: prediction in binary choice model
Date   Thu, 18 Aug 2005 14:07:17 -0500

Dear StataList,

I have a question about prediction in the case of binary choice models.

Suppose I estimate a probit model:
        Ystar(i,j) = X(i,j)*BetaX  + W(i,j)*BetaW + Z(j)*Gamma + u(i,j)
        Y(i,j) = 1 if Ystar(i,j)>0 and 0 otherwise
        i indexes persons 
        j indexes countries
        X(i,j) and W(i,j) are characteristics of persons
        Z(j) are characteristics of countries
        u(i,j) are the error terms

I am interested in the simulated effect of changes in X(i,j) and W(i,j)
on the expected number of individuals with Yhat(i,j) = 1.

In particular, suppose W(i,j) is unchanged but, in the simulation, X(i,j) is
replaced by X(i,j) + W(i,j).

There are two ways to do the calculation.

Method 1.  Sum of predicted probabilities
Predict the new probabilities [Psim(i,j)] after changing X(i,j) as 
indicated.  Sum up the probabilities within country.

Method 2.  Sum of predicted counts
Predict the new probabilities after changing X(i,j).  Then set a new
indicator variable to 1 if the simulated probability meets or exceeds the
threshold for country j, and count the 1s within country.
        Ysim(i,j) = 1 if Psim(i,j) >= Cutoff(j)
The literature suggests that the threshold level is arbitrary [see 
Greene's textbook "Econometric Analysis" for a discussion on prediction in 
binary choice].  Suppose I use the naive threshold of 0.50. 
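
To make the two methods concrete, here is a minimal sketch in Python (not Stata), assuming the simulated probabilities have already been predicted and are available as (country, probability) pairs. The function name, the toy data, and the single common cutoff of 0.50 are illustrative assumptions, not part of the model above.

```python
from collections import defaultdict


def aggregate_by_country(records, cutoff=0.50):
    """Aggregate simulated probabilities within country.

    records: iterable of (country, simulated_probability) pairs.
    Returns {country: (sum_of_probabilities, count_at_or_above_cutoff)}.
    """
    prob_sums = defaultdict(float)
    counts = defaultdict(int)
    for country, p in records:
        prob_sums[country] += p               # Method 1: sum of probabilities
        counts[country] += int(p >= cutoff)   # Method 2: count of predicted 1s
    return {c: (prob_sums[c], counts[c]) for c in prob_sums}


# Hypothetical toy data: three persons in country A, two in country B.
toy = [("A", 0.30), ("A", 0.60), ("A", 0.55), ("B", 0.45), ("B", 0.48)]
result = aggregate_by_country(toy)
for country, (prob_sum, count) in sorted(result.items()):
    print(country, round(prob_sum, 2), count)
```

In the toy data, country B has nobody at or above the cutoff even though its probabilities sum to nearly one person, which is the kind of gap between the two methods described here.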

I find very different results using the two methods.  Here is a stylized
example:

Assume that there are 1000 observations, that the baseline predicted
probabilities are uniform on the interval [0.20, 0.70], and that the
threshold for prediction is 0.50.  In this case there are 400 observations
with predicted outcome 1.  The sum of predicted probabilities in the
baseline case is 1000 times the mean probability, i.e. 1000 times the
integral from 0.20 to 0.70 of x*2*dx (the uniform density on this interval
is 1/0.50 = 2), which is 450.

Suppose that the change increases each predicted probability by 10
percentage points.  Then the new count of obs above the threshold is 600,
and the sum of predicted probabilities is 550.

Here are the results of the exercise:

                        Count of obs         Sum of
                        with predicted       predicted
                        outcome 1            probabilities
        Baseline        400                  450
        Simulated       600                  550
        Delta           +200                 +100
        % Delta         +50%                 +22%
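
The stylized example can be checked numerically. The sketch below, in Python, assumes the 1000 probabilities sit on an evenly spaced grid over [0.20, 0.70], which stands in for the uniform distribution; the variable names are illustrative. Note that the sum of probabilities is 1000 times the mean probability, and with the uniform density 1/(0.70 - 0.20) = 2 that mean is 0.45 at baseline, so the grid gives sums of about 450 and 550, while the relative change stays near 22%.

```python
n = 1000
cutoff = 0.50
shift = 0.10  # each probability rises by 10 percentage points

# Evenly spaced midpoints on [0.20, 0.70] (assumed stand-in for "uniform").
baseline = [0.20 + 0.50 * (k + 0.5) / n for k in range(n)]
simulated = [p + shift for p in baseline]

base_count = sum(p >= cutoff for p in baseline)    # Method 2, baseline
sim_count = sum(p >= cutoff for p in simulated)    # Method 2, simulated
base_sum = sum(baseline)                           # Method 1, baseline
sim_sum = sum(simulated)                           # Method 1, simulated

count_pct = 100.0 * (sim_count - base_count) / base_count
sum_pct = 100.0 * (sim_sum - base_sum) / base_sum

print("count:", base_count, "->", sim_count, f"({count_pct:+.0f}%)")
print("sum:  ", round(base_sum, 1), "->", round(sim_sum, 1), f"({sum_pct:+.0f}%)")
```

The count responds more strongly here because the shift moves a dense band of observations (those between 0.40 and 0.50) across the cutoff all at once, whereas the sum only gains 0.10 per observation.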

My questions are:
1.  What is the right way to form aggregate level predictions for baseline 
and simulated data?
2.  I could change the threshold Cutoff(j) so that the count of obs with 
P(i,j)>Cutoff(j) matches the sample count of 1s.  Is there a literature on 
the optimal threshold for prediction?
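
The calibration in question 2 could be sketched as follows, in Python and for a single country. This is only one possible rule, under the assumption that the cutoff is set to the n-th largest baseline probability, where n is the sample count of 1s; the function name and the toy numbers are hypothetical, and ties at the cutoff would make the match inexact.

```python
def matching_cutoff(probs, n_ones):
    """Return a cutoff c such that the number of probs >= c equals n_ones,
    barring ties at the cutoff (illustrative rule, not an optimal one)."""
    if n_ones <= 0:
        return float("inf")            # nobody is predicted to be a 1
    ranked = sorted(probs, reverse=True)
    n_ones = min(n_ones, len(ranked))  # cannot predict more 1s than obs
    return ranked[n_ones - 1]          # the n-th largest probability


# Hypothetical baseline probabilities for one country with 2 observed 1s.
baseline_probs = [0.15, 0.35, 0.42, 0.48, 0.61]
observed_ones = 2

cutoff_j = matching_cutoff(baseline_probs, observed_ones)
count_naive = sum(p >= 0.50 for p in baseline_probs)
count_matched = sum(p >= cutoff_j for p in baseline_probs)
print(cutoff_j, count_naive, count_matched)
```

Under this rule the cutoff would be fixed from the baseline data and then held constant when counting predicted 1s in the simulated data.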

Thanks for your help!

--Alex Cavallo

Managing Consultant
Navigant Consulting, Inc.
