| Title | Stata 5: Obtaining predicted probabilities after probit | |
| Author | William Sribney, StataCorp |
The programming techniques used in this answer are very simple in the beginning and very advanced at the end. The simple techniques will work fine, so don't think you must master the advanced ones.
Consider the following probit:
. probit y gender age value, nolog
Probit Estimates Number of obs = 16
chi2(3) = 2.23
Prob > chi2 = 0.5253
Log Likelihood = -9.4680749 Pseudo R2 = 0.1055
------------------------------------------------------------------------------
y | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------+--------------------------------------------------------------------
gender | .7505686 1.118497 0.671 0.502 -1.441645 2.942782
age | -.0440944 .0937161 -0.471 0.638 -.2277746 .1395857
value | 2.787841 2.10917 1.322 0.186 -1.346057 6.921739
_cons | -8.506711 5.827864 -1.460 0.144 -19.92912 2.915692
------------------------------------------------------------------------------
The predicted probabilities are given by the formula
pi = F(xi'*beta)
where F is the cumulative normal distribution, xi is the data vector for the i-th observation, and beta is the vector of coefficient estimates.
For this example,
xi = (gender[i], age[i], value[i], 1)
and
beta = (_b[gender], _b[age], _b[value], _b[_cons]).
You can use the _b[gender] syntax to access the coefficients:
. display _b[gender] .75056856
(See [U] 20.5 Accessing coefficients and standard errors and [R] matrix get, for details on accessing coefficients after an estimation command.)
The predicted probabilities can be computed by
. gen phat1 = normprob(_b[gender]*gender + _b[age]*age + _b[value]*value > + _b[_cons])
but doing it this way is unnecessary. The predict command will do it for you:
. predict phat2
. list phat1 phat2
phat1 phat2
1. .1077763 .1077763
2. .1343292 .1343292
3. .1866688 .1866688
4. .198736 .198736
5. .2112619 .2112619
6. .2376565 .2376565
7. .3687274 .3687274
8. .3837354 .3837354
9. .4152546 .4152546
10. .4171898 .4171898
11. .484975 .484975
12. .5157192 .5157192
13. .5828102 .5828102
14. .5854243 .5854243
15. .6044933 .6044933
16. .619428 .619428
Often one wants to evaluate predicted probabilities at the mean of x:
mean of x = (mean of gender, mean of age, mean of value)
This can be done by
. su gender age value
Variable | Obs Mean Std. Dev. Min Max
---------+-----------------------------------------------------
gender | 16 .125 .341565 0 1
age | 16 24.125 4.964877 17 35
value | 16 3.2725 .2139315 3.05 3.7
. display normprob(_b[gender]*0.125 + _b[age]*24.125 + _b[value]*3.2725
> + _b[_cons])
.36187213
This is very cumbersome! There is an easier way: use predict to compute the predicted index, take its mean, and take the normprob() of it:
. predict ihat1, index
. su ihat1
Variable | Obs Mean Std. Dev. Min Max
---------+-----------------------------------------------------
ihat1 | 16 -.3534592 .514132 -1.238441 .3039789
. display _result(3)
-.35345925
. display normprob(_result(3))
.36187209
This is what we got before (to within float precision).
Explanation: The index for the i-th observation is xi'*beta. Thus
. gen ihat2 = _b[gender]*gender + _b[age]*age + _b[value]*value + _b[_cons]
also gives the index.
. list ihat1 ihat2
ihat1 ihat2
1. -1.238441 -1.238441
2. -1.106158 -1.106158
3. -.8902391 -.8902391
4. -.8461447 -.8461447
5. -.8020502 -.8020502
6. -.7138613 -.7138613
7. -.3352256 -.3352256
8. -.2956849 -.2956849
9. -.2140487 -.2140487
10. -.2090879 -.2090879
11. -.0376709 -.0376709
12. .0394123 .0394123
13. .2090879 .2090879
14. .2157901 .2157901
15. .2649949 .2649949
16. .3039789 .3039789
The key to our shortcut is to note that
mean of ihat1 = _b[gender]*(mean of gender) + _b[age]*(mean of age) + _b[value]*(mean of value) + _b[_cons]
One can do what we did initially
. display normprob(_b[gender]*0.125 + _b[age]*24.125 + _b[value]*3.2725 > + _b[_cons])
and just substitute in different values for x = (gender, age, value).
This would be a big pain for a model with lots of independent variables.
In this case, it would be easier to use Stata’s matrix language:
. gen cons = 1
. matrix vecaccum mean = cons gender age value [iw=1/_N]
. matrix list mean
mean[1,4]
gender age value _cons
cons .125 24.125 3.2725 1
CAUTION: Make sure the order of the variables is the same here as
it is in the probit output.
. su gender age value
Variable | Obs Mean Std. Dev. Min Max
---------+-----------------------------------------------------
gender | 16 .125 .341565 0 1
age | 16 24.125 4.964877 17 35
value | 16 3.2725 .2139315 3.05 3.7
. matrix b = get(_b)
. matrix list b
b[1,4]
gender age value _cons
y1 .75056856 -.04409444 2.787841 -8.5067114
. matrix index = mean*b'
. matrix list index
symmetric index[1,1]
y1
cons -.35345924
. display normprob(index[1,1])
.36187209
. matrix x = mean
. matrix x[1,1] = 0
. matrix list x
x[1,4]
gender age value _cons
cons 0 24.125 3.2725 1
. matrix index = x*b'
. display normprob(index[1,1])
.32733634
Here’s a fancier example:
. su gender age value if gender==0
Variable | Obs Mean Std. Dev. Min Max
---------+-----------------------------------------------------
gender | 14 0 0 0 0
age | 14 23.57143 5.079586 17 35
value | 14 3.279286 .2270366 3.05 3.7
. matrix vecaccum x = cons gender age value [iw=1/14] if gender==0
. matrix list x
x[1,4]
gender age value _cons
cons 0 23.571429 3.2792857 1
. matrix index = x*b'
. display normprob(index[1,1])
.34312349
See the following sections of the Stata 5.0 Documentation:
[U] 20.5 Accessing coefficients and standard errors, [R] logit, [R] matrix accum, [R] matrix define, [R] matrix get, [R] predict, [R] summarize