
Re: Re: Re: st: Fitting probit - estat gof puzzling results

From: Clyde B Schechter
To: "statalist@hsphsun2.harvard.edu"
Subject: Re: Re: Re: st: Fitting probit - estat gof puzzling results
Date: Fri, 2 Sep 2011 18:51:36 +0000

As part of a long-running thread, Jesus Gonzalez wrote:

"You said models like this (that discriminate quite well but are not well calibrated), "may still be useful for understanding factors that promote or inhibit applying, even if they are not well calibrated--but such models would not be suitable for some other purposes". My questions are two-fold:
1) If the data generating process does not match with the model I am using, then the assumptions about variable's distribution might not hold (for example probability distribution of residuals). Therefore, every statistic that depends on such assumptions might also be unreliable, and hence, the simplest hypothesis testing might be untrustworthy (for example, if one variable coefficient is statistically different from 0).  Am I right? If I am right why and/or how the model "may still be useful for understanding factors that promote or inhibit applying"?. ....

2) What are the kind of purposes for which such a model might not be suitable? For example, an out-of-sample prediction of the probability of applying given the RHS variables?"

The type of hypothesis testing we do with probit and similar models is, in fact, conditional on the model being properly specified!  The estimation procedures give us the values of the model's parameters that, in some sense (ML, OLS, whatever), best fit the data.  But even the glove that fits your foot better than any other glove won't work very well as a sock.  For some models there is a well-developed science of robustness to violations of assumptions.  I'm not an expert in that area and won't go further in that direction, because my comment was really intended to refer to something different:

There is more to life than hypothesis testing!  Sometimes what we are really after is getting accurate predictions.  Sometimes what we are really interested in is a semi-quantitative or qualitative understanding of relationships among observables.

Try running this example:

// EXAMPLE FOR J GONZALEZ
clear*
set obs 1000
set seed 32673099

// POPULATE X WITH VALUES IN (0, 10]; STARTING AT 0.01 RATHER THAN 0
// KEEPS LOG X (AND HENCE P_TRUE AND Y) DEFINED FOR EVERY OBSERVATION
// CREATE A DICHOTOMOUS OUTCOME Y WHOSE PROBABILITY IS
// CORRECTLY MODELED BY A LOGISTIC RELATIONSHIP TO LOG X
gen x = _n/100
gen lx = log(x)
gen p_true = invlogit(lx)
gen y = (runiform() < p_true)

// A GRAPH TO SHOW THAT A SIMPLE LOGISTIC REGRESSION OF
// Y ON X WOULD BE SERIOUSLY MIS-SPECIFIED
lowess y x, logit

// FIT A LOGISTIC REGRESSION MODEL OF Y TO LOG X
logit y lx
// IT HAS GOOD CALIBRATION AND DISCRIMINATION
// (AND IT IS, BY CONSTRUCTION, THE CORRECT MODEL)
estat gof, group(10) table
lroc, nograph

// FIT A LOGISTIC REGRESSION MODEL OF Y TO X
// IT HAS POOR CALIBRATION, ESPECIALLY FOR LOW
// VALUES OF PREDICTED PROBABILITY
// BUT ITS DISCRIMINATION IS ESSENTIALLY THE SAME
// AS THAT OF THE CORRECT MODEL
logit y x
estat gof, group(10) table
lroc, nograph

// END OF EXAMPLE

The model based on x is mis-specified and poorly calibrated, but it discriminates the outcome y just as well as the true model based on log x.  While inferences based on the coefficients in the x-model would be misleading, this incorrect model is nevertheless useful in one way: it correctly identifies x as an important determinant of y.  It gets the quantitative relationship wrong, but captures the qualitative fact that y and x are strongly associated with each other: higher values of x are associated with greater probabilities of y = 1.  If that determination were our major objective, this model would be fine.  But if the outcome y were, say, an insurable event, you would be in trouble if you relied on this model to set premiums because the predicted probabilities are not close to the actual probabilities.  In fact, for premium-setting, a better model might well be to just use the marginal probability of y: this ultra-simple model would have no discrimination at all (area under ROC = 0.5) but would be perfect for that purpose.  Similar considerations would apply if the purpose of the model is to budget and plan for dealing with occurrences of event y.
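
To see the calibration gap directly, one can put the predicted probabilities from each model next to the true probabilities.  What follows is only a sketch and not part of the original example: it regenerates the same kind of data (with x starting at 0.01 so that log x is always defined), and the variable names p_lx, p_x, p_marg, and xgroup are made up for illustration.

```stata
// SKETCH ONLY: NOT PART OF THE ORIGINAL EXAMPLE
clear
set obs 1000
set seed 32673099
gen x = _n/100              // 0.01 TO 10, SO LOG X IS NEVER MISSING
gen lx = log(x)
gen p_true = invlogit(lx)
gen y = (runiform() < p_true)

// PREDICTED PROBABILITIES FROM THE CORRECT AND THE MIS-SPECIFIED MODEL
quietly logit y lx
predict p_lx, pr
quietly logit y x
predict p_x, pr

// THE "ULTRA-SIMPLE" MODEL: EVERYONE GETS THE MARGINAL PROBABILITY OF Y
quietly summarize y
gen p_marg = r(mean)

// MEAN OBSERVED, TRUE, AND PREDICTED PROBABILITIES WITHIN DECILES OF X
xtile xgroup = x, nquantiles(10)
tabstat y p_true p_lx p_x p_marg, by(xgroup) statistics(mean) format(%6.3f)
```

In the Total row, the mean predictions of all three models should agree with the mean of y, because a logit model with a constant reproduces the overall event rate.  The differences in calibration show up within the deciles, where p_x can be compared against p_true and p_lx.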

This example is obviously contrived, and the nature of the mis-specification could be easily discovered during analysis--but it is, I think, a clearer example of the idea than could be found in real-world data.  And although the example is a bit of a caricature, these same phenomena arise in real work.  Our handy-dandy logistic, probit, and other models, applied to variables selected because they are what we happen to be able to measure, etc., are almost always mis-specifications of the real data generating process.  But depending on the nature and extent of the mis-specification they can be useful for various purposes.  The key thing to remember is that a model that is useful for one purpose may be useless or even dangerous if used for other purposes.  When developing and using models it is always important to bear that in mind and determine their suitability for the purpose at hand.
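
One way the mis-specification in a contrived case like this might be picked up during analysis, besides the lowess graph in the example, is Stata's -linktest-.  The sketch below is an assumption about how one could check it, not something from the original thread; it regenerates similar data with x starting at 0.01 so that log x is defined throughout.

```stata
// SKETCH ONLY: NOT PART OF THE ORIGINAL EXAMPLE
clear
set obs 1000
set seed 32673099
gen x = _n/100
gen lx = log(x)
gen y = (runiform() < invlogit(lx))

// MIS-SPECIFIED MODEL: LINKTEST REFITS Y ON THE LINEAR PREDICTOR (_hat)
// AND ITS SQUARE (_hatsq); A CLEARLY NONZERO _hatsq COEFFICIENT SUGGESTS
// THAT THE FUNCTIONAL FORM IS WRONG
logit y x
linktest

// CORRECT MODEL, FOR COMPARISON
logit y lx
linktest
```

With data like these, one would expect _hatsq to stand out after the x-model and not after the log x model, but that is worth confirming by running it rather than taking on faith.  A check like this only flags certain kinds of mis-specification; it does not say which purposes the model is still fit for.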

Hope this makes it clearer.

Clyde Schechter
Dept. of Family & Social Medicine
Albert Einstein College of Medicine
Bronx, NY, USA
