[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

From |
vwiggins@stata.com (Vince Wiggins, StataCorp) |

To |
statalist@hsphsun2.harvard.edu |

Subject |
Re: st: How much can we trust Stata's non-linear solver(s)? |

Date |
Thu, 28 Aug 2003 10:01:07 -0500 |

NOTE: We tried sending this post yesterday, but it turns out to be larger than the limits set for statalist's majordomo. We are sending it today in two pieces. This is the second piece. If you have not yet read or received the first piece, wait and read it first. +------+ |Part 2| +------+ Continuation of post about Joachim Wagner <wagner@uni-lueneburg.de> questions regarding McCullough and Vinod's AER paper. --------------------------- Preface to long answer to 1 --------------------------- First, let me note that we do not believe that Stata was among the packages used in the article. The author's are understandably maintaining anonymity of the packages and that includes not divulging who is not in the study. They will, however, graciously supply replication datasets and the code used to produce the published results. From those we have coded a user-written maximum likelihood (-ml-) estimator for the custom estimator problem presented in the paper. Stata's behaviour does not match any of the reported behaviours, though it is closest to the reported "gold standard". From this we infer that Stata was probably not among the tested packages. ------------------------------------------ Longer answer to 1 -- the probit estimator ------------------------------------------ Madalla's data exhibits a very common "problem" with binary outcome data -- the outcome (d1) is always "success" (d1==1) whenever one of the indicator regressors (d2) is 0. This means that d2==0 perfectly predicts "success" (d1==1). To capture this, the probit estimator must drive the coefficient on d2 to infinity, or something large relative to the other coefficients, so as to override the effect of all the other regressors. Once the coefficient on d2 gets that large, the observation has no effect whatsoever on the likelihood and thus the coefficient becomes irrelevant. This can cause problems for maximization algorithms. Operationally, it is not as problematic as it sounds, but it is better to be safe and identify "perfect predictors" up front and remove them and their data from the estimation. Yes, removing them is the right thing to do. If the maximizer drives their coefficient to where it needs to be to maximize the likelihood those observations will have a probability of 1 and contribute nothing to the likelihood. Put another way, there is no statistical information, at least w.r.t. the likelihood, in these observations. Their contribution to the likelihood can be replaced by the simple rule, if d1==0 then d2=1. This is exactly what Stata does and has done for over 12 years. (To be complete, it is possible to set up permutation tests for coefficients of "perfect predictors", but this is different and outside the scope of maximum likelihood estimation.) Here are the results of running -probit- in Stata on the data from Madalla. ------------------------- Begin Stata probit results ------------------------ . probit d1 t y lf nw d2 note: d2 != 0 predicts success perfectly d2 dropped and 15 obs not used Iteration 0: log likelihood = -17.961912 Iteration 1: log likelihood = -12.039252 Iteration 2: log likelihood = -9.4820411 Iteration 3: log likelihood = -8.7547721 Iteration 4: log likelihood = -8.6442841 Iteration 5: log likelihood = -8.6403866 Iteration 6: log likelihood = -8.6403799 Probit estimates Number of obs = 29 LR chi2(4) = 18.64 Prob > chi2 = 0.0009 Log likelihood = -8.6403799 Pseudo R2 = 0.5190 ------------------------------------------------------------------------------ d1 | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- t | .0113124 .0056558 2.00 0.045 .0002272 .0223975 y | 6.461118 3.148348 2.05 0.040 .2904686 12.63177 lf | -.4092877 .2572157 -1.59 0.112 -.9134213 .0948459 nw | 42.49827 20.74973 2.05 0.041 1.829544 83.16699 _cons | 6.915517 11.3225 0.61 0.541 -15.27617 29.10721 ------------------------------------------------------------------------------ note: 0 failures and 1 success completely determined. ------------------------- End Stata probit results -------------------------- Note particularly the first message after the -probit- command. To be clear, including those 15 observations and having a large coefficient on d2 would produce EXACTLY the same likelihood and estimates of the other coefficients. It is just more numerically stable to exclude the d2 regressor and 15 perfectly predicted observations, and that is what Stata does. The short story is -- Stata takes care of almost all problems in the data for all official estimators. As a side note, we would like to at least partially defend the "inaccurate" results of the other estimation packages that did not always exclude the d2 regressor and perfectly predicted observations. We have already noted that the coefficient for d2 cannot be estimated by ML so what is important statistically is that (1) the other coefficients and their SEs are estimated well and that (2) we are not misled as to the effect of d2. McCullough and Vinod report Stokes as saying that other than the regressor for d2 all of the packages agreed completely on the estimated coefficients and their standard errors, so item (1) is met -- our inferences about these coefficients are NOT affected by the whether d2 is included. They also report that the coefficient estimates for d2 are large (ranging from 4.4 to 8.1) and that the standard errors are extremely large (ranging from 46 to 114.550). Looking specifically at the results reported in Madalla's text, the coefficient estimate is 4.63 and the SE is about 115. Note that the SEs are so large that we would never infer that coefficient estimate is significantly different from 0, yet the coefficient estimates are VERY large. Why do we say 4.63 is large? Remember that with probit we are normalized to a standard normal distribution for the latent variable and 4.63 is far into the tail of the standard normal, so 4.63 is large. If we were to look carefully at Madalla's estimates we would see that the point estimate of d2 is so large as to dominate the other regressors, but that we have no confidence in that estimate. So, we are also not likely to make invalid inference about the "perfect predictor" variable, d2, and thus item (2) is also met. Even so, the estimator is more stable when handled the way Stata's -probit- estimator works, particularly if there are many "perfect predictor" variables. Scott Merryman <smerryman@kc.rr.com> posted much more information from Stokes paper. While Stokes acknowledges Stata identifies the problem, Scott notes that Stokes finds our correction for "perfect predictor"s and correct estimation of the model 'strange and confusing'. If you followed the argument above, you should not find it confusing at all. Quoting Stokes, > [...] > All 15 such cases were detected by Stata. What is strange is that if > Stata really believed its own message, it would have either reported > nothing (the best decision) or reported the subset model in equation > (2). Just not reporting the D2 coefficient but leaving all other > coefficients in the model is a strange and confusing choice. > [...] We think Stokes finds Stata's ability to estimate the model confusing because of the common misconception that the probit model cannot be estimated by ML on such data. It can. Well, let's qualify that a bit, if you believe that infinity should be excluded from the solution set for any ML parameter estimate then perhaps technically what -probit- produces is not exaclty an ML estimate. Be assured, however, that the coefficients reported by Stata are consistent and the coverage rates are as good as those from probit estimates on data without perfect predictors. Stokes might have felt better if he had seen the 3 pages of [R] probit dedicated to this topic. ---------------------------------------------------------------------- OK, we feel better about probit, what about other official estimators? ---------------------------------------------------------------------- The developers at Stata are always interested in improving the convergence and stability of our official estimators. That is one reason why we were so interested in McCullough and Vinod's paper and in particular what went wrong with the user-programmed ML estimator. We asserted earlier that the official estimators are stable and robust to virtually all datasets. Why are we so confident? First, the developers at Stata have a lot of experience implementing ML estimators and other estimators requiring nonlinear optimization. We can often identify potential problem areas just by looking carefully at the likelihood or by trying examples that we might expect to be deceptive to the optimizer. Second, Stata users estimate literally millions of models and in the rare cases where they see unusual results, they report those to us. Sometimes there are datasets that produce degenerate likelihood surfaces and the structure that leads to the degenerate surface cannot be specifically identified before estimation. In almost all such cases, it is clear that there is a problem -- either standard errors are missing, or there are messages near the end of convergence about the function being non-concave. We use such examples to improve pre-evaluation of the data, as with "perfect predictors" in -probit-, improve the manual entry describing the estimator, or to create FAQs explaining what causes such cases. Here are just a few of the things done by official estimators to ensure stability and robustness: 1) Remove collinear variables. Done by all estimators. 2) Identify perfect predictors. Done by -logit- and -probit- 3) Get good starting values, because far enough from the solution that maximizes the criterion (usually log-likelihood) almost all response surfaces are dominated by "bumps and wiggles" resulting from the limited precision of digital computers. Done by all estimators. 4) Estimate the model in a transformed metric that is either more numerically stable and/or that constrains parameters to required ranges (e.g. positive variances, correlations between -1 and 1, etc.). Done by -intreg-, -heckman-, and many others. 5) Estimate the model using a concentrated log-likelihood. Done by -boxcox-, -svar-, and others. 6) Identify constant within panel covariates when they cannot be estimated by panel-data estimators. Done by fixed-effects and first-differenced estimators. 7) Perform only likelihood ratio test statistics when the Wald statistics are unreliable for the estimator. Done by -boxcox-. 8) Trap singular estimates of the covariance matrix of disturbances. Done by -reg3- , -sureg-, and others. 9) Use alternate optimization methods, some specific to the estimator. Done by -sureg-, -reg3-, -xtgls-, and others. 10) Adopt new convergence criteria to prevent premature convergence. Done by -arch-, -arima-, and others. There are many others. -- Vince -- David vwiggins@stata.com ddrukker@stata.com References Stokes, Houston. 2003. On the advantage of using two or more econometric software systems to solve the same problem. Journal of economic and social (forthcoming). <end> * * For searches and help try: * http://www.stata.com/support/faqs/res/findit.html * http://www.stata.com/support/statalist/faq * http://www.ats.ucla.edu/stat/stata/

- Prev by Date:
**st: courtesy** - Next by Date:
**st: rreg v.s qreg regression** - Previous by thread:
**Re: st: How much can we trust Stata's non-linear solver(s)?** - Next by thread:
**st: courtesy** - Index(es):

© Copyright 1996–2017 StataCorp LLC | Terms of use | Privacy | Contact us | What's new | Site index |