Stata The Stata listserver
[Date Prev][Date Next][Thread Prev][Thread Next][Date index][Thread index]

Re: st: How much can we trust Stata's non-linear solver(s)?

From (Vince Wiggins, StataCorp)
Subject   Re: st: How much can we trust Stata's non-linear solver(s)?
Date   Thu, 28 Aug 2003 09:57:46 -0500

 |Part 1|

Joachim Wagner <> has read McCullough and Vinod's
American Economic Review (AER) article and is now concerned about the results
he gets from statistical software.  There haven't been many followup questions
or responses, so perhaps everyone on the list just knows that Stata is doing
things right, or perhaps there is just not much interest.  As someone who does
optimization for a living, we find the issues interesting and have quite a bit
to say, so here is most of Joachim's question and a rather long response,

> at the weekend I read a paper by McCullough and Vinod on "Verifying
> the solution from a nonlinear solver: A case study" in the June 2003
> issue of the American Economic Review. For those of you who are not
> in economics, this is an absolute top-journal in our discipline, and
> the authors are considered to be top-experts in the field. I was
> shocked (really, believe me!) to learn how much can go wrong and
> goes wrong when one tries to find the ML solution of such day-to-day
> problems as the parameters of a probit model using a PC and several
> programs. The authors give examples, and they argue that "the
> researcher's job is not done when the program reports convergence -
> it is only beginning" (p. 876), and ask us to examine the gradient,
> inspect the solution path, evaluate the Hessian, including an
> eigensystem analysis, and profile the likelihood. Uff. Do I really
> have to learn how to do all this? Do I have to use three or more
> different packages to compare the solutions? Shall I demand this
> when I sit down to write my next referee report later this week?
> Must I be aware that the next report I receive will demand me to do
> all this?
> Unfortunately, the authors do not name horses and riders (as we say
> in Germany - in German, of course), so I have no idea whether Stata
> is among the programs one can trust or not. [...]

Joachim goes on to ask

>  From a slightly different perspective, would it be a good idea to
>  implement the steps suggested by the authors for a
>  "post-convergence" check in Stata ?
>  [...]

The VERY short answers

There are really two main issues taken up by McCullough and Vinod:

	1) You may not be able to trust such simple estimators as probit
           because a forthcoming article by Stokes (2003) got different
           results from 7 different software packages.  Of particular concern,
           the model and data were taken from a simple example in G.S.
           Madalla's Introduction to Econometrics text where the reported
           result was also "wrong".  (Note, this is from the data in table 8.4
           in the 3rd edition)

	2) You should extensively evaluate any maximum likelihood (ML) result
           because the authors could not reproduce the results from one
           user-written ML estimator of voter participation, the results of
           which were published in an earlier AER paper.

           They tried replication with several packages (though we infer not
           Stata).  Four packages were unable to maximize the likelihood, a
           5th was able to maximize to a likelihood that agreed to 11
           significant digits with the 6th, and author's "gold standard"
           result, was obtained using automated analytic derivatives and TSP.

           They go on to conclude that while the likelihood could be maximized
           using analytic derivatives, the combination of data and model had
           insufficient information to draw inferences, mainly because the
           condition number of the Hessian was too large and because two of
           the profile likelihoods were deemed insufficiently quadratic in

	   The authors suggest 4 steps to evaluate an estimate.

There was a larger issue about replication of results in the social sciences
and the difficulty in obtaining the replication data and programs, but that is
beyond this post.

Short answer to 1.

    Stata performs flawlessly on the probit example.  We call the "problem" in
    this data "perfect prediction".  It is treated correctly in Stata and has
    been for 12 years, it is discussed over 3 pages in [R] probit, and there
    are even options in -predict- after -probit- for handling such cases --
    see option -rules-.

Extending this answer to other official estimators

    Great care has been taken to ensure that Stata's official estimators have
    good optimization behaviour.  A few of the things that are routinely done
    include:  identify and correct problems in the data, get good starting
    values, transform the metric in which parameters are estimated, maximize
    concentrated likelihoods, use alternate optimization methods or alternate
    model realizations.  For almost all official estimators this is enough to
    ensure that your estimates are stable and robust to virtually all

    Even so, optimization can be hard and some likelihoods are particularly
    difficult.  For official estimators with difficult likelihoods, e.g.
    -heckman-, -arch-, -xtprobit-, -boxcox-, the difficulty is discussed in
    the manual entry and suggestions are made or tools are provided to help
    diagnose when convergence may not have been achieved, and to evaluate the
    stability of the estimates (e.g. -quadchk-).

Answer to 2 (OK, not that short)

    Writing ML estimators can be tricky and we encourage all those who write
    estimators to look carefully at their results.  The authors' four
    suggested checks strike me as a fine beginning with two and almost three
    of these easily automated by specifying options to -ml-.  More
    importantly, always simulate some data with known coefficients and run
    your estimator on it so you can check all of your computations.

    When we wrote the user-programmed ML estimator (estimating the parameters
    of a model of voter participation) in Stata (using method lf) and turned
    our estimator loose on the data graciously provided by Bruce McCullough,
    it marched straight to a solution in 8 iterations.  Given the difficulties
    encountered by the authors, we expected there to be many more iterations,
    some with "not concave" messages and possibly some missing or very large
    standard errors in the regression table, the latter indicative of a poorly
    conditioned covariance matrix.  We saw none of these things.

    The log-likelihood of our estimates agreed with those reported as the "gold
    standard" by the authors to "only" 8 significant digits.  This did not
    surprise or bother us -- we benchmark ML estimators on about 15 different
    platforms where there are differences in operating systems, math
    libraries, and CPUs.  We would not be happy with 8 digits, but that is
    because we control the algorithms.  The theoretically best that can be
    hoped for is to agree to 15 digits (sometimes 16), but that is never
    achieved.  Even changing from an Intel Celeron to an Intel Pentium will
    often lead to numbers that agree at "only" the 13th or 14th decimal point.
    With the current problem we are talking about different computers,
    different operating systems (we ran under Linux), different math libraries,
    and different optimization algorithms.  
    The important question is do the solutions support the same statistical
    inferences and that we did agree with the "gold standard"?  Both packages
    reported from 4 to 8 significant digits for the coefficient and standard
    error (SE) estimates.  When rounded to the number of digits reported, all
    of SEs agreed exactly and all but two of the coefficients agreed exactly,
    with the two coefficients still agreeing to 5 significant digits.

    McCullough and Vinod do not report SEs for any of the estimators because
    their eigenvalue analysis of the hessian could not clearly establish that
    the VCE was of full rank.  They are particularly concerned with the
    condition number of the hessian and what it implies about the accuracy of
    the eigenvalues and by inference about the rank of the VCE.  While a
    smaller condition number would be nice, our results indicate that the
    parameter and SE estimates for this model are just fine, which casts some
    doubt on the utility of using the condition number when deciding whether
    you have sufficient information for inference.  Two results are
    particularly telling,

    	B1. You can easily increase or decrease the condition number by
            scaling one or more variables in the model.  We tested scalings
            that decreased the condition number (this should be better) down
            to 2.4e5 from the original scalings 6.5e9 and also up to 5.2e10.
            In all cases the coefficients remained the same to the number of
            digits shown in output, of course they were scaled.  More
            importantly, the z-statistics and associated p-values remained the
            same regardless of the scaling and implied condition number.

	B2. We bootstrapped our ML estimator using 1000 replicates and
            importantly had no trouble with convergence in any of the 1000
            replicates.  Bootstrapped standard errors and confidence intervals
            were generally in agreement with their asymptotic analogues
            reported by the ML estimator.  They were different, and in two
            cases marginally significant regressors became marginally
            insignificant, but on the whole they were similar.

    McCullough and Vinod make 4 specific suggestions:

	C1. Examine the gradient -- is it zero?

	C2. Inspect the solution path (i.e. trace) -- does it exhibit the
	    expected rate of convergence?  <that is to say quadratic>

	C3. Evaluate the Hessian, including an eigensystem analysis -- is the
	    Hessian negative definite?  Is it well-conditioned?

	C4. Profile the likelihood to assess the adequacy of the quadratic

    These are all good suggestions, though there is no clear-cut way to
    establish a well-conditioned hessian.  That was the point of item B1
    There is also some overlap.  Suggestions C2 and C4 both have something to
    say about the shape of the hessian in the neighborhood of the solution,
    though C4 is more specific.

    We claimed earlier that two and almost three of these suggestions can be
    automated by Stata.  Specifying the -nrtolerance(#)- option on most
    official estimators or on -ml maximize- for user-written estimators 
    specifies that the hessian scaled gradient

               tol = gH g'

    be less than #.  This implies that the hessian must be positive definite
    when convergence is declared, because the full inverse is required to
    compute the tolerance.  It also implies that the gradient is 0 within our
    ability to define 0 using the shape of the likelihood -- the hessian.  We
    have been considering making -nrtolerance()- the default tolerance for
    Stata's official estimators.  A good value for # is 1e-6.

    Alternately, specifying the -gradient- option on most estimators or on -ml
    maximize- will cause -ml- to display the gradient at each iteration.

    Suggestion C3 can be obtained for most official estimators and all
    user-programmed estimators by typing -ml graph- after estimation.  A graph
    tracing the last 20 iterations of the log likelihood is show.  These
    values are also available in the matrix e(ilog).

    The profile likelihood, suggestion C4, currently requires coding a loop
    over the values of the coefficient to be profiled and using either the
    -constraint()- or -offset()- options to lock one of the coefficient
    values.  Profile likelihoods are just one of many ways to evaluate the
    validity of asymptoticly justified statistics, and one of the least
    direct.  More direct options include bootstrapping when evaluating
    performance on specific samples or Monte Carlo simulation when evaluating
    an estimator in general.

    What do we suggest?

    If you are using an official estimator:

        The asymptotic properties of the SEs and CIs reported by official
        estimators have already been established, otherwise Stata would not
        report them.  If you are concerned with the small-sample performance
        of an official estimator on your specific data, bootstrap and, if you
        are worried about asymmetry, look particularly at the percentile CI.

    If you are programming your own estimator:
        Simulate some data from you statistical model and use that to check
        your estimator.  This isn't really related to convergence, though you
        could use it to establish correct coverage for the null, but you can
        be much surer that you have coded you likelihood correctly.  Use the
        -nrtolerance()- option and look carefully at your iteration log.  If
        you have many "not concave" messages near the end of you iteration
        log, you may have insufficient information to estimate the model.  If
        you are worried that you data is insufficient to support asymptotic
        statistics, bootstrap.  If you are failing to converge, look into
	rescaling some of the variables.

<See Part 2 that is in another post>

-- Vince                      -- David  

*   For searches and help try:

© Copyright 1996–2017 StataCorp LLC   |   Terms of use   |   Privacy   |   Contact us   |   What's new   |   Site index