
Maximum-likelihood in Stata

19 transparencies were used during this talk. They are presented below.


Transparency 1

Stata's maximizers
-------------------

Stata has one maximizer:

    1.  Modified Marquardt Newton-Raphson with additional modifications.

Stata has four interfaces into that one maximizer:

    1.  -lf-, linear form, a derivative-free method in the sense that it
        calculates numerical derivatives for you so all you have to supply is
        the likelihood function.  The "linear form" part I will explain later;
        it turns out that -lf- is suitable for only certain types of
        likelihood functions.

    2.  -deriv0-, another derivative-free method -- meaning it calculates
        numerical derivatives and all you specify is the likelihood function.
        -deriv0- differs from -lf- in that it is suitable for any likelihood
        function.

    3.  -deriv1-, a first-derivative method in the sense that you specify 
        the likelihood function and the first derivatives and it calculates
        second derivatives numerically.  -deriv1- is also suitable for 
        any likelihood function.

    4.  -deriv2-, in which you specify the likelihood function, the first 
        derivatives, and the second derivatives.  All -deriv2- does is 
        maximize and -deriv2-, like -deriv0- and -deriv1-, is suitable for
        any likelihood function.

Here is a comparison of the four methods:

        +---------------------------------------------------------------+
        |                  The four methods rated by                    |
        |         Ease of use         Accuracy        Speed             |
        |---------------------------------------------------------------|
        | Best     -lf-               -deriv2-        -deriv2-    Best  |
        |  ^       -deriv0-           -lf-            -lf-         ^    |
        |  |       -deriv1-           -deriv1-        -deriv1-     |    |
        | Worst    -deriv2-           -deriv0-        -deriv0-    Worst |
        +---------------------------------------------------------------+

Transparency 2

The organization of -ml-
-------------------------

Explaining how -ml- is structured will take away much of the mystery:

    +----------------------------------+          Stata
    | You input program to calculate   |         +---------------------------+
    | f(b) and maybe g(b) and H(b)     |-------->|   ................        |
    +----------------------------------+         |   : your program :-----+  |
                                                 |   :..............:<--+ |  |
    +----------------------------------+         |                      | |  |
    | You type commands to tell -ml-   |         |   ................   | |  |
    | about your problem and program   |---------+-->: -ml-         :   | |  |
    +----------------------------------+         |   :              :   | |  |
                                                 |   :              :   | |  |
    +----------------------------------+         |   :              :   | |  |
    | You tell -ml- to obtain estimates|---------+-->: ml uses your :---+ |  |
    +----------------------------------+         |   : program      :<----+  |
                                                 |   :              :        |
    +----------------------------------+         |   :              :        |
    | You tell -ml- to display results |---------+-->:              :        |
    +----------------------------------+         |   : ml displays  :        |
                                                 |   : results      :        |
    /----------------------------------\ <-------+---:              :        |
    | Iteration 0: ...                 |         |   :..............:        |
    | Iteration 1: ...                 |         +---------------------------+
    |                                  |
    | such-and-such estimates          |
    | Log Likelihood = -26.84419       |
    |                                  |
    | Coef.   Std. Err.   z   ...      |
    | ----------------------------     |
    | -.104    .0516    -2.02          |
    |                                  |
    \----------------------------------/

Transparency 3

In terms of the example above, here is a probit model of foreign on mpg and
weight:

    +----------------------------------+
    | You input program to calculate   |
    | f(b) and maybe g(b) and H(b)     |
    +----------------------------------+
        . program define myll
                  local lnf "`1'"
                  local xb "`2'"
                  quietly replace `lnf' = ln(normprob(`xb')) if $S_mldepn==1
                  quietly replace `lnf' = ln(1-normprob(`xb')) if $S_mldepn==0
          end

    +----------------------------------+
    | You type commands to tell -ml-   |
    | about your problem and program   |
    +----------------------------------+
        . ml begin
        . ml function myll
        . ml method lf
        . eq myeq: foreign mpg weight
        . ml model b = myeq 
        . ml sample mysamp

    +----------------------------------+
    | You tell -ml- to obtain estimates|
    +----------------------------------+
        . ml maximize f V

    +----------------------------------+
    | You tell -ml- to display results |
    +----------------------------------+
        . ml post myest
        . ml mlout myest

Transparency 4

Here are the details of the internal structure of -ml-:

 ...........................................................................
 : -ml-                                                                    :
 :                        +----------------------<------------------+      :
 :                        |                                         |      :
 :  +----------+    +-----------+    +---------+    +--------+    +----+   :
 :  |setup, get|    |Get f(b),  |    |Calculate|    |Go some |    |    |   :
 :  |initial   |--->|g(b), H(b) |--->|direction|--->|distance|--->|b=b1|   :
 :  |values b  |    |  |        |    |vector d |    |along d |    |    |   :
 :  +----------+    +--|--------+    +---------+    |b1=b+s*d|    +----+   :
 :                     |                            +--------+             :
 :.....................|...................................................:
                       |
                       |       +-------------------+
                       |       | program you write |
                       +------>| Calculate f(b) and|---> (return to caller)
                               | maybe g(b), H(b)  |
                               +-------------------+

To maximize a likelihood function, you are required to write a Stata program
that, given values of the parameters b to be estimated, calculates the
log-likelihood function and, depending on the method chosen, perhaps the first
and second derivatives as well.  -ml- then calls your program when it needs
to.

The selected method (-lf-, -deriv0-, -deriv1-, or -deriv2-) modifies how -ml-
works.  All of these modifications occur in the "Get f(b), g(b), and H(b)"
box.  

Transparency 5

Blowup of "Get f(b), g(b), and H(b)" for method -deriv2-
--------------------------------------------------------

 .............................................................................
 : -ml-                                                                      :
 : +----------------+                                                        :
 : |Get             |                                                        :
 : |f = lnL(b;Y,X)  |                                                        :
 : |g = lnL'(b;Y,X) |                                                        :
 : |H = lnL''(b;Y,X)|                                                        :
 : +---|------------+                                                        :
 :.....|.....................................................................:
       |
       |                       +-------------------+
       |                       | program you write |          
       |                       | Calculate         |
       +---------------------->| f = lnL(b;Y,X)    |------> (return to caller)
                               | g = lnL'(b;Y,X)   |      
                               | H = lnL''(b;Y,X)  |     
                               +-------------------+
                      (receives b, Y, X)       (sends f, g, -H)

-ml-'s -deriv2- code is purely an optimizer, and the calculation of f, g, and
H is completely done in the program that you write.  (Yes, -deriv2- requires
that you return -H rather than H; more about that in future lectures.)


Blowup of "Get f(b), g(b), and H(b)" for method -deriv1-
--------------------------------------------------------

 .............................................................................
 : -ml-                                                                      :
 : +---------------+    +-----------+                                        :
 : |Get            |    |Calc.      |                                        :
 : |f = lnL(b;Y,X) |--->|H = g'()   |                                        :
 : |g = lnL'(b;Y,X)|    |numerically|                                        :
 : +---|-----------+    +--|--------+                                        :
 :.....|...................|.................................................:
       |                   |
       |  +----------------+
       |  |                    +-------------------+
       |  |                    | program you write |
       |  +------------------->| Calculate         |----> (return to caller)
       +---------------------->| f = lnL(b;Y,X)    |
                               | g = lnL'(b;Y,X)   |
                               +-------------------+
                     (receives b, Y, X)         (sends f, g)


-deriv1- makes your job a little easier; the program you write calculates the
log-likelihood and the first derivatives (the gradient vector).  -deriv1-
obtains H, the matrix of second derivatives, numerically.

Transparency 6

Blowup of "Get f(b), g(b), and H(b)" for method -deriv0-
---------------------------------------------------------

 .............................................................................
 : -ml-                                                                      :
 : +---------------+    +-----------+    +-----------+                       :
 : |Get            |    |Calc.      |    |Calc.      |                       :
 : |f = lnL(b;Y,X) |--->|g = f'()   |--->|H = f''()  |                       :
 : |   |           |    |numerically|    |numerically|                       :
 : +---|-----------+    +--|--------+    +--|--------+                       :
 :.....|...................|................|................................:
       |                   |                |
       |  +----------------+                |
       |  |  +------------------------------+
       |  |  |
       |  |  |                 +-------------------+
       |  |  +---------------->| program you write |
       |  +------------------->| Calculate         |----> (return to caller)
       +---------------------->| f = lnL(b;Y,X)    |
                               +-------------------+
                      (receives b, Y, X)         (sends f) 

-deriv0- requires your program to calculate only the log-likelihood.  It
obtains g, the gradient vector, and H, the matrix of second derivatives,
numerically.


Blowup of "Get f(b), g(b), and H(b)" for method -lf-
----------------------------------------------------

..............................................................................
: -ml-                                                                       :
: +-------+   +----------+   +-----------+   +-----------+   +-----------+   :
: |Calc.  |   |Get f_j = |   |Calc.      |   |Calc.      |   |Calc.      |   :
: |I_j =  |-->|f(y_j,I_j)|-->|f =        |-->|g = f'()   |-->|H = f''()  |   :
: |  x_j*b|   |  |       |   |  Sum f_j()|   |numerically|   |numerically|   :
: +-------+   +--|-------+   +-----------+   +--|--------+   +--|--------+   :
:................|..............................|...............|............:
                 |                              |               |
                 |  +---------------------------+               |
                 |  |  +----------------------------------------+
                 |  |  |
                 |  |  |         +-------------------+
                 |  |  +-------->| program you write |       
                 |  +----------->| Calculate         |----> (return to caller)
                 +-------------->| f_j = f(y_j, I_j) |                   
                                 +-------------------+
                      (receives y_j, I_j)        (sends f_j)
                               

-lf- looks similar to -deriv0- -- you have to look carefully to see the
differences.  

Transparency 7

The -lf- method
---------------

As I already mentioned, -lf- is my favorite but it places two restrictions 
on the form of the likelihood function.

First, -lf- requires that the physical observations in the dataset correspond
to independent pieces of the likelihood function.  That is, the general
problem is

        max   ln L(b;y,X)
         b

and -lf- requires that you be able to write ln L(b;y,X) as

        ln L(b;y,X)  =  f(b;y_1,x_1) + f(b;y_2,x_2) + ... + f(b;y_n,x_n)

where f(b;y_1,x_1) corresponds to the first observation in the dataset,
f(b;y_2,x_2) the second observation, and so on.  Actually, -lf- is willing to
let you exclude observations with -if <exp>- and -in <range>-.  The
requirement is really that f() not be a function of more than one observation;
that is, that the likelihood of one observation can be calculated
independently of all the other observations.

This restriction is commonly (but not always) met in statistical analyses.
It corresponds to 

        Pr(dataset) = Pr(obs 1) * Pr(obs 2) * ... * Pr(obs _N)

Anytime we have a likelihood function and we can write it down as 

        "The log-likelihood for the jth observation is f_j = ..."

and it is obvious that
                                                        N
        "The overall log-likelihood function is ln L = Sum f_j"
                                                       j=1

we have a likelihood meeting this restriction.
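
For instance, with the probit likelihood from Transparency 3 we can form the
per-observation pieces and add them up ourselves.  Here is a quick do-file
sketch, assuming the automobile data (foreign, mpg, and weight) and a made-up
trial value of b:

        gen double xb = -.1*mpg + .0004*weight - 1   /* x_j*b for a trial b */
        gen double fj = ln(normprob(xb))   if foreign==1
        replace fj    = ln(1-normprob(xb)) if foreign==0
        gen double lnL = sum(fj)                     /* running sum of f_j  */
        display "overall ln L = " lnL[_N]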

The following are examples of likelihoods that DO NOT meet this criterion:
the conditional (fixed-effects) logit model, Cox regression, and likelihoods
for panel data (i.e., cross-sectional time-series models).  Basically, any
likelihood that explicitly models independent groups rather than independent
observations does not meet the first -lf- criterion.

Transparency 8

The second restriction required by -lf-
---------------------------------------

The second restriction of -lf- is that the log-likelihood function for the jth
observation is not just any function of x_j and b but that it is a function of
x_j MULTIPLIED BY b (i.e., a linear form, hence the name -lf-):

        f_j = f(b;y_j,x_j) = f(y_j, x_j*b)

By "*" or "multiplied by" I mean inner product, 

        f_j = f(b;y_j,x_j) = f(y_j, x_j*b) 
                           = f(y_j, x_j1*b_1 + x_j2*b_2 + ... + x_jk*b_k)

This might be better emphasized by writing the log-likelihood function as 

        f_j = f(y_j,I_j)

where 

        I_j = x_j1*b_1 + x_j2*b_2 + ... + x_jk*b_k

In fact, this is exactly how you want to think about it because the program
that -lf- requires you to write corresponds exactly to f(y_j,I_j).  -ml-, in
effect, passes to your program two arguments, y_j and I_j -- these are scalars
such as the numbers 2 and 5 -- and expects you to return the corresponding f_j
value.

Actually, this second restriction is not much of a restriction, since -lf-
allows multiple linear forms.  More on this later in this lecture.

Transparency 9

              Approximate number of calls to the likelihood
           evaluation program required to obtain a direction,
          where k is the number of parameters being estimated
       (approximate because the search for the optimal h is not counted)


            k      lf         deriv0      deriv1        deriv2
        --------------------------------------------------------
            1       2              2           2             1
            2       2              6           4             1
            3       2             12           6             1
            4       2             20           8             1
            5       2             30          10             1
           10       2            110          20             1
           20       2            420          40             1
           40       2          1,640          80             1
           50       2          2,550         100             1
        --------------------------------------------------------

So, in terms of calls to the evaluation program, -lf- is faster than -deriv0-
and faster than -deriv1- once k>1.

Do not expect total run times to vary in proportion to these numbers, however,
for two reasons:

  1.  Forty parameters, estimated by -deriv0-, do not take 1,640/2 = 820 times
      longer than when estimated by -lf-.  Obtaining the derivatives is only
      one component of the maximization step, albeit an important one.  There
      is a component of -ml- that takes 820 times longer, but there are other
      components as well.

  2.  The table ignores the search for the optimal h and that, too, requires
      function calls.  As will be explained when we discuss accuracy, we
      invest considerably more in the h calculation for -lf- than in either
      -deriv0- or -deriv1-.


Transparency 10

Numerical root finding
----------------------

In one dimension (b and g(b) both scalars), one method of finding roots 
is Newton's method.  This is an iterative technique.  You have a guess b0 
(called the initial value) and you update that as follows:

                        line has slope
   g(b)                 g'(b0)=dg/db at b0     Newton's Method:
     |                     /                   -----------------------
     |                   ./                    1.  current guess is b0.
     |  dots are g(b)   ./     line goes       2.  calculate g(b0).
     | (connect them)  ./  <-- through point   3.  calculate g'(b0).
     |               . /|      (b0,g(b0))      4.  draw line through point
     |             .  / |                          (b0,g(b0)), slope g'(b0).
     |          .    /  |                      5.  New guess is b1, point
     |       .      /   |                          where line is zero.
     +-----.-------+----+------- b             6.  Repeat.
     |   .        /b1   b0
     | .         / 

      Formula for line:  g'(b0)*(b -b0) + g(b0)                     g(b0)
      line at b1 == 0 :  g'(b0)*(b1-b0) + g(b0) = 0  =>  b1 = b0 - ------
                                                                   g'(b0)

Now you have a new value b1 and you can repeat the process, 

                 g(b1)
      b2 = b1 - ------
                g'(b1)

and so on.  The sequence is not guaranteed to converge but it generally does.

Newton's Method for finding roots can be converted to finding minimums and 
maximums by substituting f'() for g():

        To find b such that f(b) is maximized, 
             1.  start with a guess b0.              f'(b0)
             2.  calculate a new guess b1, b1 = b0 - ------
             3.  repeat.                             f''(b0)
        
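To make the scalar recipe concrete, here is a small do-file sketch that
maximizes f(b) = ln(b) - b, for which f'(b) = 1/b - 1 and f''(b) = -1/(b*b),
so the maximum is at b = 1 (it assumes no variable named b is in memory):

        scalar b = .5                    /* initial guess b0                 */
        local i = 1
        while `i' <= 6 {
                scalar b = b - (1/b - 1)/(-1/(b*b))   /* b1 = b0 - f'/f''    */
                display "iteration `i':  b = " b
                local i = `i' + 1
        }

Each pass through the loop is one Newton step; after a handful of iterations
b is essentially 1.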

We can generalize this to allow b to be a vector:

        To find vector b such that f(b) is maximized,
            1.  start with a guess b0.
            2.  calculate a new guess b1, b1 = b0 - g*H^(-1), where g is
                the gradient vector and H the matrix of second derivatives
                (also known as the Hessian).
            3.  repeat.

Transparency 11

Numerical derivatives
---------------------

This maximizer that I have sketched has substantial analytic requirements:

     In addition to specifying f(b) = ln L(b;X), the routine needs to be 
     able to calculate g(b) = df/db and H(b) = d2f/db2.  It needs the 
     function, its first derivatives, and its second derivatives.

Stata provides facilities so that you do not actually have to program the
derivative calculations but really, Stata has only one optimizer.  When you do
not provide the derivatives, Stata calculates them numerically.

The definition of an analytic derivative is 
         
                  df |           f(x0+h) - f(x0)
        f'(x0) =  -- |   = lim   ---------------
                  dx |     h->0         h
                     |x0

and that leads to a simple approximation formula, 

                                  f(x0+h) - f(x0)
        f'(x0) is approximately   ---------------   for a suitably small
                                         h          but large enough h.

I have a lot to say about how Stata finds a suitably small but large enough h,
and about variations on this formula, but put all that aside.  The fact that
Stata uses a formula like [f(x0+h)-f(x0)]/h has an important implication for
your program that calculates f(), the log likelihood function.

If Stata chooses h just right, in general this formula will be good to only
about half as many digits as the routine that calculates f().  (It won't be
less accurate than that, and it might be more accurate.)  The loss of accuracy
comes about because of the subtraction in the numerator.

Transparency 12

Let me show you:  let's consider calculating the derivative of exp(x) at x=2.
The true answer is exp(2) because d(exp(x))/dx = exp(x).  Let's try this
formula with h=1e-8 and I will carefully do this calculation with 16 digits of
accuracy:

        exp(2 + 1e-8)      =  7.389056172821211   (accuracy is 16 digits)
        exp(2)             =  7.389056098930650   (accuracy is 16 digits)
        ---------------------------------------
        difference         =   .000000073890561

        exp(2)             =  7.389056098930650   (true answer)
        difference / 1e-8  =  7.389056033701991   (approximation formula)
        ---------------------------------------
        error                  .000000065228659   approximation is correct to
                              1 2345678           8 digits


The major source of error was introduced when we calculated the difference 
exp(2 + 1e-8) - exp(2).  

        exp(2 + 1e-8)      =  7.389056172821211   (accuracy is 16 digits)
        exp(2)             =  7.389056098930650   (accuracy is 16 digits)
        ---------------------------------------
        difference         =   .000000073890561
                                       12345678   (accuracy is 8 digits)

Half of our 16 digits of accuracy were lost because half the digits of
exp(2 + 1e-8) and exp(2) were in common and so canceled.

This is an unpleasant feature of numerical derivatives.  Done right, if you
start with k digits of accuracy, you will fall to k/2 digits at worst.  (You
can get lucky, the best case being when f(x0)=0 in which case you can get all
k digits back, ignoring the inaccuracy introduced by the algorithm itself.)

Thus, as a programmer of likelihood functions, if you are going to use
numerical derivatives, it is vitally important that you supply ln L() as
accurately as you can.  This means all your -generate- statements should
explicitly specify double:

        . gen double ... = ...
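
To see why this matters, here is a quick do-file sketch, assuming the
automobile data so that mpg exists.  The default storage type for -generate-
is float, which carries only about 7 digits:

        gen xfloat = mpg/7               /* float (the default): ~7 digits  */
        gen double xdouble = mpg/7       /* double:             ~16 digits  */
        format xfloat xdouble %20.17f
        list xfloat xdouble in 1/3

Listing the two variables side by side shows the float version going wrong
after about the 7th significant digit.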


Transparency 13

The other issue in calculating numerical derivatives is choosing h.  Stata
does that for you.  Just so that you are properly appreciative, let's consider
that problem:

      If Stata chooses a value for h that is too small, f(x0+h) will
      numerically equal f(x0) and then the approximation to f'(x0) will be
      zero.

      Values larger than that, but still too small, result in poor accuracy
      due to the numerical roundoff error associated with subtraction that we
      have just seen.  If f(x0+h) is very nearly equal to f(x0), nearly all
      the digits are in common and the difference has few significant digits.
      For instance, say f(x) and f(x+h) differ only in the 15th and 16th
      digits:

            f(x+h)             =  x.xxxxxxxxxxxxxab
            f(x)               =  x.xxxxxxxxxxxxxcd
            ---------------------------------------
            difference         =  0.0000000000000ef   (2 digits of accuracy)

      Even worse, the 16 digit numbers shown might be accurate to only 14
      digits themselves.  In that case, the difference f(x+h)-f(x) would be
      purely noise.

      On the other hand, if Stata chooses a value for h that is too large, the
      approximation formula itself will not estimate the derivative
      accurately.

Transparency 14

Choosing the right value of h is of great importance.  Let me show you.  A
minute ago we considered numerically evaluating the derivative of exp(x) at
x=2.  Below I do that and calculate the error for a variety of h's:

        . drop _all

        . input h

                     h  
          1. 1e-20
          2. 1e-19
          3. 1e-18
          <output omitted>
         19. 1e-2
         20. 1e-1
         21. 1
         22. end

        . gen double approx  = (exp(2+h)-exp(2))/h

        . gen double error = exp(2) - approx

        . gen double relerr = error/exp(2)

        . list

                     h      approx       error      relerr  
          1.  1.00e-20           0   7.3890561           1  
          2.  1.00e-19           0   7.3890561           1  
          3.  1.00e-18           0   7.3890561           1  
          4.  1.00e-17           0   7.3890561           1  
          5.  1.00e-16           0   7.3890561           1  
          6.  1.00e-15   6.2172489   1.1718072   .15858686  
          7.  1.00e-14   7.5495167   -.1604606  -.02171598  
          8.  1.00e-13   7.3807628    .0082933   .00112238  
          9.  1.00e-12   7.3896445  -.00058838  -.00007963  
         10.  1.00e-11   7.3890228   .00003334   4.512e-06  
         11.  1.00e-10   7.3890582  -2.057e-06  -2.783e-07  
         12.  1.00e-09   7.3890567  -5.878e-07  -7.956e-08  
         13.  1.00e-08   7.3890561   2.032e-08   2.750e-09  
         14.  1.00e-07   7.3890565  -3.636e-07  -4.920e-08  
         15.  1.00e-06   7.3890598  -3.694e-06  -5.000e-07  
         16.  1.00e-05    7.389093  -.00003695  -5.000e-06  
         17.     .0001   7.3894256  -.00036947     -.00005  
         18.      .001   7.3927519  -.00369576  -.00050017  
         19.       .01   7.4261248  -.03706874  -.00501671  
         20.        .1   7.7711381  -.38208204  -.05170918  
         21.         1   12.696481  -5.3074247  -.71828183  

Transparency 15

In comparing these alternative h's, you should focus on the relerr column.
The minimum relative error in this experiment is recorded in observation 13, 
h = 1e-8.

So, should h be 1e-8?  Not in general.  1e-8 is a good choice for calculating
the derivative of exp() at 2, but at some other location, a different value of
h would be best.  Even if you map out the entire exp() function, you will not
have solved the problem.  Change the function and you change the best value of
h.

J. C. Nash (1990, 219)

    Compact Numerical Methods for Computers:  Linear Algebra and Function
    Minimisation, 2d ed., New York: Adam Hilger,

suggested that one choose h so that x+h differs from x in at least half of its
digits (the least significant ones).  In particular, Nash suggested the
formula

        h(x) = e*(x+e)

where e is the square root of machine precision.  Machine precision is roughly
1e-16 for double precision and therefore h(2) = 2e-8.  This way of dynamically
adjusting h to the value of x works pretty well.
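
In Stata terms, Nash's rule is just a line of arithmetic; a sketch, using the
rough value 1e-16 for machine precision:

        scalar e = sqrt(1e-16)           /* square root of machine precision */
        scalar h = e*(2 + e)             /* Nash's h(x) evaluated at x = 2   */
        display h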

Bill Sribney and I worked on this problem and came to the conclusion that
Nash's suggested solution could be improved upon.  There are two issues that
require balancing:

    1)  If the calculation were carried out in infinite precision, the smaller
        h is, the more accurately [f(x+h)-f(x)]/h would approximate the
        derivative.

    2)  Taking finite precision into consideration, the closer the values
        f(x+h) and f(x) are, the fewer significant digits there are in the
        calculation f(x+h)-f(x).

Thus, we use a variation on Nash's suggestion and control the numerator:

        f(x+h) and f(x) should differ in about half of their digits.

That is, we adopt the rule of setting h as small as possible subject to the
constraint that f(x+h)-f(x) will be calculated to at least half accuracy.

This is a computationally expensive rule to implement because, each time a
numerical derivative is to be calculated, the program must search for the
optimal value of h, meaning that f(x+h) must be calculated for each trial
value.  The payoff, however, is that -ml- is remarkably robust.  More than
once, users have told us of a miserable experience trying to maximize a
function with some other package, groping around for just the right starting
values.  They then tried the same function using -ml- and had no problem, no
matter what the starting values were.

Transparency 16

Why, they wondered.  Well, now you know.  It is the numerical derivative
calculation.  The cost of this robustness, however, is that -ml- uses
more computer time to get to the result.

-ml- also uses a centered derivative calculation.  Rather than using 

                                  f(x0+h) - f(x0)
        f'(x0) is approximately   ---------------
                                         h

-ml- uses

                                  f(x0+h/2) - f(x0-h/2)
        f'(x0) is approximately   ---------------------
                                            h

This also increases the computational time required but results in an
important improvement in accuracy.
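
For instance, redoing the exp(2) derivative calculation from Transparency 12
with both formulas, with h = 1e-5 chosen purely for illustration, shows the
centered formula making a noticeably smaller error:

        scalar h = 1e-5
        scalar onesided = (exp(2+h)   - exp(2))/h
        scalar centered = (exp(2+h/2) - exp(2-h/2))/h
        display "one-sided error = " exp(2) - onesided
        display "centered  error = " exp(2) - centered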

Transparency 17

Numerical second derivatives and the accuracy of estimated variance
-------------------------------------------------------------------

Remember that Newton-Raphson not only requires f'(x0), it requires f''(x0), 
too, the second derivatives.  We use the same method for calculating this
as we use for calculating the first derivatives, 

                                   f'(x0+h/2) - f'(x0-h/2)
        f''(x0) is approximately   -----------------------
                                              h

So, if you make the substitution for f'(x0), you obtain:

                                   f(x0+h) - 2*f(x0) + f(x0-h)
        f''(x0) is approximately   ---------------------------
                                             h*h

I want you to think about the accuracy of this.  If f() is calculated to 16
digits, f'() will be accurate to 16/2 digits (at worst).  If f'() is accurate
to 8 digits, f''() will be accurate to 8/2 digits (at worst).  Thus, f''() is
accurate to 4 digits (at worst).

So that is the accuracy of your standard errors.  If you start with 16 digits
of accuracy, they will be good to 4 digits -- maybe more if you are lucky --
but maybe a bit less due to the fact that the approximation formula for the
calculation would still be only an approximation if executed at infinite
precision.  Figure 4 digits.  If your routine that calculates the likelihood
function is good to 12 digits, your standard errors are good to 3 digits.  If
your routine is good to 8 digits, your standard errors are good to 2 digits.

If you want more accuracy than that, you must code the analytic derivatives.
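
You can see the effect with the exp() example once more.  Even though exp()
itself is good to about 16 digits, the three-point formula loses digits
quickly; here is a sketch with h = 1e-6, chosen to make the roundoff visible:

        scalar h = 1e-6
        scalar approx = (exp(2+h) - 2*exp(2) + exp(2-h))/(h*h)
        display "approximation  = " approx
        display "true value     = " exp(2)
        display "relative error = " (exp(2) - approx)/exp(2)

The numerator is the difference of three nearly equal 16-digit numbers, so
only a few of its digits are meaningful, and the result is typically good to
just a handful of digits.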

Transparency 18

The -ml- method -lf- checklist
------------------------------

  1.  Write your likelihood-evaluation program.  In what follows I will 
      assume you name it -myll-.  The outline for the program is:

                program define myll
                        local f "`1'"
                        local I "`2'"
      optional ->       local I2 "`3'"
                        ...

      optional ->       parse "$S_mldepn", parse(" ")
                        local y1 "`1'"
                        local y2 "`2'" 
                        ...

       optional ->      tempvar work1 work2 ...
                        gen double `work1' = ...
                        gen double `work2' = ...

                        replace `f' = ...
                end


  2.  Define your equations:
                . eq myeq: ...
                . eq myeq2: ...         <- optional
                . ...

  3.  Give the -ml- commands to set up the problem:

                . ml begin
                . ml function myll
                . ml method lf
                . ml model b = myeq [myeq2 ...] [, depv(...) cons(...)]
                . ml sample mysamp [if <exp>]

  4.  Search for initial values (optional)

                . ml search, limits(myeq # # [myeq2 # # ...])

  5.  Perform the optimization

                . ml maximize f V

  6.  Post the results 

                . ml post myest
                . ml mlout myest

Transparency 19

The -ml- method -lf- do-file
----------------------------

I use do-files when I have to work with -ml-.  There are just too many places
where I can make some silly mistake.  Although in most cases I can just back
up and fix it, I do not really feel comfortable unless the whole thing runs
without error from start to finish.

So here is how I organize my do-file:

        ----------------------------- top --- sample do-file --------
        capture log close
        log using somefile, replace
        program drop _all
        set more off

        clear
        use such-and-such            /* load data              */
        eq myeq : ...                /* define equations first */
        eq ...

        program define myll          /* define program         */
                ...
        end

        ml begin
        ml function myll
        ml method lf
        ml model b = myeq ...
        ml sample mysamp 

        ml search, limits(<#> <#> or myeq # # ...)
        
        ml maximize f V
        ml post myest
        ml mlout myest
        log close
        ----------------------------- end --- sample do-file --------

Here is what I would like you to notice about my do-file:

    1.  When I run it, I am certain that there is no remnant of some previous
        problem lying around that will cause problems and confuse me.  I
        -program drop _all-.  I -clear- (which gets rid of any old equations).
        I reload the data on which I intend to run.

    2.  I set the specific problem to be estimated at the top and after
        that include the -program define- and -ml- code.  This way, I can
        easily change the specific model (say, mpg on weight and displ).
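
Putting the pieces together, here is roughly what such a do-file looks like
for the probit example from Transparency 3.  It is a sketch that assumes the
automobile data (foreign, mpg, and weight) and omits the optional -ml search-
step:

        ----------------------------- top --- probit do-file --------
        capture log close
        log using myprobit, replace
        program drop _all
        set more off

        clear
        use auto                       /* load the automobile data    */
        eq myeq : foreign mpg weight   /* define the equation first   */

        program define myll            /* probit log likelihood       */
                local lnf "`1'"
                local xb "`2'"
                quietly replace `lnf' = ln(normprob(`xb'))   if $S_mldepn==1
                quietly replace `lnf' = ln(1-normprob(`xb')) if $S_mldepn==0
        end

        ml begin
        ml function myll
        ml method lf
        ml model b = myeq
        ml sample mysamp

        ml maximize f V
        ml post myest
        ml mlout myest
        log close
        ----------------------------- end --- probit do-file --------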