Home / Resources & support / Users Group meetings / 1997 UK Stata Users Group meeting / Maximum-likelihood in Stata

19 transparencies were used during this talk. They are presented below.

Stata's maximizers ------------------- Stata has one maximizer: 1. Modified Marquadt Newton-Raphson with additional modifications. Stata has four interfaces into that one maximizer: 1. -lf-, linear form, a derivative-free method in the sense that it calculates numerical derivatives for you so all you have to supply is the likelihood function. The "linear form" part I will explain later; it turns out that -lf- is suitable for only certain types of likelihood functions. 2. -deriv0-, another derivative-free method -- meaning it calculates numerical derivatives and all you specify is the likelihood function. -deriv0- differs from -lf- in that it is suitable for any likelihood function. 3. -deriv1-, a first-derivative method in the sense that you specify the likelihood function and the first derivatives and it calculates second derivatives numerically. -deriv1- is also suitable for any likelihood function. 4. -deriv2-, in which you specify the likelihood function, the first derivatives, and the second derivatives. All -deriv2- does is maximize and -deriv2-, like -deriv0- and -deriv1-, is suitable for any likelihood function. Here is a comparison of the four methods: +---------------------------------------------------------------+ | The four methods rated by | | Ease of use Accuracy Speed | |---------------------------------------------------------------| | Best -lf- -deriv2- -deriv2- Best | | ^ -deriv0- -lf- -lf- ^ | | | -deriv1- -deriv1- -deriv1- | | | Worst -deriv2- -deriv0- -deriv0- Worst | +---------------------------------------------------------------+

The organization of -ml- ------------------------- Explaining how -ml- is structured will take away much of the mystery: +----------------------------------+ Stata | You input program to calculate | +---------------------------+ | f(b) and maybe g(b) and H(b) |-------->| ................ | +----------------------------------+ | : your program :-----+ | | :..............:<--+ | | +----------------------------------+ | | | | | You type commands to tell -ml- | | ................ | | | | about your problem and program |---------+-->: -ml- : | | | +----------------------------------+ | : : | | | | : : | | | +----------------------------------+ | : : | | | | You tell -ml- to obtain estimates|---------+-->: ml uses your :---+ | | +----------------------------------+ | : program :<----+ | | : : | +----------------------------------+ | : : | | You tell -ml- to display results |---------+-->: : | +----------------------------------+ | : ml displays : | | : results : | /----------------------------------\ <-------+---: : | | Iteration 0: ... | | :..............: | | Iteration 1: ... | +---------------------------+ | | | such-and-such estimates | | Log Likelihood = -26.84419 | | | | Coef. Std. Err. z ... | | ---------------------------- | | -.104 .0516 -2.02 | | | \----------------------------------/

In terms of the example above: +----------------------------------+ | You input program to calculate | | f(b) and maybe g(b) and H(b) | +----------------------------------+ . program define myll local lnf "`1'" local xb "`2'" quietly replace `lnf' = ln(normprob(`xb')) if $S_mldepn==1 quietly replace `lnf' = ln(1-normprob(`xb')) if $S_mldepn==0 end +----------------------------------+ | You type commands to tell -ml- | | about your problem and program | +----------------------------------+ . ml begin . ml function myll . ml method lf . eq myeq: foreign mpg weight . ml model b = myeq . ml sample mysamp +----------------------------------+ | You tell -ml- to obtain estimates| +----------------------------------+ . ml maximize f V +----------------------------------+ | You tell -ml- to display results | +----------------------------------+ . ml post myest . ml mlout myest

Here are the details of the internal structure of -ml-: ........................................................................... : -ml- : : +----------------------<------------------+ : : | | : : +----------+ +-----------+ +---------+ +--------+ +----+ : : |setup, get| |Get f(b), | |Calculate| |Go some | | | : : |initial |--->|g(b), H(b) |--->|direction|--->|distance|--->|b=b1| : : |values b | | | | |vector d | |along d | | | : : +----------+ +--|--------+ +---------+ |b1=b+s*d| +----+ : : | +--------+ : :.....................|...................................................: | | +-------------------+ | | program you write | +------>| Calculate f(b) and|---> (return to caller) | maybe g(b), H(b) | +-------------------+ To maximize a likelihood function, you are required to write a Stata program that, given values of the parameters b to be estimated, calculates the log-likelihood function and, depending on the method chosen, perhaps the first and second derivatives as well. -ml- then calls your program when it needs to. The selected method (-lf-, -deriv0-, -deriv1-, or -deriv2-) modifies how -ml- works. All of these modifications occur in the "Get f(b), g(b), and H(b)" box.

Blowup of "Get f(b), g(b), and H(b)" for method -deriv2- -------------------------------------------------------- ............................................................................. : -ml- : : +----------------+ : : |Get | : : |f = lnL(b;Y,X) | : : |g = lnL'(b;Y,X) | : : |H = lnL''(b;Y,X)| : : +---|------------+ : :.....|.....................................................................: | | +-------------------+ | | program you write | | | Calculate | +---------------------->| f = lnL(b;Y,X) |------> (return to caller) | g = lnL'(b;Y,X) | | H = lnL''(b;Y,X) | +-------------------+ (receives b, Y, X) (sends f, g, -H) -ml-'s -deriv2- code is purely an optimizer, and the calculation of f, g, and H is completely done in the program that you write. (Yes, -deriv2- requires that you return -H rather than H; more about that in future lectures.) Blowup of "Get f(b), g(b), and H(b)" for method -deriv1- -------------------------------------------------------- ............................................................................. : -ml- : : +---------------+ +-----------+ : : |Get | |Calc. | : : |f = lnL(b;Y,X) |--->|H = g'() | : : |g = lnL'(b;Y,X)| |numerically| : : +---|-----------+ +--|--------+ : :.....|...................|.................................................: | | | +----------------+ | | +-------------------+ | | | program you write | | +------------------->| Calculate |----> (return to caller) +---------------------->| f = lnL(b;Y,X) | | g = lnL'(b;Y,X) | +-------------------+ (receives b, Y, X) (sends f, g) -deriv1- makes your job a little easier; the program you write calculates the log-likelihood and the first derivatives (gradient vector). -deriv1- obtains H, the matrix of second derivatives numerically.

Blowup of "Get f(b), g(b), and H(b)" for method -deriv0- --------------------------------------------------------- ............................................................................. : -ml- : : +---------------+ +-----------+ +-----------+ : : |Get | |Calc. | |Calc. | : : |f = lnL(b;Y,X) |--->|g = f'() |--->|H = f''() | : : | | | |numerically| |numerically| : : +---|-----------+ +--|--------+ +--|--------+ : :.....|...................|................|................................: | | | | +----------------+ | | | +------------------------------+ | | | | | | +-------------------+ | | +---------------->| program you write | | +------------------->| Calculate |----> (return to caller) +---------------------->| f = lnL(b;Y,X) | +-------------------+ (receives b, Y, X) (sends f) -deriv0- requires your program to calculate only the log-likelihood. It obtains g, the gradient vector, and H, the matrix of second derivatives, numerically. Blowup of "Get f(b), g(b), and H(b)" for method -lf- ---------------------------------------------------- .............................................................................. : -ml- : : +-------+ +----------+ +-----------+ +-----------+ +-----------+ : : |Calc. | |Get f_j = | |Calc. | |Calc. | |Calc. | : : |I_j = |-->|f(y_j,I_j)|-->|f = |-->|g = f'() |-->|H = f''() | : : | x_j*b| | | | | Sum f_j()| |numerically| |numerically| : : +-------+ +--|-------+ +-----------+ +--|--------+ +--|--------+ : :................|..............................|...............|............: | | | | +---------------------------+ | | | +----------------------------------------+ | | | | | | +-------------------+ | | +-------->| program you write | | +----------->| Calculate |----> (return to caller) +-------------->| f_j = f(y_j, I_j) | +-------------------+ (receives y_j, I_j) (sends f_j) -lf- looks similar to -deriv0- -- you have to look carefully to see the differences.

The -lf- method --------------- As I already mentioned, -lf- is my favorite but it places two restrictions on the form of the likelihood function. First, -lf- requires that the physical observations in the dataset correspond to independent pieces of the likelihood function. That is, the general problem is max ln L(b;y,X) b and -lf- requires that you be able to write ln L(b;X) as ln L(b;X) = f(b;y_1,x_1) + f(b;y_2,x_2) + ... + f(b;y_n,x_n) where f(b;y_1,x_1) corresponds to the first observation in the dataset, f(b;y_2,x_2) the second observation, and so on. Actually, -lf- is willing to let you exclude observations with -if <exp>- and -in <range>-. The requirement is really that f() not be a function of more than one observation; that the likelihood of one observation can be calculated independently of all the other observations. This restriction is commonly (but not always) met in statistical analyses. It corresponds to Pr(dataset) = Pr(obs 1) * Pr(obs 2) * ... * Pr(obs _N) Anytime we have a likelihood function and we can write it down as "The log-likelihood for the jth observation is f_j = ..." and it is obvious that N "The overall log-likelihood function is ln L = Sum f_j" j=1 we have a likelihood meeting this restriction. The following are examples of likelihoods that DO NOT meet this criterion: the conditional (fixed-effects) logit model, Cox regression, and likelihoods for panel data (i.e., cross-sectional time-series models). Basically, any likelihood that explicitly models independent groups rather than independent observations does not meet the first -lf- criterion.

The second restriction required by -lf- --------------------------------------- The second restriction of -lf- is that the log-likelihood function for the jth observation is not just any function of x_j and b but that it is a function of x_j MULTIPLIED BY b (i.e., a linear form, hence the name -lf-): f_j = f(b;y_j,x_j) = f(y_j, x_j*b) By "*" or "multiplied by" I mean inner product, f_j = f(b;y_j,x_j) = f(y_j, x_j*b) = f(y_j, x_j1*b_1 + x_j2*b_2 + ... + x_jk*b_k) This might be better emphasized by writing the log-likelihood function as f_j = f(y_j,I_j) where I_j = x_j1*b1 + x_j2*b2 + ... + x_jk*b_k In fact, this is exactly how you want to think about it because the program -lf- requires you to write exactly corresponds to f(y_j,I_j). -ml-, in effect, passes to your program two arguments, y_j and I_j -- these are scalars such as the numbers 2 and 5 -- and expects you to return the corresponding f_j value. Actually, this second restriction is not much of a restriction, since -lf- allows multiple linear forms. More on this later in this lecture.

Approximate number of calls to the likelihood evaluation program required to obtain direction (approximate because search for optimal h is not counted) k lf deriv0 deriv1 deriv2 -------------------------------------------------------- 1 2 2 2 1 2 2 6 4 1 3 2 12 6 1 4 2 20 8 1 5 2 30 10 1 10 2 110 20 1 20 2 420 40 1 40 2 1,640 80 1 50 2 2,550 100 1 -------------------------------------------------------- So, -lf- is faster than -deriv0- and faster than -deriv1- once k>2. Do not expect total run times to vary in proportion the numbers, however, for two reasons: 1. Forty parameters, estimated by -deriv0-, do not take 1,640/2 = 820 times longer than when estimated by -lf-. Obtaining the derivatives is only one component of maximization step, albeit an important one. There is a component of -ml- that takes 820 times longer, but there are other components as well. 2. The table ignores the search for the optimal h and that, too, requires function calls. As well be explained when we discuss accuracy, we invest considerably more in the h calculation for -lf- than in either -deriv0- or -deriv1-.

Numerical root finding ---------------------- In one dimension (b and g(b) both scalars), one method of finding roots is Newton's method. This is an iterative technique. You have a guess b0 (called the initial value) and you update that as follows: line has slope g(b) g'(b0)=dg/db at b0 Newton's Method: | / ----------------------- | ./ 1. current guess is b0. | dots are g(b) ./ lines goes 2. calculate g(b0). | (connect them) ./ <-- through point 3. calculate g'(b0). | . /| (b0,g(b0)) 4. draw line through point | . / | (b0,g(b0)), slope g'(b0). | . / | 5. New guess is b1, point | . / | where line is zero. +-----.-------+----+------- b 6. Repeat. | . /b1 b0 | . / Formula for line: g'(b0)*(b -b0) + g(b0) g(b0) line at b1 == 0 : g'(b0)*(b1-b0) + g(b0) = 0 => b1 = b0 - ------ g'(b0) Now you have a new value b1 and you can repeat the process, g(b1) b2 = b1 - ------ g'(b1) and so on. The sequence is not guaranteed to converge but it generally does. Newton's Method for finding roots can be converted to finding minimums and maximums by substituting f'() for g(): To find b such that f(b) is maximized, 1. start with a guess b0. f'(b0) 2. calculate a new guess b1, b1 = b0 - ------ 3. repeat. f''(b0) We can generalize this to allow b to be a vector: To find vector b such that f(b) is maximized, 1. start with a guess b0. -1 2. calculate a new guess b1, b1 = b0 - g H , where g is the gradient vector and H the matrix of second derivatives (also known as the Hessian). 3. repeat.

Numerical derivatives --------------------- This maximizer that I have sketched has substantial analytic requirements: In addition to specifying f(b) = ln L(b;X), the routine needs to be able to calculate g(b) = df/db and H(b) = d2f/db2. It needs the function, its first derivatives, and its second derivatives. Stata provides facilities so that you do not actually have to program the derivative calculations but really, Stata has only one optimizer. When you do not provide the derivatives, Stata calculates them numerically. The definition of an analytic derivative is df | f(x0+h) - f(x0) f'(x0) = -- | = lim --------------- dx | h->0 h |x0 and that leads to a simple approximation formula, f(x0+h) - f(x0) f'(x0) is approximately --------------- for a suitably small h but large enough h. I have a lot to say about how Stata finds a suitably small but large enough h, and about variations on this formula, but put all that aside. The fact that Stata uses a formula like [f(x0+h)-f(x0)]/h has an important implication for your program that calculates f(), the log likelihood function. If Stata chooses h just right, in general this formula will only be about half as accurate as the routine that calculates f(). (It won't be less accurate than that and it might be more accurate). The loss of accuracy comes about because of the subtraction in the numerator.

Let me show you: let's consider calculating the derivative of exp(x) at x=2. The true answer is exp(2) because d(exp(x))/dx = exp(x). Let's try this formula with h=1e-8 and I will carefully do this calculation with 16 digits of accuracy: exp(2 + 1e-8) = 7.389056172821211 (accuracy is 16 digits) exp(2) = 7.389056098930650 (accuracy is 16 digits) --------------------------------------- difference = .000000073890561 exp(2) = 7.389056098930650 (true answer) difference / 1e-8 = 7.389056033701991 (approximation formula) --------------------------------------- error .000000065228659 approximation is correct to 1 2345678 8 digits The major source of error was introduced when we calculated the difference exp(2 + 1e-8) - exp(2). exp(2 + 1e-8) = 7.389056172821211 (accuracy is 16 digits) exp(2) = 7.389056098930650 (accuracy is 16 digits) --------------------------------------- difference = .000000073890561 12345678 (accuracy is 8 digits) Our full 16 digits of accuracy was lost since half the digits of exp(2 + 1e-8) and exp(2) were in common and so canceled. This is an unpleasant feature of numerical derivatives. Done right, if you start with k digits of accuracy, you will fall to k/2 digits at worst. (You can get lucky, the best case being when f(x0)=0 in which case you can get all k digits back, ignoring the inaccuracy introduced by the algorithm itself.) Thus, as a programmer of likelihood functions, if you are going to use numerical derivatives, it is vitally important that you supply ln L() as accurately as you can. This means all your -generate- statements should explicitly specify double: . gen double ... = ...

The other issue in calculating numerical derivatives is choosing h. Stata does that for you. Just so that you are properly appreciative, let's consider that problem: If Stata chooses a value for h that is too small, f(x0+h) will numerically equal f(x0) and then the approximation to f'(x0) will be zero. Values larger than that, but still too small, result in poor accuracy due to the numerical roundoff error associated with subtraction that we have just seen. If f(x0+h) is very nearly equal f(x0), nearly all the digits are in common and the difference has few significant digits. For instance, say f(x) and f(x+h) differ only in the 15th and 16th digits: f(x+h) = x.xxxxxxxxxxxxxab f(x) = x.xxxxxxxxxxxxxcd --------------------------------------- difference = 0.0000000000000ef (2 digits of accuracy) Even worse, the 16 digit numbers shown might be accurate to only 14 digits themselves. In that case, the difference f(x+h)-f(x) would be purely noise. On the other hand, if Stata chooses value for h that is too large, the approximation formula itself will not estimate the derivative accurately.

Choosing the right value of h is of great importance. Let me show you. A minute ago we considered numerically evaluating the derivative of exp(x) at x=2. Below I do that I calculate the error for a variety of h's: . drop _all . input h h 1. 1e-20 2. 1e-19 3. 1e-18 <output omitted> 19. 1e-2 20. 1e-1 21. 1 22. end . gen double approx = (exp(2+h)-exp(2))/h . gen double error = exp(2) - approx . gen double relerr = error/exp(2) . list h approx error relerr 1. 1.00e-20 0 7.3890561 1 2. 1.00e-19 0 7.3890561 1 3. 1.00e-18 0 7.3890561 1 4. 1.00e-17 0 7.3890561 1 5. 1.00e-16 0 7.3890561 1 6. 1.00e-15 6.2172489 1.1718072 .15858686 7. 1.00e-14 7.5495167 -.1604606 -.02171598 8. 1.00e-13 7.3807628 .0082933 .00112238 9. 1.00e-12 7.3896445 -.00058838 -.00007963 10. 1.00e-11 7.3890228 .00003334 4.512e-06 11. 1.00e-10 7.3890582 -2.057e-06 -2.783e-07 12. 1.00e-09 7.3890567 -5.878e-07 -7.956e-08 13. 1.00e-08 7.3890561 2.032e-08 2.750e-09 14. 1.00e-07 7.3890565 -3.636e-07 -4.920e-08 15. 1.00e-06 7.3890598 -3.694e-06 -5.000e-07 16. 1.00e-05 7.389093 -.00003695 -5.000e-06 17. .0001 7.3894256 -.00036947 -.00005 18. .001 7.3927519 -.00369576 -.00050017 19. .01 7.4261248 -.03706874 -.00501671 20. .1 7.7711381 -.38208204 -.05170918 21. 1 12.696481 -5.3074247 -.71828183

In comparing these alternative h's, you should focus on the relerr column. The minimum relative error in this experiment is recorded in observation 13, h = 1e-8. So, should h be 1e-8? Not in general. 1e-8 is a good choice for calculating the derivative of exp() at 2, but at some other location, a different value of h would be best. Even if you map out the entire exp() function, you will not have solved the problem. Change the function and you change the best value of h. J. C. Nash (1990, 219) Compact Numerical Methods for Computers: Linear Algebra and Function Minimisation, 2d ed., New York: Adam Hilger, suggested that one choose h so that x+h differs from x in at least half of its digits (the least significant ones). In particular, Nash suggested the formula h(x) = e*(x+e) where e is the square root of machine precision. Machine precision is roughly 1e-16 for double precision and therefore h(2) = 2e-8. This way of dynamically adjusting h to the value of x works pretty well. Bill Sribney and I worked on this problem and came to the conclusion that Nash's suggested solution could be improved upon. The are two issues that require balancing: 1) If calculation were carried out infinite precision, the smaller is h, the more accurately [f(x+h)-f(x)]/h would approximate the derivative. 2) Taking into consideration finite precision, the closer are the values f(x+h) and f(x), the fewer are the significant digits in the calculation f(x+h)-f(x). Thus, we use a variation on Nash's suggestion and control the numerator: f(x+h) and f(x) should differ in about the half digits. That is, we adopt the rule of setting h as small as possible subject to the constraint that f(x+h)-f(x) will be calculated to at least half accuracy. This is a computationally expensive rule to implement because, each time a numerical derivative is to be calculated, the program must search for the optimal value of h meaning that f(x+h) must be calculated for each trial value. The payoff, however, is that -ml- is remarkably robust. More than once we have been told by a user of having a miserable experience attempting maximizing a function using some other package and that they had to grope around for just the right starting values. They tried the function using -ml- and there was no problem no matter what the starting values.

Why?, they wondered. Well, now you know. It is the numerical derivative calculation. The cost of this robustness, however, is that -ml- uses more computer time to get to the result. -ml- also uses a centered derivative calculation. Rather than using f(x0+h) - f(x0) f'(x0) is approximately --------------- h -ml- uses f(x0+h/2) - f(x0-h/2) f'(x0) is approximately --------------------- h This also ups the computational time required but results in an important improvement.

Numerical second derivatives and the accuracy of estimated variance ------------------------------------------------------------------- Remember that Newton-Raphson not only requires f'(x0), it requires f''(x0), too, the second derivatives. We use the same method for calculating this as we use for calculating the first derivatives, f'(x0+h/2) - f'(x0-h/2) f''(x0) is approximately ----------------------- h So, if you make the substitution for f'(x0), you obtain: f(x0+h) - 2*f(x0) + f(x0-h) f''(x0) is approximately --------------------------- h*h I want you to think about the accuracy of this. If f() is calculated to 16 decimal places, f'() will be accurate to 16/2 decimal places (at worst). If f'() is accurate to 8 decimal places, f''() will be accurate to 8/2 decimal places (at worst). Thus, f''() is accurate to 4 decimal places (at worst). So that is the accuracy of your standard errors. If you start with 16 digits of accuracy, they will be good to 4 digits -- maybe more if you are lucky -- but maybe a bit less due to the fact that the approximation formula for the calculation would still be only an approximation if executed at infinite precision. Figure 4 digits. If your routine that calculates the likelihood function is good to 12 digits, your standard errors are good to 3 digits. If your routine is good to 8 digits, your standard errors are good to 2 digits. If you want more accuracy than that, you must code the analytic derivatives.

The -ml- method -lf- checklist ------------------------------ 1. Write your likelihood-evaluation program. In what follows I will assume you name it -myll-. The outline for the program is: program define myll local f "`1'" local I "`2'" optional -> local I2 "`3'" ... optional -> parse "$S_mldepn", parse(" ") local y1 "`1'" local y2 "`2'" ... optional -> tempvar work1 work2 ... gen double `work1' = ... gen double `work2' = ... replace `f' = ... end 2. Define your equations: . eq myeq: ... . eq myeq2: ... <- optional . ... 3. Give the -ml- commands to setup the problem: . ml begin . ml function myll . ml method lf . ml model b = myeq [myeq2 ...] [, depv(...) cons(...)] . ml sample mysamp [if <exp>] 4. Search for initial values (optional) . ml search, limits(myeq # # [myeq2 # # ...]) 5. Perform the optimization . ml maximize f V 6. Post the results . ml post myest . ml mlout myest

The -ml- method -lf- do-file ---------------------------- I use do-files when I have to work with -ml-. There are just too many places where I can make some silly mistake. Although in most cases I can just back up and fix it, I do not really feel comfortable unless the whole thing runs without error from start to finish. So here is how I organize my do-file: ----------------------------- top --- sample do-file -------- capture log close log using somefile, replace program drop _all set more off clear use such-and-such /* load data */ eq myeq : ... /* define equations first */ eq ... program define myll /* define program */ ... end ml begin ml function myll ml method lf ml model b = myeq ... ml sample mysamp ml search, limits(<#> <#> or myeq # # ...) ml maximize f V ml post myest ml mlout myest log close ----------------------------- end --- sample do-file -------- Here is what I would like you to notice about my do-file: 1. When I run it, I am certain that there is no remnant of some previous problem laying around that will cause problems and confuse me. I -program drop _all-. I -clear- (which gets rid of any old equations.) I reload the data on which I intend to run. 2. I set the specific problem to be estimated at the top and after that include the -program define- and -ml- code. This way, I can easily change the specific model (mpg on weight and displ) easily.