Note: This FAQ is for Stata 10 and older versions of Stata.

In Stata 11, the margins command replaced mfx.

What is the difference between the linear and nonlinear methods that mfx uses?

Title   Methods for obtaining marginal effects
Author May Boggess, StataCorp

After an estimation, the command mfx calculates marginal effects. A marginal effect of an independent variable x is the partial derivative, with respect to x, of the prediction function f specified in the mfx command’s predict option. If no prediction function is specified, the default prediction for the preceding estimation command is used. This derivative is evaluated at the values of the independent variables specified in the at option of the mfx command; if no values are specified, it is evaluated at the default values, which are the means of the independent variables. If there were any offsets in the preceding estimation, the derivative is evaluated at the means of the offset variables.

The derivative is calculated numerically by mfx, meaning that it approximates the derivative by using the following formula with an appropriately small h:

                             df      lim     f(x+h) − f(x)
  Marginal effect of x  =   ----  =        -----------------
                             dx      h->0          h

In the above equation, I have been a bit lazy. I wrote the prediction function f with only one argument. This may allow us to forget that f is a function of all the independent variables in the model—let’s call them x_i, i=1...p−1—and their coefficients, b_i, i=1...p−1. We'll call the constant term b_p, and, for reasons that will become clear later on, I will define x_p=1. So, for f(x) I write

  f(x_1, ... ,x_i, ..., x_p, b_1, b_2, ..., b_p)

Using this notation would remind us that we are taking a partial derivative, so that all the other variables are being held constant, each x at its mean, and the coefficients at the values estimated by the previous estimation command. (Using this notation would also make the formulas long and cumbersome!)
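To make the numerical derivative concrete before moving on to standard errors, here is a rough do-file-style sketch of the idea for one continuous variable. I use a logit model on the auto data purely as an illustration, and the step size below is an arbitrary choice for display, not the value mfx would actually settle on:

  sysuse auto, clear
  quietly logit foreign weight length

  * means of the covariates, which is where the derivative is evaluated
  quietly summarize weight
  scalar wbar = r(mean)
  quietly summarize length
  scalar lbar = r(mean)

  * forward-difference approximation of d Pr(foreign)/d weight at the means
  scalar step = 0.001*wbar
  scalar f0 = invlogit(_b[weight]*wbar        + _b[length]*lbar + _b[_cons])
  scalar fh = invlogit(_b[weight]*(wbar+step) + _b[length]*lbar + _b[_cons])
  display "approximate marginal effect of weight = " (fh-f0)/step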

A formula for the standard error of the marginal effect is obtained using the delta method

 Standard error of marginal effect of x = square root of D_x' * V * D_x

where V is the variance–covariance matrix from the estimation and D_x is the column vector whose jth entry is the partial derivative of the marginal effect of x, with respect to the coefficient of the jth independent variable:

          d      df
  D_xj = ----- ( --- )
          db_j   dx

Thus, to compute a single standard error, we must compute the derivative of the marginal effect with respect to each coefficient in the model.
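The quadratic form itself is a one-line matrix calculation. Here is a deliberately trivial do-file-style sketch, using a linear regression with the xb prediction, where the marginal effect of weight is just _b[weight] and D_x reduces to a unit vector (the matrix names are illustrative):

  sysuse auto, clear
  quietly regress mpg weight length

  matrix V   = e(V)            // variance-covariance matrix of the estimates
  matrix D_x = (1 \ 0 \ 0)     // d(ME)/db for b = (weight, length, _cons)
  matrix v2  = D_x' * V * D_x  // 1 x 1 matrix: the delta-method variance
  display "delta-method std. err. = " sqrt(v2[1,1]) "    _se[weight] = " _se[weight]

The two numbers should agree, which is the same point made with the mfx output for xb-type predictions later on.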

Computing the derivative of a function f with respect to a variable x can be time consuming because an iterative algorithm must be used to seek out an appropriate change in x (the h). The mfx command can avoid this type of iteration in two situations.

The first is when the variable x is not continuous but is a dummy variable; in other words, the values of x can only be 0 or 1. In that case, asking for the value of the derivative at the average value of x doesn’t make a lot of sense. Instead, mfx computes the slope of the line between f(0) and f(1). In other words, for a dummy variable it uses h=1. Let’s verify this with an example:

. sysuse auto
(1978 automobile data)

. mlogit rep78 mpg foreign, nolog

Multinomial logistic regression                         Number of obs =     69
                                                        LR chi2(8)    =  35.56
                                                        Prob > chi2   = 0.0000
Log likelihood = -75.91106                              Pseudo R2     = 0.1898

rep78 Coefficient Std. err. z P>|z| [95% conf. interval]
1
mpg .0827881 .1432153 0.58 0.563 -.1979087 .3634849
foreign -14.30099 2342.179 -0.01 0.995 -4604.888 4576.286
_cons -4.257887 3.07217 -1.39 0.166 -10.27923 1.763455
2
mpg .001712 .0903742 0.02 0.985 -.1754182 .1788423
foreign -13.93023 1120.041 -0.01 0.990 -2209.171 2181.311
_cons -1.249118 1.774027 -0.70 0.481 -4.726146 2.227911
3 (base outcome)
4
mpg .039967 .0701494 0.57 0.569 -.0975234 .1774574
foreign 2.059856 .8017627 2.57 0.010 .4884301 3.631282
_cons -1.877983 1.432903 -1.31 0.190 -4.686422 .9304561
5
mpg .1804401 .0870028 2.07 0.038 .0099178 .3509623
foreign 3.048564 1.038079 2.94 0.003 1.013966 5.083162
_cons -6.451597 2.135118 -3.02 0.003 -10.63635 -2.266843
. mfx, predict(p outcome(1)) varlist(foreign)

Marginal effects after mlogit
      y  = Pr(rep78==1) (predict, p outcome(1))
         =  .00057038
variable dy/dx Std. err. z P>|z| [ 95% C.I. ] X
foreign* -.0455262 .03156 -1.44 0.149 -.107388 .016336 .304348
(*) dy/dx is for discrete change of dummy variable from 0 to 1

. summarize mpg
Variable Obs Mean Std. dev. Min Max
mpg 74 21.2973 5.785503 12 41
. replace mpg=r(mean)
variable mpg was int now float
(74 real changes made)

. replace foreign=0
(22 real changes made)

. predict p0, p outcome(1)

. replace foreign=1
(74 real changes made)

. predict p1, p outcome(1)

. display _n "my marginal effect for foreign = " p1 - p0

my marginal effect for foreign = -.04554773

Computing the standard errors of marginal effects of dummy variables is not as efficient because it requires taking derivatives with respect to the coefficients, and the coefficients are continuous variables.

The second case in which it is easy to get a derivative is when the function is linear. Remember that the slope of a straight line is “rise over run”, and it doesn’t matter what you use for run; it also doesn’t matter which two points on the line you use to get the rise over run, because the line is straight.

As I said above, our prediction function depends on the variables and their coefficients. If we find that the function f depends on these in a particular way, we can get a marked simplification.

We may find that the only way the x’s and b’s appear in the formula for f is in the sum

    p
    Σ  x_i*b_i
   i=1

As an example, we can use the predicted probability of success following logistic:

                                                    exp(sum_i x_i*b_i)
  f(x_1, ... ,x_i, ..., x_p, b_1, b_2, ..., b_p)= -------------------------
                                                   1 + exp(sum_i x_i*b_i)

Let’s see this at work:

. sysuse auto, clear
(1978 automobile data)

. logistic for mpg turn

Logistic regression                                     Number of obs =     74
                                                        LR chi2(2)    =  38.66
                                                        Prob > chi2   = 0.0000
Log likelihood = -25.703368                             Pseudo R2     = 0.4292

foreign Odds ratio Std. err. z P>|z| [95% conf. interval]
mpg .9275081 .0645275 -1.08 0.279 .8092803 1.063008
turn .5365028 .0831728 -4.02 0.000 .3959243 .7269957
_cons 4.42e+10 3.00e+11 3.61 0.000 74613.63 2.62e+16
Note: _cons estimates baseline odds.

. predict p, p

. gen myp= exp(_b[mpg]*mpg + _b[turn]*turn + _b[_cons])/(1+exp(_b[mpg]*mpg + _b[turn]*turn + _b[_cons]))

. summarize p myp
Variable Obs Mean Std. dev. Min Max
p 74 .2972973 .3159947 .0002891 .9721519
myp 74 .2972973 .3159947 .0002891 .9721519

In this example, our prediction function was

  f(mpg,turn,1,_b[mpg], _b[turn],_b[_cons])

                  exp(_b[mpg]*mpg + _b[turn] *turn + _b[_cons])
             = ---------------------------------------------------
                1 + exp(_b[mpg]*mpg + _b[turn] *turn + _b[_cons])

In general, we express this condition as

  f(x_1, ... ,x_i, ..., x_p, b_1, b_2, ..., b_p) =  f(sum_i x_i*b_i)

This is called the “linear method condition”.

How would this condition not be satisfied? You would have to treat some independent variables differently than others. This can occur in a multiple-equation model. Let’s say we have two equations, calling the variables in the first equation x with coefficients b and those in the second z with coefficients c. Then the linear condition would be

  f(x_1, ..., x_p, b_1, ..., b_p, z_1, ..., z_q, c_1, ..., c_q) 
             =  f(sum_i x_i*b_i, sum_j z_j*c_j)

where the x’s and z’s have no variables in common. Let’s look at an example using biprobit:

. webuse school, clear

. biprobit (private years) (vote logptax loginc), nolog

Seemingly unrelated bivariate probit                    Number of obs =     95
                                                        Wald chi2(3)  =   7.61
Log likelihood = -90.161121                             Prob > chi2   = 0.0547

Coefficient Std. err. z P>|z| [95% conf. interval]
private
years -.0188994 .0250596 -0.75 0.451 -.0680154 .0302166
_cons -1.103314 .2531021 -4.36 0.000 -1.599385 -.6072429
vote
logptax -1.166334 .557388 -2.09 0.036 -2.258795 -.0738738
loginc 1.12626 .4385527 2.57 0.010 .2667124 1.985807
_cons -2.813394 3.676768 -0.77 0.444 -10.01973 4.392939
/athrho -.2851457 .2398178 -1.19 0.234 -.7551801 .1848886
rho -.2776609 .2213289 -.6382291 .1828103
LR test of rho=0: chi2(1) = 1.48789    Prob > chi2 = 0.2225

. mfx, predict(p10) tracelvl(1)

calculating dydx (linear method)
variable dy/dx
years -.00174
logptax .0489
loginc -.04722
calculating standard errors (linear method)

years ... continuous
Step: 1 2 3 4 5 6
years: Std. err. = .00241497

logptax ... continuous
Step: 1 2 3 4 5 6
logptax: Std. err. = .02786338

loginc ... continuous
Step: 1 2 3 4 5 6
loginc: Std. err. = .02342405

Marginal effects after biprobit
      y  = Pr(private=1,vote=0) (predict, p10)
         =  .05831027
variable dy/dx Std. err. z P>|z| [ 95% C.I. ] X
years -.0017351 .00241 -0.72 0.472 -.006468 .002998 8.51579
logptax .0489027 .02786 1.76 0.079 -.005708 .103514 6.9395
loginc -.0472225 .02342 -2.02 0.044 -.093133 -.001312 9.97102

In this example, there is one x, years, and two z's, logptax and loginc. There are no variables in common in the equations, and the formulas for the prediction function contain only the sums. The tracelvl(1) option shows that mfx used the linear method.

Here is an example in which one variable appears in both equations:

. biprobit (private logptax years) (vote logptax loginc), nolog

Seemingly unrelated bivariate probit                    Number of obs =     95
                                                        Wald chi2(4)  =   7.65
Log likelihood = -90.13665                              Prob > chi2   = 0.1052

Coefficient Std. err. z P>|z| [95% conf. interval]
private
logptax .1241868 .5630383 0.22 0.825 -.979348 1.227722
years -.0176921 .0256939 -0.69 0.491 -.0680513 .0326671
_cons -1.977172 3.971164 -0.50 0.619 -9.76051 5.806167
vote
logptax -1.178059 .560322 -2.10 0.036 -2.27627 -.0798483
loginc 1.125638 .4386281 2.57 0.010 .2659429 1.985333
_cons -2.725785 3.698396 -0.74 0.461 -9.974508 4.522938
/athrho -.2821851 .2397268 -1.18 0.239 -.752041 .1876709
rho -.2749262 .2216072 -.636365 .1854982
LR test of rho=0: chi2(1) = 1.45829    Prob > chi2 = 0.2272

. mfx, predict(xb1) tracelvl(1)

calculating dydx (nonlinear method)
variable dy/dx
logptax .12419
years -.01769
loginc 0
calculating standard errors (nonlinear method)

logptax ... continuous
1 2 3 4 5 6 7
logptax: Std. err. = .56303852

years ... continuous
1 2 3 4 5 6 7
years: Std. err. = .02569392

loginc ... continuous
1 2 3 4 5 6 7
loginc: Std. err. = 0

Marginal effects after biprobit
      y  = Linear prediction of private (predict, xb1)
         =  -1.2660402
variable dy/dx Std. err. z P>|z| [ 95% C.I. ] X
logptax .1241868 .56304 0.22 0.825 -.979348 1.22772 6.9395
years -.0176921 .02569 -0.69 0.491 -.068051 .032667 8.51579
loginc 0 0 . . 0 0 9.97102
. predict xb1, xb1

. gen myxb1=_b[private:logptax]*logptax + _b[private:years]*years + _b[private:_cons]

. summarize xb1 myxb1
Variable Obs Mean Std. dev. Min Max
xb1 95 -1.26604 .1854367 -2.00642 -1.081714
myxb1 95 -1.26604 .1854367 -2.00642 -1.081714

Now, you might think that with the predict option xb1, mfx surely would have used the linear method. But, remember we have x_1=logptax, x_2=years, z_1=logptax, and z_2=loginc. The prediction function is

  xb1 =  f(x_1, ..., x_p, b_1, ..., b_p, z_1, ..., z_q, c_1, ..., c_q) 
      =  f(x_1, x_2, 1, b_1, b_2, b_3, z_1, z_2, 1, c_1, c_2, c_3) 
      =  x_1*b_1 + x_2*b_2 + b_3
      =  logptax*b_1 + years*b_2 + b_3
      =  z_1*b_1 + x_2*b_2 + b_3, 

where b_3 is the constant from equation 1. The last line shows that z_1 appears in this formula, but not as part of the sum sum_j z_j*c_j, because z_1 appears there without its coefficient c_1.

How does mfx tell the linear case from the nonlinear one? It looks for variables that are common to multiple equations because, if there are any, even the simplest prediction (i.e., xb) will not satisfy the linear condition.

As you have seen in the above examples, using the option tracelvl(1) shows you, among other things, which method mfx is using, either linear or nonlinear. Now, we will go through the details of the two methods.

Nonlinear method

As we said at the top, for the marginal effect of x, mfx uses

                             df      lim     f(x+h) - f(x)
  Marginal effect of x  =   ----  =        -----------------
                             dx      h->0          h

It takes an initial guess at a good value for h, and then, if the value is not good enough, it iterates (making h larger or smaller) until it finds an h that is good enough. Here “good enough” means it will produce the desired level of accuracy.
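The step-size search itself is internal to mfx, but the idea can be sketched in a do-file by watching a forward difference settle down as h shrinks. Again I use a logit model on the auto data only as an illustration, and the particular sequence of step sizes is arbitrary:

  sysuse auto, clear
  quietly logit foreign weight length
  quietly summarize weight
  scalar wbar = r(mean)
  quietly summarize length
  scalar lbar = r(mean)

  scalar f0   = invlogit(_b[weight]*wbar + _b[length]*lbar + _b[_cons])
  scalar step = wbar/10
  forvalues k = 1/6 {
      scalar fh = invlogit(_b[weight]*(wbar+step) + _b[length]*lbar + _b[_cons])
      display "h = " %10.0g step "   (f(x+h)-f(x))/h = " %10.0g (fh-f0)/step
      scalar step = step/10
  }

After the first few reductions in h, the quotient stops changing in its leading digits, which is the kind of stability the iteration is after.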

For the standard error, we have the formula from the delta method:

 Standard error of marginal effect of x = square root of D_x' * V * D_x

where V is the variance–covariance matrix from the estimation and D_x is the column vector whose jth entry is the partial derivative of the marginal effect of x, with respect to the coefficient of the jth independent variable:

           d     df
  D_xj = ----- ( --- )
          db_j   dx

Each of these second derivatives is calculated in the same way. We take an initial guess at a good, small change in b_j, and, if that doesn’t meet the accuracy criteria, we iterate, making it smaller or larger, until it does.

If we use the option tracelvl(2) on mfx, we can watch the calculation of each of these second derivatives as it occurs:

. webuse school, clear

. biprobit (private logptax years) (vote logptax loginc), nolog

Seemingly unrelated bivariate probit                    Number of obs =     95
                                                        Wald chi2(4)  =   7.65
Log likelihood = -90.13665                              Prob > chi2   = 0.1052

Coefficient Std. err. z P>|z| [95% conf. interval]
private
logptax .1241868 .5630383 0.22 0.825 -.979348 1.227722
years -.0176921 .0256939 -0.69 0.491 -.0680513 .0326671
_cons -1.977172 3.971164 -0.50 0.619 -9.76051 5.806167
vote
logptax -1.178059 .560322 -2.10 0.036 -2.27627 -.0798483
loginc 1.125638 .4386281 2.57 0.010 .2659429 1.985333
_cons -2.725785 3.698396 -0.74 0.461 -9.974508 4.522938
/athrho -.2821851 .2397268 -1.18 0.239 -.752041 .1876709
rho -.2749262 .2216072 -.636365 .1854982
LR test of rho=0: chi2(1) = 1.45829    Prob > chi2 = 0.2272

. mfx, predict(xb1) tracelvl(2)

calculating dydx (nonlinear method)
variable dy/dx
logptax .12419
years -.01769
loginc 0
calculating standard errors (nonlinear method)

logptax ... continuous
1 d^2f/dxdb = 1.0000004
2 d^2f/dxdb = 0
3 d^2f/dxdb = 0
4 d^2f/dxdb = 0
5 d^2f/dxdb = 0
6 d^2f/dxdb = 0
7 d^2f/dxdb = 0
logptax: Std. err. = .56303852

years ... continuous
1 d^2f/dxdb = 0
2 d^2f/dxdb = .99999923
3 d^2f/dxdb = 0
4 d^2f/dxdb = 0
5 d^2f/dxdb = 0
6 d^2f/dxdb = 0
7 d^2f/dxdb = 0
years: Std. err. = .02569392

loginc ... continuous
1 d^2f/dxdb = 0
2 d^2f/dxdb = 0
3 d^2f/dxdb = 0
4 d^2f/dxdb = 0
5 d^2f/dxdb = 0
6 d^2f/dxdb = 0
7 d^2f/dxdb = 0
loginc: Std. err. = 0

Marginal effects after biprobit
      y  = Linear prediction of private (predict, xb1)
         =  -1.2660402
variable dy/dx Std. err. z P>|z| [ 95% C.I. ] X
logptax .1241868 .56304 0.22 0.825 -.979348 1.22772 6.9395
years -.0176921 .02569 -0.69 0.491 -.068051 .032667 8.51579
loginc 0 0 . . 0 0 9.97102

The result here is quite predictable. Since our prediction function is xb1, we would expect the coefficients and their standard errors as the answers. The second derivatives all turn out to be zeros and ones, so pre- and postmultiplying the covariance matrix V by D_x (in the standard error formula) is simply picking the relevant entry out of the diagonal of V.
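To see this in the numbers, a quick do-file-style check along these lines (assuming the biprobit results above are still the active estimation results, with private:logptax as the first coefficient) should return essentially the .5630383 from the coefficient table; mfx’s .56303852 differs only because its numerical derivative came out as 1.0000004 rather than exactly 1:

  matrix V = e(V)
  display "sqrt(V[1,1]) = " sqrt(V[1,1])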

Linear method

To figure out which formulas mfx is using for prediction functions that satisfy the linear condition, let’s start with the easiest case, a single equation estimation. So, say we have variables x with coefficients b. (As I did above, I am saying the constant is the coefficient of a variable that is equal to one.) Then, the linear condition would be

  f(x_1, ..., x_p, b_1, ..., b_p) =  f(sum_i x_i*b_i)

which every prediction function for a single-equation estimator is going to satisfy. In the interests of trimming down our notation, write

  xb =   sum_i x_i*b_i 

I can use the chain rule to get the derivative of f with respect to any of my x’s:

                              df       df    d(xb)      df     
  Marginal effect of x_i  =  ----  =  ---- * -----  =  ---- * b_i
                             dx_i     d(xb)   dx_i     d(xb)  

The same df/d(xb) is used for every x_i. We only actually have to work out one derivative, df/d(xb), and then we get all the marginal effects simply by multiplying it by the appropriate coefficient. For this reason, the linear method is much faster than the nonlinear.

mfx calculates df/d(xb) in the usual way, iterating to find h and calculating (f(x+h)−f(x))/h, and then multiplies it by b_i.

Let’s look at an example:

. sysuse auto, clear
(1978 automobile data)

. regress mpg weight length

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(2, 71)        =     69.34
       Model |  1616.08062         2  808.040312   Prob > F        =    0.0000
    Residual |  827.378835        71   11.653223   R-squared       =    0.6614
-------------+----------------------------------   Adj R-squared   =    0.6519
       Total |  2443.45946        73  33.4720474   Root MSE        =    3.4137
mpg Coefficient Std. err. t P>|t| [95% conf. interval]
weight -.0038515 .001586 -2.43 0.018 -.0070138 -.0006891
length -.0795935 .0553577 -1.44 0.155 -.1899736 .0307867
_cons 47.88487 6.08787 7.87 0.000 35.746 60.02374
. mfx, predict(xb) nose tracelvl(2)

calculating dydx (linear method)
equation i= 1 : df/d(xb) = 1
variable dy/dx
weight -.00385
length -.07959
Marginal effects after regress
      y  = Linear prediction (predict, xb)
         =  21.297297
variable dy/dx X
weight -.0038515 3019.46
length -.0795935 187.932

df/d(xb) turned out to be 1. That's no surprise because f(xb)=xb. Let’s try something more interesting.

The prediction function we will use in this next example is the probability of success from a logistic regression:

               exp(xb)  
  f(xb) =   --------------
             1 + exp(xb)

Differentiating this with respect to xb using the quotient rule gives

   df             exp(xb)  
  ----   =   -----------------
  d(xb)       (1 + exp(xb))^2

Let’s see how this checks out:

. sysuse auto
(1978 automobile data)

. logit foreign weight length, nolog

Logistic regression                                     Number of obs =     74
                                                        LR chi2(2)    =  31.99
                                                        Prob > chi2   = 0.0000
Log likelihood = -29.040272                             Pseudo R2     = 0.3551

foreign Coefficient Std. err. z P>|z| [95% conf. interval]
weight -.0028419 .0016763 -1.70 0.090 -.0061273 .0004436
length .0089197 .0542959 0.16 0.870 -.0974983 .1153376
_cons 5.366227 5.77534 0.93 0.353 -5.953233 16.68569
. mfx, predict(p) nose tracelvl(2)

calculating dydx (linear method)
equation i= 1 : df/d(xb) = .14552699
variable dy/dx
weight -.00041
length .0013
Marginal effects after logit
      y  = Pr(foreign) (predict, p)
         =  .17677715
variable dy/dx X
weight -.0004136 3019.46
length .0012981 187.932
. quietly summarize weight

. quietly replace wei=r(mean)

. quietly summarize length

. quietly replace len=r(mean)

. list wei len in 1
weight length
1. 3,019.5 187.932
. predict xb, xb

. gen my_dfdxb= exp(xb)*(1+exp(xb))^(-2)

. list my_dfdxb in 1
my_dfdxb
1. .145527
. display " marg. eff wei = " _b[wei]*my_dfdxb marg. eff wei = -.00041357 . display " marg. eff len = " _b[len]*my_dfdxb marg. eff len = .00129805

In the next example, we show that the same calculations are used for multiple equations with no variables in common. In this case, we compute df/d(xb) for each equation:

. sysuse auto
(1978 automobile data)

. reg3 (mpg weight) (mpg length)

Three-stage least-squares regression
Equation Obs Params RMSE "R-squared" chi2 P>chi2
mpg 74 1 3.72811 0.5791 90.04 0.0000
2mpg 74 1 3.800181 0.5626 83.04 0.0000
Coefficient Std. err. z P>|z| [95% conf. interval]
mpg
weight -.0040049 .0004221 -9.49 0.000 -.0048321 -.0031777
_cons 33.38998 1.33402 25.03 0.000 30.77535 36.00461
2mpg
length -.1377196 .0151131 -9.11 0.000 -.1673407 -.1080984
_cons 47.17927 2.868909 16.45 0.000 41.55631 52.80223
Endogenous variables: mpg
Exogenous variables: weight length

. mfx, predict(xb equation(#1)) nose tracelvl(2)

calculating dydx (linear method)
equation i= 1 : df/d(xb) = 1
equation i= 2 : df/d(xb) = 0
variable dy/dx
weight -.004
length 0
Marginal effects after reg3
      y  = Linear prediction (predict, xb equation(#1))
         =  21.297297
variable dy/dx X
weight -.0040049 3019.46
length 0 187.932

Here, the prediction function is

  f(weight, 1, _b[mpg:weight],  _b[mpg:_cons], length, _b[2mpg:length], 
            1, _b[2mpg:_cons])
            = _b[mpg:weight]*weight + _b[mpg:_cons].

For the first equation, df/d(xb)=1. For the second equation, df/d(xb)=0 because the sum (_b[2mpg:length]*length + _b[2mpg:_cons]) does not appear in the prediction function. If that's correct, when we use the second equation prediction, it should be the other way around:

. mfx, predict(xb equation(#2)) nose tracelvl(2)

calculating dydx (linear method)

equation i= 1 : df/d(xb) = 0

equation i= 2 : df/d(xb) = 1

variable dy/dx
weight 0
length -.13772
Marginal effects after reg3
      y  = Linear prediction (predict, xb equation(#2))
         =  21.297297
variable dy/dx X
weight 0 3019.46
length -.1377196 187.932

Now let’s move on to the standard errors. Again, let’s start with a single-equation estimator. Say we have variables x with coefficients b, and we are trying to calculate the second derivatives:

           d      df
  D_ij = ----- ( --- )
          db_j   dx_i

If i is not equal to j, using the chain rule, we have

           d      df       d      df      
  D_ij = ----- ( --- ) = ----- ( ---- * b_i )
          db_j   dx_i     db_j   d(xb)              

                  d      df
       = b_i *  ----- ( ---- )
                 db_j   d(xb)

                  d       df       d(xb)
       = b_i *  -----  ( ---- ) * -------
                 d(xb)   d(xb)     db_j

                  d       df       
       = b_i *  -----  ( ---- ) * x_j
                 d(xb)   d(xb)     

                       d       df       
       = b_i * x_j * -----  ( ---- ) 
                     d(xb)   d(xb)     

If i is equal to j, we don't get out of it so easily on the first line. We’ll need to use the product rule there:

           d      df       d      df      
  D_ii = ----- ( --- ) = ----- ( ---- * b_i )
          db_i   dx_i     db_i   d(xb)              

            df                  d      df  
       =   ----  * 1  +  b_i * ---- ( ---- )
           d(xb)               db_i   d(xb) 

                       d      df         df 
       = b_i * x_i * ----- ( ---- )  +  ----  
                     d(xb)   d(xb)      d(xb)

We already have df/d(xb) calculated from before (the marginal effect calculation), so all we have to do is the second derivative, d^2f/d(xb)d(xb). This is calculated by mfx using the chain rule:

   d     df        d       df     
  ---- ( --- ) = -----  ( ---- ) * b_i
  dx_i  d(xb)     d(xb)   d(xb)  

so

    d      df         d     df 
  ----- ( ---- )  = ----- ( ---- ) /b_i 
  d(xb)   d(xb)      dx_i  d(xb)

This is true for any x_i, so mfx uses x_1. It finds the derivative of df/d(xb) with respect to x_1 in the usual way, iterating to find h and forming the difference quotient, and then divides the result by b_1.

Time for an example:

. logit for headroom trunk, nolog

Logistic regression                                     Number of obs =     74
                                                        LR chi2(2)    =  10.60
                                                        Prob > chi2   = 0.0050
Log likelihood = -39.73265                              Pseudo R2     = 0.1177

foreign Coefficient Std. err. z P>|z| [95% conf. interval]
headroom -.2969306 .4487154 -0.66 0.508 -1.176397 .5825354
trunk -.1708735 .0906542 -1.88 0.059 -.3485526 .0068055
_cons 2.222753 1.084936 2.05 0.040 .0963174 4.349189
. mfx, predict(xb) tracelvl(2)

calculating dydx (linear method)
equation i= 1 : df/d(xb) = 1
variable dy/dx
headroom -.29693
trunk -.17087
calculating standard errors (linear method)
equation k = 1:
k= 1, l= 1: d^2f/d(xb_k)d(xb_l) = 0

headroom ... continuous
1 d^2f/dxdb = 1
2 d^2f/dxdb = 0
3 d^2f/dxdb = 0
headroom: Std. err. = .44871539

trunk ... continuous
1 d^2f/dxdb = 0
2 d^2f/dxdb = 1
3 d^2f/dxdb = 0
trunk: Std. err. = .09065423

Marginal effects after logit
      y  = Linear prediction (log odds) (predict, xb)
         =  -1.016698
variable dy/dx Std. err. z P>|z| [ 95% C.I. ] X
headroom -.2969306 .44872 -0.66 0.508 -1.1764 .582535 2.99324
trunk -.1708735 .09065 -1.88 0.059 -.348553 .006806 13.7568

Here the prediction function is

  f(headroom, trunk, 1, _b[headroom], _b[trunk], _b[_cons])
          = _b[headroom]*headroom + _b[trunk]*trunk + _b[_cons]

We can see that df/d(xb)=1, and by this, I mean df/d(xb) is a constant function. When we take the derivative of df/d(xb) with respect to any x_i, we get zero. When we substitute this into the above formulas for D_ij, we will get 1 when i=j and zero otherwise.
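The xb prediction is a degenerate case, since df/d(xb) is constant. For a prediction that is genuinely nonlinear in xb, such as the logit probability, df/d(xb) = p(1-p) and d/d(xb) of df/d(xb) is p(1-p)(1-2p), so the entries of D could be assembled by hand along the lines of the formulas above. Here is a do-file-style sketch for the same headroom and trunk model, using the analytic second derivative rather than mfx’s numerical one (scalar names are illustrative):

  sysuse auto, clear
  quietly logit foreign headroom trunk

  quietly summarize headroom
  scalar hbar = r(mean)
  quietly summarize trunk
  scalar tbar = r(mean)

  scalar xbbar  = _b[headroom]*hbar + _b[trunk]*tbar + _b[_cons]
  scalar pbar   = invlogit(xbbar)
  scalar dfdxb  = pbar*(1-pbar)              // df/d(xb) at the means
  scalar d2fdxb = pbar*(1-pbar)*(1-2*pbar)   // d/d(xb) of df/d(xb) at the means

  * i = headroom, j = trunk (i not equal to j):  D_ij = b_i * x_j * d2fdxb
  display "D(headroom,trunk)    = " _b[headroom]*tbar*d2fdxb
  * i = j = headroom:  D_ii = b_i * x_i * d2fdxb + df/d(xb)
  display "D(headroom,headroom) = " _b[headroom]*hbar*d2fdxb + dfdxb

These should match, up to the numerical approximation, the d^2f/dxdb values that mfx, predict(p) tracelvl(2) would report for headroom.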

Now let’s turn to multiple equations. To unravel the formulas, we’ll just stick with two equations. Say we call the variables in the first equation x with coefficients b and the second we’ll call z with coefficients c.

For variable i, which occurs in equation k, we must work out a second derivative for each of the other variables. Let’s say we want to do this for variable j, which occurs in equation l. There are several possibilities with all these indices: either the variables are in the same equation or not. If they are in the same equation, either they are equal or not.

Let’s say that i and j are both in the first equation, and i is not equal to j:

         d      df        d      df       d(xb_1)    df     d(xb_2) 
D_ij = ----- ( ---  ) = ----- ( ------ *  ------  + ----- *  ------)
       db_j    dx_i      db_j   d(xb_1)   dx_i     d(xb_2)    dx_i  

          d      df      
     =  ----- ( ------ *  b_i )
         db_j   d(xb_1)              

since x_i does not appear in the second equation, making d(xb_2)/dx_i=0. Applying the chain rule again results in:

         d      df      
D_ij = ----- ( ------ *  b_i )
      db_j   d(xb_1)  

                 d        df         d(xb_1)     d        df       d(xb_2) 
     = b_i *(  -------  ( ------ )  * -------  + -----  ( ----- ) * ------- )
               d(xb_1)   d(xb_1)      db_j      d(xb_2) d(xb_1)      db_j  

                d          df       
     = b_i *  -------   ( ----- ) * x_j
               d(xb_1)   d(xb_1)     

                     d         df       
     = b_i * x_j * -------  ( ----- ) 
                   d(xb_1)   d(xb_1)     

If i and j are both in the first equation and i is equal to j, we would need the product rule and would end up with

         d      df      
D_ii = ----- ( ------ *  b_i )
        db_i   d(xb_1)  

         df                 d        df        d(xb_1)    
     = ----- * 1 + b_i * (------- ( ------ ) * ------- 
       d(xb_1)            d(xb_1)   d(xb_1)     db_i      

                     d        df       d(xb_2)
                 + ------  ( ----- ) * -------)
                   d(xb_2)  d(xb_1)      db_i

        df                 d          df       
     = ------  +  b_i *  -------  ( ----- ) * x_i
         d(xb_1)          d(xb_1)   d(xb_1)     

                     d        df         df 
     = b_i * x_i * -----   ( ---- )  +  ---- 
                   d(xb_1)  d(xb_1)    d(xb_1)

Now, let’s suppose that i is in equation 1 and j is in equation 2. I’ll use c_j for its coefficient to remind us.

         d      df        d      df       d(xb_1)    df     d(xb_2) 
D_ij = ----- ( ---  ) = ----- ( ------ *  ------  + ----- *  ------)
       dc_j    dx_i      dc_j   d(xb_1)    dx_i     d(xb_2)    dx_i         

          d       df      
     =  ----- ( ------ *  b_i )
         dc_j   d(xb_1)      

                 d          df        d(xb_1)     d        df       d(xb_2) 
     = b_i * ( -------  ( ------ )  * -------  + -----  ( ----- ) * ------- )
               d(xb_1)   d(xb_1)      dc_j      d(xb_2)  d(xb_1)     dc_j  

                 d       df       
     = b_i *  ------- ( ----- ) * z_j
              d(xb_2)  d(xb_1)     

                     d         df       
     = b_i * z_j * -------  ( ----- )
                   d(xb_2)   d(xb_1)         

If we have many equations, it is a lot of work to compute all the second derivatives: d^2f/d(xb_k)d(xb_l), where k, l = 1...s, and s is the number of equations. We can save effort on this because the order of differentiation doesn’t make any difference, meaning:

    d         df          d       df 
  -------  ( ----- )  = ------  ( ---- )
  d(xb_k)   d(xb_l)     d(xb_l) d(xb_k)

We only need to actually go through the calculation for l≤k.

Let’s finish off with a multiple-equation example:

. webuse labor, clear
 
. gen wc=(we>12)
 
. quietly treatreg ww wa cit, treat(wc=wmed wfed) nolog
  
. matrix list e(b)

e(b)[1,9]
            ww:         ww:         ww:         ww:         wc:         wc:
            wa         cit          wc       _cons        wmed        wfed
y1  -.01104239   .12763602   1.2713275   2.3186382   .11980547   .09618863

            wc:     athrho:    lnsigma:
         _cons       _cons       _cons
y1  -2.6318761   .01786677   .92415838

. mfx, predict(ptrt) varlist(wmed wfed) tracelvl(2)

calculating dydx (linear method)

equation i= 1 : df/d(xb) = 0

equation i= 2 : df/d(xb) = .31020517

equation i= 3 : df/d(xb) = 0

equation i= 4 : df/d(xb) = 0

variable dy/dx
wmed .03716
wfed .02984
calculating standard errors (linear method)
equation k = 1:
k= 1, l= 1: d^2f/d(xb_k)d(xb_l) = 0
equation k = 2:
k= 2, l= 1: d^2f/d(xb_k)d(xb_l) = 0
k= 2, l= 2: d^2f/d(xb_k)d(xb_l) = .22004123
equation k = 3:
k= 3, l= 1: d^2f/d(xb_k)d(xb_l) = 0
k= 3, l= 2: d^2f/d(xb_k)d(xb_l) = 0
k= 3, l= 3: d^2f/d(xb_k)d(xb_l) = 0
equation k = 4:
k= 4, l= 1: d^2f/d(xb_k)d(xb_l) = 0
k= 4, l= 2: d^2f/d(xb_k)d(xb_l) = 0
k= 4, l= 3: d^2f/d(xb_k)d(xb_l) = 0
k= 4, l= 4: d^2f/d(xb_k)d(xb_l) = 0

wmed ... continuous
1 d^2f/dxdb = 0
2 d^2f/dxdb = 0
3 d^2f/dxdb = 0
4 d^2f/dxdb = 0
5 d^2f/dxdb = .55104971
6 d^2f/dxdb = .22692532
7 d^2f/dxdb = .02636214
8 d^2f/dxdb = 0
9 d^2f/dxdb = 0
wmed: Std. err. = .00981917

wfed ... continuous
1 d^2f/dxdb = 0
2 d^2f/dxdb = 0
3 d^2f/dxdb = 0
4 d^2f/dxdb = 0
5 d^2f/dxdb = .19336768
6 d^2f/dxdb = .49239749
7 d^2f/dxdb = .02116546
8 d^2f/dxdb = 0
9 d^2f/dxdb = 0
wfed: Std. err. = .00903345

Marginal effects after treatreg
      y  = Pr(wc) (predict, ptrt)
         =  .23905623
variable dy/dx Std. err. z P>|z| [ 95% C.I. ] X
wmed .0371643 .00982 3.78 0.000 .017919 .05641 9.136
wfed .0298382 .00903 3.30 0.001 .012133 .047543 8.608

Looking at the coefficient matrix stored after the estimation, we see there are four equations and nine coefficients. The last two equations contain only constants, so we expect all their derivatives to be zero.

In computing the marginal effects, we see that df/d(xb) is zero for the first equation, which makes sense because the prediction function I specified is the probability of treatment. We also see that df/d(xb) is nonzero for the second equation, as we would expect.

  df/d(xb_2) = .31020517

In computing the standard errors, we see that the second derivatives are only calculated for l≤k, as expected, and that the only nonzero values occur in equation 2:

  d^2f/d(xb_2)d(xb_2)  = .22004165

Let’s check that the nonzero second derivatives agree with our formulas:

                                         d        df           df 
 D(wmed, wmed) = _b[wc:wmed] * wmed * -------  ( ----- )  +  ------  
                                      d(xb_2)   d(xb_2)      d(xb_2)
 
               = .1198055 * 9.136 *.22004165  + .31020517
    
               = .55105

This is the coefficient of wmed multiplied by the mean of wmed and then multiplied by d^2f/d(xb_2)d(xb_2) plus df/d(xb_2). Here are the remaining nonzero derivatives:

                                         d         df
 D(wmed, wfed) =  _b[wc:wmed] * wfed * -----    ( ----- )    
                                      d(xb_2)    d(xb_2)      

               = .1198055 *  8.608 * .22004165 

               = .22692582

                                         d         df          
D(wmed, _cons) = _b[wc:wmed] * _cons * -----    ( ----- )
                                      d(xb_2)    d(xb_2)      

               = .1198055  * 1  * .22004165 

               = .0263622
                              
                                         d        df    
 D(wfed, wmed) = _b[wc:wfed] * wmed * -------  ( ----- )   
                                      d(xb_2)   d(xb_2)     

               = .0961886 * 9.136 *.22004165  
       
               = .19336

                                         d         df            df 
 D(wfed, wfed) =  _b[wc:wfed] * wfed * -----    ( ----- )   +  ------   
                                      d(xb_2)    d(xb_2)       d(xb_2)

               = .0961886 *  8.608 * .22004165  + .31020517

               = .492397

                                         d          df          
D(wfed, _cons) = _b[wc:wfed] * _cons * -----    ( ------ )    
                                      d(xb_2)     d(xb_2)      

               = .0961886  * 1  * .22004165 

               = .0211655
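
As a final cross-check, these derivatives can be fed back through the delta-method formula. Here is a do-file-style sketch that refits the model and builds the D vector for wmed from the trace output above, in the order of the coefficients in e(b); the matrix names are illustrative, and the result should reproduce the .00981917 that mfx reported:

  webuse labor, clear
  gen wc = (we>12)
  quietly treatreg ww wa cit, treat(wc=wmed wfed) nolog

  matrix V      = e(V)
  matrix D_wmed = (0 \ 0 \ 0 \ 0 \ .55104971 \ .22692532 \ .02636214 \ 0 \ 0)
  matrix v2     = D_wmed' * V * D_wmed
  display "Std. err. of marginal effect of wmed = " sqrt(v2[1,1])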