Note: This FAQ is for Stata 10 and older versions of Stata. In Stata 11,
the margins command replaced mfx.

What is the difference between the linear and nonlinear methods that mfx uses?

Title:  Methods for obtaining marginal effects
Author: May Boggess, StataCorp
Date:   March 2004; minor revisions October 2005; minor revisions August 2007
After an estimation, the command mfx calculates
marginal effects. A marginal effect of an independent variable x is the
partial derivative, with respect to x, of the prediction function f
specified in the mfx command’s
predict option. If no prediction function is
specified, the default prediction for the preceding estimation command is
used. This derivative is evaluated at the values of the independent
variables specified in the at() option of the
mfx command; if no values are specified, it is
evaluated at the default values, which are the means of the independent
variables. If there were any offsets in the preceding estimation, the
derivative is evaluated at the means of the offset variables.
The derivative is calculated numerically by mfx,
meaning that it approximates the derivative by using the following formula
with an appropriately small h:
Marginal effect of x = df/dx = lim(h->0) [f(x+h) - f(x)] / h
In the above equation, I have been a bit lazy. I wrote the prediction
function f with only one argument. This may allow us to forget that f is a
function of all the independent variables in the model—let’s call them
x_i, i=1...p−1—and their coefficients, b_i, i=1...p−1.
We'll call the constant term b_p, and, for reasons that will become clear
later on, I will define x_p=1. So, for f(x) I write
f(x_1, ..., x_i, ..., x_p, b_1, b_2, ..., b_p)
Using this notation would remind us that we are taking a partial derivative,
so that all the other variables are being held constant, each x at its mean,
and the coefficients at the values estimated by the previous estimation
command. (Using this notation would also make the formulas long and
cumbersome!)
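Before going further, let's make the difference quotient concrete. The
following sketch (mine, not mfx's code) fits a logit, evaluates the predicted
probability at the means of the regressors, and approximates the marginal
effect of mpg using one fixed small h; mfx, by contrast, chooses h adaptively
rather than fixing it like this:

* sketch only: approximate the marginal effect of mpg on Pr(foreign)
* after a logit fit, evaluating at the means of the regressors
sysuse auto, clear
quietly logit foreign mpg weight
quietly summarize mpg
scalar m_mpg = r(mean)
quietly summarize weight
scalar m_wgt = r(mean)
scalar h  = 1e-5        // arbitrary small step, not mfx's adaptive h
scalar f0 = invlogit(_b[mpg]*m_mpg     + _b[weight]*m_wgt + _b[_cons])
scalar f1 = invlogit(_b[mpg]*(m_mpg+h) + _b[weight]*m_wgt + _b[_cons])
display "approximate marginal effect of mpg = " (f1-f0)/h

The result should be close to the dy/dx that mfx reports for mpg after the
same logit, since both estimate the same derivative.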
A formula for the standard error of the marginal effect is obtained using
the delta method
Standard error of marginal effect of x = sqrt( D_x' * V * D_x )
where V is the variance–covariance matrix from the estimation and D_x
is the column vector whose jth entry is the partial derivative of the
marginal effect of x with respect to the coefficient of the jth independent
variable (a second derivative of f):
D_xj = d/db_j ( df/dx )
Thus, to compute a single standard error, we must compute the derivative of
the marginal effect with respect to each coefficient in the model.
Computing the derivative of a function f with respect to a variable x can be
time consuming because an iterative algorithm must be used to seek out an
appropriate change in x (the h). The mfx command
can avoid this type of iteration in two situations.
The first is when the variable x is not continuous but is a dummy variable;
in other words, the values of x can only be 0 or 1. In that case, asking for
the value of the derivative at the average value of x doesn’t make a
lot of sense. Instead, mfx computes the
slope of the line between f(0) and f(1). In other words, for a dummy
variable it uses h=1. Let’s verify this with an example:
. sysuse auto
(1978 Automobile Data)
. mlogit rep78 mpg foreign, nolog
Multinomial logistic regression Number of obs = 69
LR chi2(8) = 35.56
Prob > chi2 = 0.0000
Log likelihood = -75.911059 Pseudo R2 = 0.1898
------------------------------------------------------------------------------
rep78 | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
1 |
mpg | .0827863 .1432122 0.58 0.563 -.1979045 .3634771
foreign | -33.90739 4.24e+07 -0.00 1.000 -8.30e+07 8.30e+07
_cons | -4.257833 3.072114 -1.39 0.166 -10.27907 1.763401
-------------+----------------------------------------------------------------
2 |
mpg | .0017101 .0903729 0.02 0.985 -.1754176 .1788377
foreign | -33.5392 2.03e+07 -0.00 1.000 -3.98e+07 3.98e+07
_cons | -1.249072 1.774006 -0.70 0.481 -4.72606 2.227917
-------------+----------------------------------------------------------------
4 |
mpg | .0399522 .0701507 0.57 0.569 -.0975406 .177445
foreign | 2.059772 .8017571 2.57 0.010 .4883572 3.631187
_cons | -1.877722 1.432919 -1.31 0.190 -4.686192 .930748
-------------+----------------------------------------------------------------
5 |
mpg | .1804501 .087006 2.07 0.038 .0099215 .3509788
foreign | 3.04875 1.038099 2.94 0.003 1.014113 5.083388
_cons | -6.452018 2.135251 -3.02 0.003 -10.63703 -2.267004
------------------------------------------------------------------------------
(rep78==3 is the base outcome)
. mfx, predict(p outcome(1)) varlist(foreign)
Marginal effects after mlogit
y = Pr(rep78==1) (predict, p outcome(1))
= 1.465e-06
------------------------------------------------------------------------------
variable | dy/dx Std. Err. z P>|z| [ 95% C.I. ] X
---------+--------------------------------------------------------------------
foreign*| -.0455277 .03156 -1.44 0.149 -.107391 .016335 .304348
------------------------------------------------------------------------------
(*) dy/dx is for discrete change of dummy variable from 0 to 1
. summarize mpg
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
mpg | 74 21.2973 5.785503 12 41
. replace mpg=r(mean)
mpg was int now float
(74 real changes made)
. replace foreign=0
(22 real changes made)
. predict p0, p outcome(1)
. replace foreign=1
(74 real changes made)
. predict p1, p outcome(1)
. display _n "my marginal effect for foreign = " p1 - p0
my marginal effect for foreign = -.04554925
Computing the standard errors of marginal effects of dummy variables is not
as efficient because it requires taking derivatives with respect to the
coefficients, and the coefficients are continuous variables.
The second case in which it is easy to get a derivative is when the function
is linear. Remember that the slope of a straight line is “rise over
run”, and it doesn’t matter what you use for run; it also
doesn’t matter which two points on the line you use to get the rise
over run, because the line is straight.
As I said above, our prediction function depends on the variables and their
coefficients. If we find that the function f depends on these in a
particular way, we can get a marked simplification.
We may find that the only way the x’s and b’s appear in the
formula for f is through the sum sum_i x_i*b_i.
As an example, we can use the predicted probability of success following
logistic:
f(x_1, ..., x_i, ..., x_p, b_1, b_2, ..., b_p)
   = exp(sum_i x_i*b_i) / (1 + exp(sum_i x_i*b_i))
Let’s see this at work:
. sysuse auto, clear
(1978 Automobile Data)
. logistic foreign mpg turn
Logistic regression Number of obs = 74
LR chi2(2) = 38.66
Prob > chi2 = 0.0000
Log likelihood = -25.703368 Pseudo R2 = 0.4292
------------------------------------------------------------------------------
foreign | Odds Ratio Std. Err. z p>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
mpg | .9275081 .0645274 -1.08 0.279 .8092804 1.063008
turn | .5365028 .0831724 -4.02 0.000 .3959249 .7269946
------------------------------------------------------------------------------
. predict p, p
. gen myp= exp(_b[mpg]*mpg + _b[turn] *turn + _b[_cons])/(1+exp(_b[mpg]*mpg +
> _b[turn] * turn + _b[_cons]))
. summarize p myp
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
p | 74 .2972973 .3159947 .0002891 .9721519
myp | 74 .2972973 .3159947 .0002891 .9721519
In this example, our prediction function was

f(mpg, turn, 1, _b[mpg], _b[turn], _b[_cons])
   = exp(_b[mpg]*mpg + _b[turn]*turn + _b[_cons])
     / (1 + exp(_b[mpg]*mpg + _b[turn]*turn + _b[_cons]))
In general, we express this condition as
f(x_1, ..., x_i, ..., x_p, b_1, b_2, ..., b_p) = f(sum_i x_i*b_i)
This is called the “linear method condition”.
How would this condition not be satisfied? You would have to treat some
independent variables differently than others. This can occur in a
multiple-equation model. Let’s say we have two equations,
calling the variables in the first equation x with coefficients b and those
in the second z with coefficients c. Then the linear condition would be
f(x_1, ..., x_p, b_1, ..., b_p, z_1, ..., z_q, c_1, ..., c_q)
= f(sum_i x_i*b_i, sum_j z_j*c_j)
where the x’s and z’s have no variables in common. Let’s
look at an example using
biprobit:
. webuse school, clear
. biprobit (private years) (vote logptax loginc), nolog
Seemingly unrelated bivariate probit Number of obs = 95
Wald chi2(3) = 7.61
Log likelihood = -90.161121 Prob > chi2 = 0.0547
------------------------------------------------------------------------------
| Coef. Std. Err. z p>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
private |
years | -.0188994 .0250596 -0.75 0.451 -.0680154 .0302166
_cons | -1.103314 .2531021 -4.36 0.000 -1.599385 -.6072429
-------------+----------------------------------------------------------------
vote |
logptax | -1.166334 .557388 -2.09 0.036 -2.258795 -.0738738
loginc | 1.12626 .4385527 2.57 0.010 .2667124 1.985807
_cons | -2.813394 3.676768 -0.77 0.444 -10.01973 4.392939
-------------+----------------------------------------------------------------
/athrho | -.2851457 .2398178 -1.19 0.234 -.7551801 .1848886
-------------+----------------------------------------------------------------
rho | -.2776609 .2213289 -.6382291 .1828103
------------------------------------------------------------------------------
Likelihood-ratio test of rho=0: chi2(1) = 1.48789 Prob > chi2 = 0.2225
. mfx, predict(p10) tracelvl(1)
calculating dydx (linear method)
-------------------------
variable | dy/dx
---------+---------------
years | -.00174
logptax | .0489
loginc | -.04722
-------------------------
calculating standard errors (linear method)
years ... continuous
Step: 1 2 3 4 5 6
years: Std. Err. = .00241497
logptax ... continuous
Step: 1 2 3 4 5 6
logptax: Std. Err. = .02786333
loginc ... continuous
Step: 1 2 3 4 5 6
loginc: Std. Err. = .023424
Marginal effects after biprobit
y = Pr(private=1,vote=0) (predict, p10)
= .05831027
------------------------------------------------------------------------------
variable | dy/dx Std. Err. z p>|z| [ 95% C.I. ] X
---------+--------------------------------------------------------------------
years | -.0017351 .00241 -0.72 0.472 -.006468 .002998 8.51579
logptax | .0489027 .02786 1.76 0.079 -.005708 .103514 6.9395
loginc | -.0472225 .02342 -2.02 0.044 -.093133 -.001312 9.97102
------------------------------------------------------------------------------
In this example, there is one x, years, and two z's,
logptax and
loginc. There are no variables in common
in the equations, and the formulas for the prediction function contain only
the sums. The tracelvl(1) option shows that
mfx used the linear method.
Here is an example in which one variable appears in both equations:
. biprobit (private logptax years) (vote logptax loginc), nolog
Seemingly unrelated bivariate probit Number of obs = 95
Wald chi2(4) = 7.65
Log likelihood = -90.13665 Prob > chi2 = 0.1052
------------------------------------------------------------------------------
| Coef. Std. Err. z p>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
private |
logptax | .1241868 .5630383 0.22 0.825 -.979348 1.227722
years | -.0176921 .0256939 -0.69 0.491 -.0680513 .0326671
_cons | -1.977172 3.971164 -0.50 0.619 -9.76051 5.806167
-------------+----------------------------------------------------------------
vote |
logptax | -1.178059 .560322 -2.10 0.036 -2.27627 -.0798483
loginc | 1.125638 .4386281 2.57 0.010 .2659429 1.985333
_cons | -2.725785 3.698396 -0.74 0.461 -9.974508 4.522938
-------------+----------------------------------------------------------------
/athrho | -.282185 .2397268 -1.18 0.239 -.752041 .1876709
-------------+----------------------------------------------------------------
rho | -.2749262 .2216072 -.636365 .1854982
------------------------------------------------------------------------------
Likelihood-ratio test of rho=0: chi2(1) = 1.45829 Prob > chi2 = 0.2272
. mfx, predict(xb1) tracelvl(1)
calculating dydx (nonlinear method)
-------------------------
variable | dy/dx
---------+---------------
logptax | .12419
years | -.01769
loginc | 0
-------------------------
calculating standard errors (nonlinear method)
logptax ... continuous
1 2 3 4 5 6 7
logptax: Std. Err. = .56303838
years ... continuous
1 2 3 4 5 6 7
years: Std. Err. = .02569392
loginc ... continuous
1 2 3 4 5 6 7
loginc: Std. Err. = 0
Marginal effects after biprobit
y = linear prediction of private (predict, xb1)
= -1.2660402
------------------------------------------------------------------------------
variable | dy/dx Std. Err. z p>|z| [ 95% C.I. ] X
---------+--------------------------------------------------------------------
logptax | .1241868 .56304 0.22 0.825 -.979348 1.22772 6.9395
years | -.0176921 .02569 -0.69 0.491 -.068051 .032667 8.51579
loginc | 0 0 . . 0 0 9.97102
------------------------------------------------------------------------------
. predict xb1,xb1
. gen myxb1=_b[private:logptax]*logptax +
> _b[private:years]*years+_b[private:_cons]
. summarize xb1 myxb1
Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
xb1 | 95 -1.26604 .1854367 -2.00642 -1.081714
myxb1 | 95 -1.26604 .1854367 -2.00642 -1.081714
Now, you might think that with the predict option xb1,
mfx surely would have used the linear method. But,
remember we have x_1=logptax, x_2=years, z_1=logptax,
and z_2=loginc. The prediction function is
xb1 = f(x_1, ..., x_p, b_1, ..., b_p, z_1, ..., z_q, c_1, ..., c_q)
    = f(x_1, x_2, 1, b_1, b_2, b_3, z_1, z_2, 1, c_1, c_2, c_3)
    = x_1*b_1 + x_2*b_2 + b_3
    = logptax*b_1 + years*b_2 + b_3
    = z_1*b_1 + x_2*b_2 + b_3,
where b_3 is the constant from equation 1. The last line shows that z_1
appears in this formula, but not as part of the sum sum_j z_j*c_j (z_1
appears without its coefficient c_1), so the linear condition is not
satisfied.
How does mfx tell whether the linear condition is satisfied? It looks for
variables that are common to multiple equations because, if there are any,
even the simplest prediction (i.e., xb) will not satisfy the linear
condition.
As you have seen in the above examples, using the option
tracelvl(1) shows you, among other things,
which method mfx is using, either linear or
nonlinear. Now, we will go through the details of the two methods.
Nonlinear method
As we said at the top, for the marginal effect of x,
mfx uses
Marginal effect of x = df/dx = lim(h->0) [f(x+h) - f(x)] / h
It takes an initial guess at a good value for h, and then, if the value is
not good enough, it iterates (making h larger or smaller) until it finds an
h that is good enough. Here “good enough” means it will
produce the desired level of accuracy.
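mfx’s stopping rule is internal to the command, but we can get a feel for why
iteration over h matters by watching the difference quotient settle down as h
shrinks. This is only a rough illustration of mine, not mfx’s actual
algorithm:

* sketch only: the difference quotient for the marginal effect of weight
* stabilizes as h gets smaller
sysuse auto, clear
quietly logit foreign mpg weight
quietly summarize mpg
scalar m_mpg = r(mean)
quietly summarize weight
scalar m_wgt = r(mean)
scalar f0 = invlogit(_b[mpg]*m_mpg + _b[weight]*m_wgt + _b[_cons])
foreach h of numlist 100 10 1 .1 .01 {
        scalar f1 = invlogit(_b[mpg]*m_mpg + _b[weight]*(m_wgt+`h') + _b[_cons])
        display "h = `h'   (f(x+h)-f(x))/h = " (f1-f0)/`h'
}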
For the standard error, we have the formula from the delta method:
Standard error of marginal effect of x = sqrt( D_x' * V * D_x )
where V is the variance–covariance matrix from the estimation and D_x
is the column vector whose jth entry is the partial derivative of the
marginal effect of x with respect to the coefficient of the jth independent
variable (a second derivative of f):
D_xj = d/db_j ( df/dx )
Each of these second derivatives is calculated in the same way. We take an
initial guess at a good, small change in b_j, and, if that doesn’t meet the
accuracy criteria, we iterate, making it smaller or larger, until it does.
If we use the option tracelvl(2) on
mfx, we can watch the calculation of each of these
second derivatives as it occurs:
. webuse school, clear
. biprobit (private logptax years) (vote logptax loginc), nolog
Seemingly unrelated bivariate probit Number of obs = 95
Wald chi2(4) = 7.65
Log likelihood = -90.13665 Prob > chi2 = 0.1052
------------------------------------------------------------------------------
| Coef. Std. Err. z p>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
private |
logptax | .1241868 .5630383 0.22 0.825 -.979348 1.227722
years | -.0176921 .0256939 -0.69 0.491 -.0680513 .0326671
_cons | -1.977172 3.971164 -0.50 0.619 -9.76051 5.806167
-------------+----------------------------------------------------------------
vote |
logptax | -1.178059 .560322 -2.10 0.036 -2.27627 -.0798483
loginc | 1.125638 .4386281 2.57 0.010 .2659429 1.985333
_cons | -2.725785 3.698396 -0.74 0.461 -9.974508 4.522938
-------------+----------------------------------------------------------------
/athrho | -.282185 .2397268 -1.18 0.239 -.752041 .1876709
-------------+----------------------------------------------------------------
rho | -.2749262 .2216072 -.636365 .1854982
------------------------------------------------------------------------------
Likelihood-ratio test of rho=0: chi2(1) = 1.45829 Prob > chi2 = 0.2272
. mfx, predict(xb1) tracelvl(2)
calculating dydx (nonlinear method)
-------------------------
variable | dy/dx
---------+---------------
logptax | .12419
years | -.01769
loginc | 0
-------------------------
calculating standard errors (nonlinear method)
logptax ... continuous
1 d^2f/dxdb = .99999988
2 d^2f/dxdb = 0
3 d^2f/dxdb = 0
4 d^2f/dxdb = 0
5 d^2f/dxdb = 0
6 d^2f/dxdb = 0
7 d^2f/dxdb = 0
logptax: Std. Err. = .56303824
years ... continuous
1 d^2f/dxdb = 0
2 d^2f/dxdb = 1.0000011
3 d^2f/dxdb = 0
4 d^2f/dxdb = 0
5 d^2f/dxdb = 0
6 d^2f/dxdb = 0
7 d^2f/dxdb = 0
years: Std. Err. = .02569397
loginc ... continuous
1 d^2f/dxdb = 0
2 d^2f/dxdb = 0
3 d^2f/dxdb = 0
4 d^2f/dxdb = 0
5 d^2f/dxdb = 0
6 d^2f/dxdb = 0
7 d^2f/dxdb = 0
loginc: Std. Err. = 0
Marginal effects after biprobit
y = linear prediction of private (predict, xb1)
= -1.2660402
------------------------------------------------------------------------------
variable | dy/dx Std. Err. z P>|z| [ 95% C.I. ] X
---------+--------------------------------------------------------------------
logptax | .1241868 .56304 0.22 0.825 -.979348 1.22772 6.9395
years | -.0176921 .02569 -0.69 0.491 -.068051 .032667 8.51579
loginc | 0 0 . . 0 0 9.97102
------------------------------------------------------------------------------
The result here is quite predictable. Since our prediction function is
xb1, we would expect the coefficients and
their standard errors as the answers. The second derivatives all turn out
to be zeros and ones, so pre- and postmultiplying the covariance matrix V by
D_x (in the standard error formula) is simply picking the relevant entry out
of the diagonal of V.
Linear method
To figure out which formulas mfx is using for
prediction functions that satisfy the linear condition, let’s start with the
easiest case, a single equation estimation. So, say we have variables x
with coefficients b. (As I did above, I am saying the constant is the
coefficient of a variable that is equal to one.) Then, the linear condition
would be
f(x_1, ..., x_p, b_1, ..., b_p) = f(sum_i x_i*b_i)
which every prediction function for a single-equation estimator is going to
satisfy. In the interests of trimming down our notation, write
xb = sum_i x_i*b_i
I can use the chain rule to get the derivative of f with respect to any of
my x’s:
Marginal effect of x_i = df/dx_i = df/d(xb) * d(xb)/dx_i = df/d(xb) * b_i
The same df/d(xb) is used for every x_i. We only actually have to work out
one derivative, df/d(xb), and then we get all the marginal effects simply by
multiplying it by the appropriate coefficient. For this reason, the linear
method is much faster than the nonlinear.
mfx calculates df/d(xb) in the usual way, iterating
to find h and calculating (f(xb+h)-f(xb))/h; the marginal effects are then
obtained by multiplying this derivative by the appropriate b_i.
Let’s look at an example:
. sysuse auto, clear
(1978 Automobile Data)
. regress mpg weight length
Source | SS df MS Number of obs = 74
-------------+------------------------------ F( 2, 71) = 69.34
Model | 1616.08062 2 808.040312 Prob > F = 0.0000
Residual | 827.378835 71 11.653223 R-squared = 0.6614
-------------+------------------------------ Adj R-squared = 0.6519
Total | 2443.45946 73 33.4720474 Root MSE = 3.4137
------------------------------------------------------------------------------
mpg | Coef. Std. Err. t p>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
weight | -.0038515 .001586 -2.43 0.018 -.0070138 -.0006891
length | -.0795935 .0553577 -1.44 0.155 -.1899736 .0307867
_cons | 47.88487 6.08787 7.87 0.000 35.746 60.02374
------------------------------------------------------------------------------
. mfx, predict(xb) nose tracelvl(2)
calculating dydx (linear method)
equation i= 1 : df/d(xb) = 1
-------------------------
variable | dy/dx
---------+---------------
weight | -.00385
length | -.07959
-------------------------
Marginal effects after regress
y = Linear prediction (predict, xb)
= 21.297297
-------------------------------------------------------------------------------
variable | dy/dx X
---------------------------------+---------------------------------------------
weight | -.0038515 3019.46
length | -.0795935 187.932
-------------------------------------------------------------------------------
df/d(xb) turned out to be 1. That's no surprise because f(xb)=xb.
Let’s try something more interesting.
The prediction function we will use in this next example is the probability
of success from a logistic regression:
f(xb) = exp(xb) / (1 + exp(xb))

Differentiating this with respect to xb using the quotient rule gives

df/d(xb) = exp(xb) / (1 + exp(xb))^2
Let’s see how this checks out:
. logit foreign weight length, nolog
Logistic regression Number of obs = 74
LR chi2(2) = 31.99
Prob > chi2 = 0.0000
Log likelihood = -29.040272 Pseudo R2 = 0.3551
------------------------------------------------------------------------------
foreign | Coef. Std. Err. z p>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
weight | -.0028419 .0016763 -1.70 0.090 -.0061273 .0004436
length | .0089197 .0542959 0.16 0.870 -.0974983 .1153376
_cons | 5.366227 5.77534 0.93 0.353 -5.953232 16.68568
------------------------------------------------------------------------------
. mfx, predict(p) nose tracelvl(2)
calculating dydx (linear method)
equation i= 1 : df/d(xb) = .14552699
-------------------------
variable | dy/dx
---------+---------------
weight | -.00041
length | .0013
-------------------------
Marginal effects after logit
y = Pr(foreign) (predict, p)
= .17677715
-------------------------------------------------------------------------------
variable | dy/dx X
---------------------------------+---------------------------------------------
weight | -.0004136 3019.46
length | .0012981 187.932
-------------------------------------------------------------------------------
. quietly summarize weight
. quietly replace weight=r(mean)
. quietly summarize length
. quietly replace length=r(mean)
. list weight length in 1
+-------------------+
| weight length |
|-------------------|
1. | 3,019.5 187.932 |
+-------------------+
. predict xb, xb
. gen my_dfdxb= exp(xb)*(1+exp(xb))^(-2)
. list my_dfdxb in 1
+----------+
| my_dfdxb |
|----------|
1. | .145527 |
+----------+
. display " marg. eff weight = " _b[weight]*my_dfdxb
marg. eff weight = -.00041357
. display " marg. eff length = " _b[length]*my_dfdxb
marg. eff length = .00129805
In the next example, we show that the same calculations are used for
multiple equations with no variables in common. In this case, we compute
df/d(xb) for each equation:
. sysuse auto, clear
(1978 Automobile Data)
. reg3 (mpg weight) (mpg length)
Three-stage least-squares regression
----------------------------------------------------------------------
Equation Obs Parms RMSE "R-sq" chi2 P
----------------------------------------------------------------------
mpg 74 1 3.72811 0.5791 90.04 0.0000
2mpg 74 1 3.800181 0.5626 83.04 0.0000
----------------------------------------------------------------------
------------------------------------------------------------------------------
| Coef. Std. Err. z p>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
mpg |
weight | -.0040049 .0004221 -9.49 0.000 -.0048321 -.0031777
_cons | 33.38998 1.33402 25.03 0.000 30.77535 36.00461
-------------+----------------------------------------------------------------
2mpg |
length | -.1377196 .0151131 -9.11 0.000 -.1673407 -.1080984
_cons | 47.17927 2.868909 16.45 0.000 41.55631 52.80223
------------------------------------------------------------------------------
Endogenous variables: mpg
Exogenous variables: weight length
------------------------------------------------------------------------------
. mfx, predict(xb equation(#1)) nose tracelvl(2)
calculating dydx (linear method)
equation i= 1 : df/d(xb) = 1
equation i= 2 : df/d(xb) = 0
-------------------------
variable | dy/dx
---------+---------------
weight | -.004
length | 0
-------------------------
Marginal effects after reg3
y = Linear prediction (predict, xb equation(#1))
= 21.297297
------------------------------------------------------------------------------
variable | dy/dx X
---------------------------------+--------------------------------------------
weight | -.0040049 3019.46
length | 0 187.932
------------------------------------------------------------------------------
Here, the prediction function is
f(weight, 1, _b[mpg:weight], _b[mpg:_cons], length, 1, _b[2mpg:length],
_b[2mpg:_cons])
   = _b[mpg:weight]*weight + _b[mpg:_cons].
For the first equation, df/d(xb)=1. For the second equation, df/d(xb)=0
because the sum (_b[2mpg:length]*length + _b[2mpg:_cons]) does not appear in
the prediction function. If that's correct, when we use the second equation
prediction, it should be the other way around:
. mfx, predict(xb equation(#2)) nose tracelvl(2)
calculating dydx (linear method)
equation i= 1 : df/d(xb) = 0
equation i= 2 : df/d(xb) = 1
-------------------------
variable | dy/dx
---------+---------------
weight | 0
length | -.13772
-------------------------
Marginal effects after reg3
y = Linear prediction (predict, xb equation(#2))
= 21.297297
------------------------------------------------------------------------------
variable | dy/dx X
---------------------------------+--------------------------------------------
weight | 0 3019.46
length | -.1377196 187.932
------------------------------------------------------------------------------
Now let’s move on to the standard errors. Again, let’s start
with a single-equation estimator. Say we have variables x with
coefficients b, and we are trying to calculate the second derivatives:
D_ij = d/db_j ( df/dx_i )
If i is not equal to j, using the chain rule, we have
D_ij = d/db_j ( df/dx_i ) = d/db_j ( df/d(xb) * b_i )
     = b_i * d/db_j ( df/d(xb) )
     = b_i * d/d(xb) ( df/d(xb) ) * d(xb)/db_j
     = b_i * d/d(xb) ( df/d(xb) ) * x_j
     = b_i * x_j * d/d(xb) ( df/d(xb) )
If i is equal to j, we don't get out of it so easily on the first line.
We’ll need to use the product rule there:
D_ii = d/db_i ( df/dx_i ) = d/db_i ( df/d(xb) * b_i )
     = df/d(xb) * 1 + b_i * d/db_i ( df/d(xb) )
     = b_i * x_i * d/d(xb) ( df/d(xb) ) + df/d(xb)
We already have df/d(xb) calculated from before (the marginal effect
calculation), so all we have to do is the second derivative,
d^2f/d(xb)d(xb). This is calculated by mfx using
the chain rule:
d/dx_i ( df/d(xb) ) = d/d(xb) ( df/d(xb) ) * b_i

so

d/d(xb) ( df/d(xb) ) = d/dx_i ( df/d(xb) ) / b_i
This is true for any x_i, so mfx uses x_1. It finds
the derivative with respect to x_1 by iterating to find h, calculating
(f(x+h)−f(x))/h, and then dividing it by b_1.
Time for an example:
. logit foreign headroom trunk, nolog
Logistic regression Number of obs = 74
LR chi2(2) = 10.60
Prob > chi2 = 0.0050
Log likelihood = -39.73265 Pseudo R2 = 0.1177
------------------------------------------------------------------------------
foreign | Coef. Std. Err. z p>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
headroom | -.2969306 .4487154 -0.66 0.508 -1.176397 .5825354
trunk | -.1708735 .0906542 -1.88 0.059 -.3485526 .0068055
_cons | 2.222753 1.084936 2.05 0.040 .0963174 4.349189
------------------------------------------------------------------------------
. mfx, predict(xb) tracelvl(2)
calculating dydx (linear method)
equation i= 1 : df/d(xb) = 1
-------------------------
variable | dy/dx
---------+---------------
headroom | -.29693
trunk | -.17087
-------------------------
calculating standard errors (linear method)
equation k = 1:
k= 1, l= 1: d^2f/d(xb_k)d(xb_l) = 0
headroom ... continuous
1 d^2f/dxdb = 1
2 d^2f/dxdb = 0
3 d^2f/dxdb = 0
headroom: Std. Err. = .44871539
trunk ... continuous
1 d^2f/dxdb = 0
2 d^2f/dxdb = 1
3 d^2f/dxdb = 0
trunk: Std. Err. = .09065423
Marginal effects after logit
y = Linear prediction (predict, xb)
= -1.016698
------------------------------------------------------------------------------
variable | dy/dx Std. Err. z p>|z| [ 95% C.I. ] X
---------+--------------------------------------------------------------------
headroom | -.2969306 .44872 -0.66 0.508 -1.1764 .582535 2.99324
trunk | -.1708735 .09065 -1.88 0.059 -.348553 .006806 13.7568
------------------------------------------------------------------------------
Here the prediction function is
f(headroom, trunk, 1, _b[headroom], _b[trunk], _b[_cons])
= _b[headroom]*headroom + _b[trunk]*trunk + _b[_cons]
We can see that df/d(xb)=1, and by this, I mean df/d(xb) is a constant
function. When we take the derivative of df/d(xb) with respect to any x_i,
we get zero. When we substitute this into the above formulas for D_ij, we
will get 1 when i=j and zero otherwise.
Now let’s turn to multiple equations. To unravel the formulas,
we’ll just stick with two equations. Say we call the variables in
the first equation x with coefficients b and the second we’ll call z
with coefficients c.
For variable i, which occurs in equation k, we must work out a second
derivative with respect to the coefficient of each variable in the model.
Let’s say we want to do this for variable j, which occurs in equation l.
There are several possibilities with all these indices: either the two
variables are in the same equation or they are not, and if they are in the
same equation, either they are the same variable or they are not.
Let’s say that i and j are both in the first equation, and i is
not equal to j:
D_ij = d/db_j ( df/dx_i )
     = d/db_j ( df/d(xb_1) * d(xb_1)/dx_i + df/d(xb_2) * d(xb_2)/dx_i )
     = d/db_j ( df/d(xb_1) * b_i )
since x_i does not appear in the second equation, making d(xb_2)/dx_i=0.
Applying the chain rule again results in:
D_ij = d/db_j ( df/d(xb_1) * b_i )
     = b_i * ( d/d(xb_1) ( df/d(xb_1) ) * d(xb_1)/db_j
               + d/d(xb_2) ( df/d(xb_1) ) * d(xb_2)/db_j )
     = b_i * d/d(xb_1) ( df/d(xb_1) ) * x_j
     = b_i * x_j * d/d(xb_1) ( df/d(xb_1) )
If i and j are both in the first equation and i is equal to j, we would need
the product rule and would end up with
D_ii = d/db_i ( df/d(xb_1) * b_i )
     = df/d(xb_1) * 1 + b_i * ( d/d(xb_1) ( df/d(xb_1) ) * d(xb_1)/db_i
                                + d/d(xb_2) ( df/d(xb_1) ) * d(xb_2)/db_i )
     = df/d(xb_1) + b_i * d/d(xb_1) ( df/d(xb_1) ) * x_i
     = b_i * x_i * d/d(xb_1) ( df/d(xb_1) ) + df/d(xb_1)
Now, let’s suppose that i is in equation 1 and j is in equation 2.
I’ll use c_j for its coefficient to remind us.
D_ij = d/dc_j ( df/dx_i )
     = d/dc_j ( df/d(xb_1) * d(xb_1)/dx_i + df/d(xb_2) * d(xb_2)/dx_i )
     = d/dc_j ( df/d(xb_1) * b_i )
     = b_i * ( d/d(xb_1) ( df/d(xb_1) ) * d(xb_1)/dc_j
               + d/d(xb_2) ( df/d(xb_1) ) * d(xb_2)/dc_j )
     = b_i * d/d(xb_2) ( df/d(xb_1) ) * z_j
     = b_i * z_j * d/d(xb_2) ( df/d(xb_1) )
If we have many equations, it is a lot of work to compute all the second
derivatives: d^2f/d(xb_k)d(xb_l), where k, l = 1...s, and s is the number of
equations. We can save effort on this because the order of differentiation
doesn’t make any difference, meaning:
d/d(xb_k) ( df/d(xb_l) ) = d/d(xb_l) ( df/d(xb_k) )
We only need to actually go through the calculation for l≤k.
Let’s finish off with a multiple-equation example:
. webuse labor, clear
. gen wc=(we>12)
. quietly treatreg ww wa cit, treat(wc=wmed wfed) nolog
. matrix list e(b)
e(b)[1,9]
ww: ww: ww: ww: wc: wc: wc:
wa cit wc _cons wmed wfed _cons
y1 -.01104239 .12763602 1.2713275 2.3186382 .11980547 .09618863 -2.6318761
athrho: lnsigma:
_cons _cons
y1 .01786677 .92415838
. mfx, predict(ptrt) varlist(wmed wfed) tracelvl(2)
calculating dydx (linear method)
equation i= 1 : df/d(xb) = 0
equation i= 2 : df/d(xb) = .31020517
equation i= 3 : df/d(xb) = 0
equation i= 4 : df/d(xb) = 0
-------------------------
variable | dy/dx
---------+---------------
wmed | .03716
wfed | .02984
-------------------------
calculating standard errors (linear method)
equation k = 1:
k= 1, l= 1: d^2f/d(xb_k)d(xb_l) = 0
equation k = 2:
k= 2, l= 1: d^2f/d(xb_k)d(xb_l) = 0
k= 2, l= 2: d^2f/d(xb_k)d(xb_l) = .22004156
equation k = 3:
k= 3, l= 1: d^2f/d(xb_k)d(xb_l) = 0
k= 3, l= 2: d^2f/d(xb_k)d(xb_l) = 0
k= 3, l= 3: d^2f/d(xb_k)d(xb_l) = 0
equation k = 4:
k= 4, l= 1: d^2f/d(xb_k)d(xb_l) = 0
k= 4, l= 2: d^2f/d(xb_k)d(xb_l) = 0
k= 4, l= 3: d^2f/d(xb_k)d(xb_l) = 0
k= 4, l= 4: d^2f/d(xb_k)d(xb_l) = 0
wmed ... continuous
1 d^2f/dxdb = 0
2 d^2f/dxdb = 0
3 d^2f/dxdb = 0
4 d^2f/dxdb = 0
5 d^2f/dxdb = .55105007
6 d^2f/dxdb = .22692566
7 d^2f/dxdb = .02636218
8 d^2f/dxdb = 0
9 d^2f/dxdb = 0
wmed: Std. Err. = .00981917
wfed ... continuous
1 d^2f/dxdb = 0
2 d^2f/dxdb = 0
3 d^2f/dxdb = 0
4 d^2f/dxdb = 0
5 d^2f/dxdb = .19336797
6 d^2f/dxdb = .49239776
7 d^2f/dxdb = .0211655
8 d^2f/dxdb = 0
9 d^2f/dxdb = 0
wfed: Std. Err. = .00903345
Marginal effects after treatreg
y = Pr(wc) (predict, ptrt)
= .23905623
------------------------------------------------------------------------------
variable | dy/dx Std. Err. z P>|z| [ 95% C.I. ] X
---------+--------------------------------------------------------------------
wmed | .0371643 .00982 3.78 0.000 .017919 .05641 9.136
wfed | .0298382 .00903 3.30 0.001 .012133 .047543 8.608
------------------------------------------------------------------------------
Looking at the coefficient matrix stored after the estimation, we see there
are four equations and nine coefficients. The last two equations contain
only constants, so we expect all their derivatives to be zero.
In computing the marginal effects, we see that df/d(xb) is zero for the
first equation, which makes sense because the prediction function I
specified is the probability of treatment. We also see that df/d(xb) is
nonzero for the second equation, as we would expect.
df/d(xb_2) = .31020517
In computing the standard errors, we see that the second derivatives are
only calculated for l≤k, as expected, and that the only nonzero values
occur in equation 2:
d^2f/d(xb_2)d(xb_2) = .22004165
Let’s check that the nonzero second derivatives agree with our
formulas:
D(wmed, wmed) = b[wmed] * wmed * d/d(xb_2) ( df/d(xb_2) ) + df/d(xb_2)
              = .1198055 * 9.136 * .22004165 + .31020517
              = .55105
This is the coefficient of wmed multiplied by the mean of wmed
and then multiplied by d^2f/d(xb_2)d(xb_2) plus df/d(xb_2). Here are the
remaining nonzero derivatives:
D(wmed, wfed) = b[wmed] * wfed * d/d(xb_2) ( df/d(xb_2) )
              = .1198055 * 8.608 * .22004165
              = .22692582

D(wmed, _cons) = b[wmed] * _cons * d/d(xb_2) ( df/d(xb_2) )
               = .1198055 * 1 * .22004165
               = .0263622

D(wfed, wmed) = b[wfed] * wmed * d/d(xb_2) ( df/d(xb_2) )
              = .0961886 * 9.136 * .22004165
              = .19336

D(wfed, wfed) = b[wfed] * wfed * d/d(xb_2) ( df/d(xb_2) ) + df/d(xb_2)
              = .0961886 * 8.608 * .22004165 + .31020517
              = .492397

D(wfed, _cons) = b[wfed] * _cons * d/d(xb_2) ( df/d(xb_2) )
               = .0961886 * 1 * .22004165
               = .0211655
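As one last check (my own sketch, not part of mfx’s output), we can stack
wmed’s nine derivatives from the trace above into a row vector D and push it
through the delta-method formula, using e(V) from the treatreg fit that is
still in memory. The result should reproduce, up to rounding, the standard
error of .00981917 reported for wmed:

* sketch only: rebuild the delta-method standard error for wmed from the
* derivatives shown in the trace; positions 5-7 of D correspond to
* wc:wmed, wc:wfed, and wc:_cons in the order of e(b)
matrix V = e(V)
matrix D = (0, 0, 0, 0, .55105007, .22692566, .02636218, 0, 0)
matrix VME = D*V*D'
display "my std. err. for wmed = " sqrt(VME[1,1])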