Example 4: Work with Mata

In this example, we will illustrate how to use the %%mata magic command to combine Python’s capabilities with features of Mata, Stata’s matrix programming language. For more information about this magic command, see The mata magic.

Below, we will see how to use Mata to fit a multiple linear regression model using data from Python, and we will push the estimation results back to Python. Consider the following model with \(n\) observations on \(k\) independent variables, \(x_1\), \(x_2\), …, \(x_k\), and one response variable, \(y\):

\[y_i=\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ... + \beta_kx_{ik} + \epsilon_i\]

where \(i=1,...,n\).

Let \(X\) denote the matrix of observations on the right-hand-side variables, and let \(y\) denote the vector of observations on the left-hand-side variable. The ordinary least-squares point estimates are given by

\[b = (X'X)^{-1}X'y\]

and the variance–covariance matrix of the estimators (VCE) is given by

\[V = s^2(X'X)^{-1}\]

where

\[\begin{split}\begin{array}{c} s^2 = \frac{e'e}{n-k} \\ e = y - Xb \\ \end{array}\end{split}\]

For illustrative purposes, we generate two random NumPy arrays in Python, X and y, representing the observations of the independent variables and the response variable. Our goal is to get the coefficients \(b\) and their standard errors, \(t\) statistics, and \(p\)-values.

[23]:
import numpy as np
np.random.seed(17)

# 100,000 observations and 5 variables
X = np.random.random((100000, 5))
y = np.random.random((100000, 1))

Then, we push X and y to Mata by using the magic command %%mata, and we fit the model.

[24]:
%%mata -m X,y
//add the constant term to X
cons = J(100000,1,1)
X = (X, cons)

// calculate b
b = invsym(X'X)*X'y

// calcuate residuals and square of s
e  = y - X*b
n  = rows(X)
k  = cols(X)
s2 = (e'e)/(n-k)

// calculate V
V = s2*invsym(X'X)

. mata
------------------------------------------------- mata (type end to exit) -----
: //add the constant term to X
: cons = J(100000,1,1)

: X = (X, cons)

:
: // calculate b
: b = invsym(X'X)*X'y

:
: // calcuate residuals and square of s
: e  = y - X*b

: n  = rows(X)

: k  = cols(X)

: s2 = (e'e)/(n-k)

:
: // calculate V
: V = s2*invsym(X'X)

: end
-------------------------------------------------------------------------------

.

Next, we calculate the standard errors, \(t\) statistics, and \(p\)-values. (The standard errors are just the square roots of the diagonals of V.)

[25]:
%%mata
se = sqrt(diagonal(V))
tstat = b:/se
pval = 2*ttail(n-k, abs(b:/se))

. mata
------------------------------------------------- mata (type end to exit) -----
: se = sqrt(diagonal(V))

: tstat = b:/se

: pval = 2*ttail(n-k, abs(b:/se))

: end
-------------------------------------------------------------------------------

.

:/ is Mata’s colon operator that performs element-by-element division. Next, we store the above results into a Mata matrix ols_est and push it back to Python.

[26]:
%%mata -outm ols_est
ols_est = (b, se, tstat, pval)
ols_est

. mata
------------------------------------------------- mata (type end to exit) -----
: ols_est = (b, se, tstat, pval)

: ols_est
                  1              2              3              4
    +-------------------------------------------------------------+
  1 |  -.0038839795    .0031646593    -1.22729782    .2197135641  |
  2 |  -.0013711698     .003172241   -.4322401082    .6655678436  |
  3 |  -.0024737898    .0031702694    -.780308998    .4352108648  |
  4 |   .0010630469    .0031691361    .3354374569    .7372958202  |
  5 |  -.0005163881    .0031690687   -.1629463188    .8705610303  |
  6 |   .5044300809    .0036534944    138.0678398              0  |
    +-------------------------------------------------------------+

: end
-------------------------------------------------------------------------------

.

In Python, we now have a NumPy array named ols_est with the same contents as the above Mata matrix.

[27]:
ols_est
[27]:
array([[-3.88397947e-03,  3.16465931e-03, -1.22729782e+00,
         2.19713564e-01],
       [-1.37116981e-03,  3.17224103e-03, -4.32240108e-01,
         6.65567844e-01],
       [-2.47378977e-03,  3.17026944e-03, -7.80308998e-01,
         4.35210865e-01],
       [ 1.06304695e-03,  3.16913608e-03,  3.35437457e-01,
         7.37295820e-01],
       [-5.16388075e-04,  3.16906868e-03, -1.62946319e-01,
         8.70561030e-01],
       [ 5.04430081e-01,  3.65349441e-03,  1.38067840e+02,
         0.00000000e+00]])

When accessing Mata matrices from Python, you can also use the Mata class defined in the Stata Function Interface (sfi) module. For example, you can use the get() method to store a Mata matrix in Python.

[28]:
from sfi import Mata
ols_est2 = np.array(Mata.get("ols_est"))
[29]:
ols_est2
[29]:
array([[-3.88397947e-03,  3.16465931e-03, -1.22729782e+00,
         2.19713564e-01],
       [-1.37116981e-03,  3.17224103e-03, -4.32240108e-01,
         6.65567844e-01],
       [-2.47378977e-03,  3.17026944e-03, -7.80308998e-01,
         4.35210865e-01],
       [ 1.06304695e-03,  3.16913608e-03,  3.35437457e-01,
         7.37295820e-01],
       [-5.16388075e-04,  3.16906868e-03, -1.62946319e-01,
         8.70561030e-01],
       [ 5.04430081e-01,  3.65349441e-03,  1.38067840e+02,
         0.00000000e+00]])

In the above example, we divided the calculation into multiple steps by invoking Mata multiple times. But, you can combine all those steps into one %%mata call.

[30]:
%%mata -m X,y -outm ols_est

//add the constant term to X
cons = J(100000,1,1)
X = (X, cons)

// calculate b
b = invsym(X'X)*X'y

// calculate residuals and square of s
e  = y - X*b
n  = rows(X)
k  = cols(X)
s2 = (e'e)/(n-k)

// calculate V
V = s2*invsym(X'X)

se = sqrt(diagonal(V))
tstat = b:/se
pval = 2*ttail(n-k, abs(b:/se))

ols_est = (b, se, tstat, pval)
ols_est

. mata
------------------------------------------------- mata (type end to exit) -----
:
: //add the constant term to X
: cons = J(100000,1,1)

: X = (X, cons)

:
: // calculate b
: b = invsym(X'X)*X'y

:
: // calculate residuals and square of s
: e  = y - X*b

: n  = rows(X)

: k  = cols(X)

: s2 = (e'e)/(n-k)

:
: // calculate V
: V = s2*invsym(X'X)

:
: se = sqrt(diagonal(V))

: tstat = b:/se

: pval = 2*ttail(n-k, abs(b:/se))

:
: ols_est = (b, se, tstat, pval)

: ols_est
                  1              2              3              4
    +-------------------------------------------------------------+
  1 |  -.0038839795    .0031646593    -1.22729782    .2197135641  |
  2 |  -.0013711698     .003172241   -.4322401082    .6655678436  |
  3 |  -.0024737898    .0031702694    -.780308998    .4352108648  |
  4 |   .0010630469    .0031691361    .3354374569    .7372958202  |
  5 |  -.0005163881    .0031690687   -.1629463188    .8705610303  |
  6 |   .5044300809    .0036534944    138.0678398              0  |
    +-------------------------------------------------------------+

: end
-------------------------------------------------------------------------------

.

We can compare ols_est with the results reported by Stata’s regress command. Below, we will use the %%stata magic command to execute it. Before we do that, we combine X and y into one NumPy array called data.

[31]:
data = np.concatenate((X, y), axis=1)

Then, we push data to Stata. When reading a NumPy array into Stata, the variables are named v1, v2, … by default, so we rename them. Then, we fit the regression model.

[32]:
%%stata -d data -force
describe

// rename the last variable to y
rename v6 y

// rename the other variables, prefixing them with an x
rename v* x*

regress y x1-x5

. describe

Contains data
 Observations:       100,000
    Variables:             6
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
v1              double  %10.0g
v2              double  %10.0g
v3              double  %10.0g
v4              double  %10.0g
v5              double  %10.0g
v6              double  %10.0g
-------------------------------------------------------------------------------
Sorted by:
     Note: Dataset has changed since last saved.

.
. // rename the last variable to y
. rename v6 y

.
. // rename the other variables, prefixing them with an x
. rename v* x*

.
. regress y x1-x5

      Source |       SS           df       MS      Number of obs   =   100,000
-------------+----------------------------------   F(5, 99994)     =      0.49
       Model |  .204607486         5  .040921497   Prob > F        =    0.7847
    Residual |  8367.33648    99,994  .083678386   R-squared       =    0.0000
-------------+----------------------------------   Adj R-squared   =   -0.0000
       Total |  8367.54109    99,999  .083676248   Root MSE        =    .28927

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
          x1 |   -.003884   .0031647    -1.23   0.220    -.0100867    .0023187
          x2 |  -.0013712   .0031722    -0.43   0.666    -.0075887    .0048464
          x3 |  -.0024738   .0031703    -0.78   0.435    -.0086875    .0037399
          x4 |    .001063   .0031691     0.34   0.737    -.0051484    .0072745
          x5 |  -.0005164   .0031691    -0.16   0.871    -.0067277    .0056949
       _cons |   .5044301   .0036535   138.07   0.000     .4972693    .5115909
------------------------------------------------------------------------------

.

As you can see, the results shown above are the same as those we obtained in the Mata matrix when using the %%mata magic command.