Call Stata from Python

Zhao Xu

Principal Software Engineer
StataCorp LLC
July 16, 2021

Outline

Introduction

Stata provides a tight integration with Python named PyStata. It contains two parts:

How it works

The pystata Python package includes two sets of tools for interacting with Stata from Python: three IPython magic commands (stata, mata, and pystata) and a suite of API functions.

The magic commands can be used to access Stata and Mata interactively in an IPython kernel-based environment. The API functions can be used to interact with Stata and Mata from both IPython and command-line environments.

Configuration and initialization

The pystata package is shipped with Stata and is located in the STATA_SYSDIR/utilities/pystata directory. To get started, we need to configure the pystata package within Python so that it can initialize Stata.

There are four methods to initialize Stata from within Python. The first method uses the configuration module stata_setup, available from the Python Package Index (PyPI), which locates the pystata package and initializes Stata.

Suppose we have Stata installed in C:/Program Files/Stata17/ and we use the Stata/MP edition. Stata can be initialized as follows:
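A minimal sketch of that initialization, assuming stata_setup has been installed from PyPI (for example with pip install stata_setup). The wrapper name init_stata and its edition check are illustrative additions, not part of pystata; the call that does the work is stata_setup.config().

```python
# Hypothetical convenience wrapper around stata_setup.config().
VALID_EDITIONS = ("mp", "se", "be")  # Stata/MP, Stata/SE, Stata/BE


def init_stata(stata_dir, edition="mp"):
    """Initialize Stata for use from the current Python session."""
    if edition not in VALID_EDITIONS:
        raise ValueError(
            f"edition must be one of {VALID_EDITIONS}, got {edition!r}"
        )
    # Imported lazily so the check above also works where Stata is absent.
    import stata_setup

    stata_setup.config(stata_dir, edition)


# Usage with the installation described above (requires Stata 17):
# init_stata("C:/Program Files/Stata17/", "mp")
```

Calling stata_setup.config("C:/Program Files/Stata17/", "mp") directly is equivalent; the wrapper only adds an early check on the edition string.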

Call Stata using magic commands

The pystata package provides three magic commands to interact with Stata from within the IPython environment: stata, mata, and pystata.

The stata magic

The stata magic can be used as both a cell magic and a line magic to execute Stata commands.

Execute one line or a block of Stata commands.

When the line magic command %stata is used, a one-line Stata command can be specified and executed, as it would be in Stata's Command window. When the cell magic command %%stata is used, a block of Stata commands can be specified and executed all at once. This is similar to executing a series of commands from a do-file.

Cell magic syntax:

%%stata [-d DATA] [-f DFLIST|ARRLIST] [-force]
 [-doutd DATAFRAME] [-douta ARRAY] [-foutd FRAMELIST] [-fouta FRAMELIST]
 [-ret DICTIONARY] [-eret DICTIONARY] [-sret DICTIONARY] [-qui] [-nogr]
 [-gw WIDTH] [-gh HEIGHT]

Optional arguments:

  -d DATA               Load a NumPy array or pandas DataFrame 
                        into Stata as the current working dataset.

  -f DFLIST|ARRLIST     Load one or multiple NumPy arrays or 
                        pandas DataFrames into Stata as frames. 
                        The arrays and dataframes should be 
                        separated by commas. Each array or 
                        DataFrame is stored in Stata as a separate 
                        frame with the same name.

  -force                Force loading of the NumPy array or pandas 
                        DataFrame into Stata as the current working 
                        dataset, even if the dataset in memory has 
                        changed since it was last saved; or force 
                        loading of the NumPy arrays or pandas DataFrames 
                        into Stata as separate frames even if one or 
                        more of the frames already exist in Stata.

  -doutd DATAFRAME      Save the dataset in memory as a pandas 
                        DataFrame when the cell completes.

  -douta ARRAY          Save the dataset in memory as a NumPy 
                        array when the cell completes.

  -foutd FRAMELIST      Save one or multiple Stata frames as pandas 
                        DataFrames when the cell completes. The Stata 
                        frames should be separated by commas. Each 
                        frame is stored in Python as a pandas 
                        DataFrame. The variable names in each frame 
                        will be used as the column names in the 
                        corresponding dataframe. 

  -fouta FRAMELIST      Save one or multiple Stata frames as NumPy 
                        arrays when the cell completes. The Stata frames 
                        should be separated by commas. Each frame is 
                        stored in Python as a NumPy array.

  -ret DICTIONARY       Store current r() results into a dictionary.

  -eret DICTIONARY      Store current e() results into a dictionary.

  -sret DICTIONARY      Store current s() results into a dictionary.

  -qui                  Run Stata commands but suppress output.

  -nogr                 Do not display Stata graphics.

  -gw WIDTH             Set graph width in inches, pixels, or centimeters; 
                        default is inches.

  -gh HEIGHT            Set graph height in inches, pixels, or centimeters; 
                        default is inches.


Line magic syntax:

%stata stata_cmd

%%stata cell magic

The %%stata magic is used to execute Stata code within a cell.

%stata line magic

The %stata magic gives users a quick way to execute a single-line Stata command.

Arguments

The cell magic %%stata accepts arguments that control how the Stata commands within the cell are executed.

Load dataset from Python

There are many ways to load data from Python into Stata's current dataset in memory. For example:

  1. pandas DataFrames and NumPy arrays can be loaded directly into Stata.
  2. The Data and Frame classes within the Stata Function Interface (sfi) module provide multiple methods for loading data from Python.
  3. Stata can read in data from a variety of sources, many of which can be created in Python: Excel files, CSV files, SPSS and SAS datasets, and various databases.

We have data from the Second National Health and Nutrition Examination Survey (NHANES II; McDowell et al. 1981), which studied the health and nutritional status of adults and children between 1976 and 1980. We want to use Stata’s features to fit a regression model of systolic blood pressure (bpsystol) as a function of age group (agegrp) and sex. Then we want to see how the average predicted systolic blood pressure varies across age groups, and between males and females within each age group.

We use the -d argument to load the DataFrame into Stata as the current dataset. Within Stata, we encode agegrp and sex and label the resulting variables, agegrp2 and sex2, along with bpsystol.

Next, we fit the model and push Stata's estimation results into Python. The estimation results are stored in steret, which is a Python dictionary.

You can access specific elements of the dictionary. For example, you can access e(b) and e(V) by typing steret['e(b)'] and steret['e(V)'] in Python.
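As a sketch of what can be done with such a dictionary, the block below derives standard errors and z statistics from e(b) and e(V). It assumes that matrix results are held as NumPy arrays; the dictionary built here is only a stand-in with placeholder numbers, not output from the NHANES II model.

```python
import numpy as np

# Stand-in for the dictionary filled by -eret. All numbers are placeholders.
steret = {
    "e(N)": 3.0,
    "e(b)": np.array([[0.5, -1.0, 2.0]]),  # 1 x k row vector of coefficients
    "e(V)": np.diag([0.04, 0.09, 0.16]),   # k x k covariance matrix
}

b = steret["e(b)"].ravel()              # flatten to a length-k vector
se = np.sqrt(np.diag(steret["e(V)"]))   # standard errors from the diagonal
z = b / se                              # z statistic for each coefficient

for coef, s, stat in zip(b, se, z):
    print(f"b = {coef:6.2f}   se = {s:5.2f}   z = {stat:6.2f}")
```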

Push Stata dataset into Python

The iris dataset consists of four features measured on 50 samples from each of three Iris species. These data were used in Fisher's (1936) article.

Our goal is to build a classifier using those four features to detect the Iris type. We will use the Random Forest classification model within the scikit-learn Python package to achieve this goal.

Now we have two NumPy arrays in Python, training and test. Below we split each array into two sub-arrays to store the features and labels separately.
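The split can be sketched with plain NumPy slicing. The column layout (four feature columns followed by one label column) is an assumption about how the arrays were exported, and the small arrays below are stand-ins for the real training and test arrays; split_features_labels is a hypothetical helper name.

```python
import numpy as np

# Stand-in arrays: four measurements followed by a species code per row.
training = np.array([
    [5.1, 3.5, 1.4, 0.2, 1],
    [7.0, 3.2, 4.7, 1.4, 2],
    [6.3, 3.3, 6.0, 2.5, 3],
])
test = np.array([
    [4.9, 3.0, 1.4, 0.2, 1],
    [5.8, 2.7, 5.1, 1.9, 3],
])


def split_features_labels(arr, n_features=4):
    """Return (X, y): the first n_features columns and the label column."""
    X = arr[:, :n_features]
    y = arr[:, n_features].astype(int)
    return X, y


X_train, y_train = split_features_labels(training)
X_test, y_test = split_features_labels(test)
```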

Then we use X_train and y_train to train the classification model.

Next we use X_test and y_test to evaluate the performance of the trained model. We also predict the species type of each flower in the test dataset and the probabilities that it belongs to each of the three species.
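A sketch of the training and evaluation steps with scikit-learn's RandomForestClassifier. The data here are synthetic clusters generated only so that the block runs on its own, and the hyperparameters (n_estimators, random_state) are illustrative choices rather than the settings used in the talk.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)


def make_fake_flowers(n_per_class):
    """Synthetic stand-in data: three separated clusters labeled 1, 2, 3."""
    X = np.vstack([
        rng.normal(loc=3.0 * k, scale=0.3, size=(n_per_class, 4))
        for k in (1, 2, 3)
    ])
    y = np.repeat([1, 2, 3], n_per_class)
    return X, y


X_train, y_train = make_fake_flowers(30)
X_test, y_test = make_fake_flowers(10)

# Train the classifier on the training features and labels.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Evaluate on the held-out data and obtain per-flower predictions.
accuracy = clf.score(X_test, y_test)     # mean accuracy on the test set
y_pred = clf.predict(X_test)             # predicted species code per flower
y_pred_prob = clf.predict_proba(X_test)  # one probability column per species
```

The columns of y_pred_prob follow the order of clf.classes_, so with labels 1, 2, 3 the first column holds the probability of species 1.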

Next, in the test frame, we create a byte variable irispr to store the predicted species types and three float variables to store the probabilities, taken from the array y_pred_prob, that each flower belongs to each of the three species types.
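One way this step might look with the Frame class of the sfi module is sketched below. The helper store_predictions and the probability variable names pr1, pr2, and pr3 are assumptions made for illustration; the frame argument is expected to behave like a connected sfi Frame, that is, to offer addVarByte(), addVarFloat(), and store().

```python
def store_predictions(frame, y_pred, y_pred_prob,
                      prob_vars=("pr1", "pr2", "pr3")):
    """Write predicted classes and class probabilities into a Stata frame.

    `frame` is assumed to be an sfi Frame (or any object with the same
    addVarByte/addVarFloat/store methods). `prob_vars` are hypothetical names.
    """
    # Predicted species codes go into a byte variable.
    frame.addVarByte("irispr")
    frame.store("irispr", None, [int(v) for v in y_pred])  # None: all obs

    # One float variable per column of the probability array.
    for j, name in enumerate(prob_vars):
        frame.addVarFloat(name)
        frame.store(name, None, [float(row[j]) for row in y_pred_prob])
```

Inside a session where Stata has been initialized, the call would be along the lines of store_predictions(Frame.connect("test"), y_pred, y_pred_prob) after from sfi import Frame.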

In Stata, we change the current working frame to test. We attach the value label species to irispr, and use the tabulate command to display a classification table. We also list the flowers that have been misclassified.

The mata magic

The mata magic is used to execute Mata code. It can be used as both a line magic and a cell magic command.

Execute one line or a block of Mata code.

When the %mata line magic command is used, one line of Mata code can be specified and executed. This is similar to specifying mata: istmt within Stata. When the %%mata cell magic command is used, a block of Mata code can be specified. The code is executed just as it would be in a do-file.

Cell magic syntax:

%%mata [-m ARRAYLIST] [-outm MATLIST] [-qui] [-c] 

  Execute a block of Mata code. This is equivalent to running a 
  block of Mata code within a do-file. You do not need to 
  explicitly place the code within the mata[:] and end block. 

  Optional arguments:

    -m ARRAYLIST      Load multiple NumPy arrays into Mata's 
                      interactive environment. The array names 
                      should be separated by commas. Each array is 
                      stored in Mata as a matrix with the same name.

    -outm MATLIST     Save Mata matrices as NumPy arrays when the 
                      cell completes. The matrix names should be
                      separated by commas. Each matrix is stored 
                      in Python as a NumPy array.

    -qui              Run Mata code but suppress output.

    -c                This argument specifies that Mata code be 
                      executed in mata: mode. This means that
                      if Mata encounters an error, it will stop 
                      execution and return control to Python. The 
                      error will then be thrown as a Python SystemError 
                      exception.


Line magic syntax:

%mata [-c]

  Enter Mata's interactive environment. This is equivalent to 
  typing mata or mata: in the Stata Command window.

  Optional argument:

    -c                Enter interactive Mata environment in mata:
                      mode. The default is to enter in mata mode.

%mata istmt 

  Run a single-line Mata statement. This is equivalent to executing 
  mata: istmt within Stata.

The pystata magic

The %pystata line magic is used to configure the system and display current system information and settings.

Stata utility commands.

Line magic syntax:

%pystata status

  Display current system information and settings.

%pystata set graph_show True|False [, perm]

  Set whether Stata graphics are displayed. The default is to 
  display the graphics. Note that if multiple graphs are 
  generated, only the last one is displayed. To display multiple
  graphs, use the name() option with Stata's graph commands.

%pystata set graph_size w #[in|px|cm] [, perm]  
%pystata set graph_size h #[in|px|cm] [, perm]
%pystata set graph_size w #[in|px|cm] h #[in|px|cm] [, perm]

  Set dimensions for Stata graphs. The default is a 5.5-inch width 
  and 4-inch height. Values may be specified in inches, pixels, or 
  centimeters; the default unit is inches. Either the width or height 
  must be specified, or both. If only one is specified, the other one 
  is determined by the aspect ratio.          

%pystata set graph_format svg|png|pdf [, perm]

  Set the graphic format used to display Stata graphs. The default 
  is svg. If svg or png is specified, the graphs will be embedded. 
  If pdf is specified, the graphs will be displayed, and exported 
  to PDF files and stored in the current working directory with 
  numeric names, such as 0.pdf, 1.pdf, 2.pdf, etc. Storing the PDF 
  graph files in the current directory allows you to embed them in 
  the notebook when exporting the notebook to a PDF via LaTeX.

Call Stata using API functions

In addition to the magic commands, you can also call Stata using the API functions defined in the stata module.

We use a dataset containing quarterly turkey sales throughout the 1990s to illustrate how to call Stata using API functions.

We load the pandas DataFrame into Stata using the pdataframe_to_data() function of the stata module.

Next, we declare the data to be time-series data using the run() function.

Afterwards, we fit an autoregressive integrated moving average (ARIMA) model on sales using the arima command. Then we predict the sales values, storing these values in the variable sales_pred.

Then we use the pdataframe_from_data() function to store the time variable t, the original sales values sales, and the predicted values sales_pred in a pandas DataFrame named stpred.
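The four steps above can be sketched as a single function. This is an outline under stated assumptions rather than the code from the talk: the helper name fit_and_predict is hypothetical, the ARIMA specification arima(1,0,0) is a placeholder, and the stata argument stands for the pystata stata module, passed in so that the function can be read and exercised without a Stata installation.

```python
def fit_and_predict(stata, df, arima_spec="arima(1,0,0)"):
    """Load df into Stata, fit an ARIMA model on sales, return predictions.

    `stata` is expected to be the module obtained with
    `from pystata import stata` after Stata has been initialized.
    `arima_spec` is a placeholder, not the specification used in the talk.
    """
    # Load the pandas DataFrame as the current dataset, replacing any data.
    stata.pdataframe_to_data(df, force=True)

    # Declare the data to be time series with time variable t.
    stata.run("tsset t")

    # Fit the model and store the predictions in sales_pred.
    stata.run(f"arima sales, {arima_spec}")
    stata.run("predict sales_pred")

    # Pull the three variables back into Python as a pandas DataFrame.
    return stata.pdataframe_from_data(var=["t", "sales", "sales_pred"])
```

With Stata initialized, stpred = fit_and_predict(stata, df) would then return the DataFrame described above.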

Next we plot the original turkey sales values and the predicted values in Python.

Summary

Additional resources