PyStata—Python integration

Order

Watch video demo

<- See Stata's other features

Highlights

Python integration with Stata

Invoke Python interactively
Embed Python code in a do-file
Run a Python script file within Stata
Embed Python code in an ado-file

Stata Function Interface (sfi) Python module

Access Stata's core features in Python, including datasets, frames, macros, scalars, matrices, value labels, characteristics, and more
Store Python results back into Stata

Embed and execute Python code from within Stata. Stata's python command provides a suite of subcommands allowing you to easily call Python from Stata and output Python results within Stata.

You can invoke Python interactively or in do-files and ado-files so that you can leverage Python's extensive language features. You can also execute a Python script file (.py) directly through Stata.

PyStata also lets you call Stata from Python and use Stata in Jupyter Notebook.

Additionally, the Stata Function Interface (sfi) Python module is included. It provides a bidirectional connection between Stata and Python. It allows you to interact Python's capabilities with Stata's core features. Within the module, classes are defined to provide access to Stata's current dataset, frames, macros, scalars, matrices, value labels, characteristics, global Mata matrices, and more.

Let's see it work

The first time you call python in Stata, Stata will search for Python installations on the system and choose the one with the highest version. Once Stata finds the candidate with the highest version, it will save that information to use in the future. You can then start your Python journey within Stata. Next we will show you how to invoke Python from Stata.

Invoke Python interactively

You can type python in the Stata Command window to enter the Python environment. Think of this as an interactive Python shell. You can use it much like you can use Mata (Stata's built-in matrix programming language) interactively. For example, you could type

. python

                                                python (type end to exit)        
>>> print('Hello, Python!')
Hello, Python!
>>> mylist = ['abcd', 123, 1.23, 'efg']
>>> for i in range(3):
...    print(i)
...
0
1
2
>>> end
                                                                                  
.

Embed Python code in a do-file

It is easy to embed Python code in a do-file. All you need to do is place the Python code within a python and end block.

We will use the famous iris dataset as an illustration. This dataset is used in Fisher's (1936) article. Fisher obtained the iris data from Anderson (1935). The data consist of 4 features measured on 50 samples from each of 3 Iris species. The 4 features are the length and width of the sepal and petal. The 3 species are Iris setosa, Iris versicolor, and Iris virginica.

. use http://www.stata-press.com/data/r19/iris, clear
(Iris data)

. describe

Contains data from http://www.stata-press.com/data/r19/iris
 Observations:           150                  Iris data
    Variables:             5                  18 Jan 2024 13:23
                                              (_dta has notes)



Variable      Storage   Display    Value
    name         type    format    label      Variable label 

iris            byte    %10.0g     species    Iris species
seplen          double  %4.1f                 Sepal length in cm
sepwid          double  %4.1f                 Sepal width in cm
petlen          double  %4.1f                 Petal length in cm
petwid          double  %4.1f                 Petal width in cm


Sorted by:

Our goal is to build a classifier using those features to d etect the iris type. To do so, we will use the support vector machine (SVM) classifier within the scikit-learn Python package. Note that you need to install the Matplotlib, sklearn, and NumPy packages in your current Python installation to run the following example. Before using Matplotlib with Stata, you may need to set the backend for different Python installations. See Using Matplotlib with Stata for a discussion about calling Matplotlib from Stata. We put the following code in the Do-file Editor and execute it:

. use http://www.stata-press.com/data/r19/iris, clear
(Iris data)

python:
from sfi import Data
import numpy as np
from sklearn.svm import SVC
import matplotlib.pyplot as plt

# Use the sfi Data class to pull data from Stata variables into Python
X = np.array(Data.get("seplen sepwid petlen petwid"))
y = np.array(Data.get("iris"))

# Draw a graph in Python and save as samplepy.png
fig = plt.figure(1, figsize=(10, 8))
ax = plt.axes(projection='3d')
ax.view_init(elev=-155, azim=105)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, s=30)
ax.set_xlabel("Sepal length (.cm)")
ax.set_ylabel("Sepal width (.cm)")
ax.set_zlabel("Petal length (.cm)")
plt.savefig("samplepy.png")

# Use the data to train C-Support Vector Classifier
svc_clf = SVC(gamma='auto')
svc_clf.fit(X, y)

# Obtain prediction and store back into new Stata variable irispr
irispr = svc_clf.predict(X)
Data.addVarByte('irispr')
Data.store('irispr', None, irispr)
end

* See results in Stata
label values irispr species
label variable irispr predicted
tabulate iris irispr, row

In the code above, we did the following:

Imported all the modules, functions, and objects we were going to use.
Loaded the four features and the species type into two NumPy arrays, X and y, respectively. Note that for simplicity, we do not split our dataset here and we used all the instances as our training samples.
Drew a 3D scatterplot using Matplotlib.
Built an SVC classifier for classification defined in the sklearn package and fit the model.
Made predictions for the dataset using the trained classifier.
Stored the predictions into a new variable, irispr, in Stata.
Back in Stata, attached the value label of iris onto irispr and used the tabulate command to display a classification table.

We saved the above code in samplepy.do and ran

. do samplepy

which produced the following image and output:



 Key

   frequency    
 row percentage 



       Iris              predicted                      
    species      Setosa  Versicolo  Virginica     Total

     Setosa          50          0          0        50
                100.00       0.00       0.00    100.00

 Versicolor           0         48          2        50
                   0.00      96.00       4.00    100.00

  Virginica           0          0         50        50
                   0.00       0.00     100.00    100.00

      Total          50         48         52       150
                  33.33      32.00      34.67    100.00

The above table shows that two Iris versicolor observations were misclassified as Iris virginica, and no Iris setosa or Iris virginica were misclassified.

Embed Python code in an ado-file

Python code can be embedded and executed in ado-files too. Below, we create the command mysvm in mysvm.ado to illustrate this purpose. mysvm expects a label variable to be specified first, followed by a list of feature variables along with a predict() option specifying the name of the variable where the prediction will be stored.

program mysvm
    version 18
    syntax varlist, predict(name)

    gettoken label feature : varlist

    //call the Python function
    python: dosvm("`label'", "`feature'", "`predict'")
end

version 18
python:
from sfi import Data
import numpy as np
from sklearn.svm import SVC

def dosvm(label, features, predict):
    X = np.array(Data.get(features))
    y = np.array(Data.get(label))

    svc_clf = SVC(gamma='auto')
    svc_clf.fit(X, y)

    y_pred = svc_clf.predict(X)

    Data.addVarByte(predict)
    Data.store(predict, None, y_pred)

end

In the above ado-file, we defined the classifier within the Python function dosvm(), which took the species type variable, the four feature variables, and the new variable storing the predictions as arguments. We called the Python function using the python: istmt syntax in the ado-code.

. use http://www.stata-press.com/data/r19/iris, clear

. describe

. mysvm iris seplen sepwid petlen petwid, predict(irispr)

. label variable irispr predicted

. label values irispr species

. tabulate iris irispr, row

References

Anderson, E. 1935. The irises of the Gaspé Peninsula. Bulletin of the American Iris Society 59: 2–5.

Fisher, R. A. 1936. The use of multiple measurements in taxonomic problems. Annals of Eugenics 7: 179–188.

Hunter, J.D. 2007 "Matplotlib: A 2D Graphics Environment". Computing in Science & Engineering 9: 90–95.

Oliphant, T. E. 2006. A Guide to NumPy, 2nd. Ed. Austin, TX: Continuum Press.

Pedregosa F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. JMLR 12: 2825–2830.

van der Walt, S. C. Colbert, and G. Varoquaux. 2011. The NumPy Array: A Structure for Efficient Numerical Computation. Computing in Science & Engineering. 13: 22–30. DOI:10.1109/MCSE.2011.37 (publisher link).

Wikipedia contributors. 2019. Iris flower data set. Wikipedia, The Free Encyclopedia. 2019 Jun 19, 18:30 UTC [cited 2019 Jun 24].

Tell me more

Learn more about Stata's programming features.

Read more about how to call Python from Stata in the Stata Programming Reference Manual; see [P] PyStata integration.

Read more about how to call Stata from Python in the Stata Programming Reference Manual; see [P] PyStata intro.

Read more about the Stata Function Interface (sfi) Python module from stata.com/python/api.

Products

New in Stata 19

Why Stata

All features

Disciplines

Stata/MP

StataNow

Order Stata

Purchase

Order Stata

Bookstore

Stata Press

Stata Journal

Gift Shop

Learn

Free webinars

NetCourses

Web training

Organizational training

Video tutorials

Third-party courses

Web resources

Teaching with Stata

Support

Training

Video tutorials

FAQs

Statalist: The Stata Forum

Resources

Technical support

Customer service

Alerts

Company

News and events

Customer service

Careers

We use cookies

We use cookies to ensure that we give you the best experience on our website—to enhance site navigation, to analyze usage, and to assist in our marketing efforts. By continuing to use our site, you consent to the storing of cookies on your device and agree to delivery of content, including web fonts and JavaScript, from third party web services.

Cookie Settings

Privacy policy

Last updated: 16 November 2022

StataCorp LLC (StataCorp) strives to provide our users with exceptional products and services. To do so, we must collect personal information from you. This information is necessary to conduct business with our existing and potential customers. We collect and use this information only where we may legally do so. This policy explains what personal information we collect, how we use it, and what rights you have to that information.

Required cookies

Advertising cookies

Required cookies

These cookies are essential for our website to function and do not store any personally identifiable information. These cookies cannot be disabled.
Advertising and performance cookies

This website uses cookies to provide you with a better user experience. A cookie is a small piece of data our website stores on a site visitor's hard drive and accesses each time you visit so we can improve your access to our site, better understand how you use our site, and serve you content that may be of interest to you. For instance, we store a cookie when you log in to our shopping cart so that we can maintain your shopping cart should you not complete checkout. These cookies do not directly store your personal information, but they do support the ability to uniquely identify your internet browser and device.

Please note: Clearing your browser cookies at any time will undo preferences saved here. The option selected here will apply only to the device you are currently using.

Accept Cookies

	python (type end to exit)
>>> print('Hello, Python!')
Hello, Python!
>>> mylist = ['abcd', 123, 1.23, 'efg']
>>> for i in range(3):
... print(i)
...
0
1
2
>>> end

.


Variable Storage Display Value name type format label Variable label

iris byte %10.0g species Iris species seplen double %4.1f Sepal length in cm sepwid double %4.1f Sepal width in cm petlen double %4.1f Petal length in cm petwid double %4.1f Petal width in cm

Iris	predicted
species	Setosa Versicolo Virginica	Total

Setosa	50 0 0	50
	100.00 0.00 0.00	100.00

Versicolor	0 48 2	50
	0.00 96.00 4.00	100.00

Virginica	0 0 50	50
	0.00 0.00 100.00	100.00

Total	50 48 52	150
	33.33 32.00 34.67	100.00