Example 2: Load data from Python

There are many ways to load data from Python into Stata’s current dataset in memory. For example:

  • Pandas DataFrames and NumPy arrays can be loaded directly into Stata.

  • The Data and Frame classes within the Stata Function Interface (sfi) module provide multiple methods for loading data from Python.

  • Stata can read in data from a variety of sources, many of which can be created in Python: Excel files, CSV files, SPSS and SAS datasets, and various databases.

In this example, we will show you how to load a pandas DataFrame into Stata and then perform analyses in Stata, using the %%stata magic command.

We have data from the Second National Health and Nutrition Examination Survey (NHANES II; McDowell et al. 1981). NHANES II is part of a study to assess the health and nutritional status of adults and children in the United States. It is designed to be a nationally representative sample of the U.S. population. The data for this particular sample were collected between 1976 and 1980.

This dataset contains 10,351 observations and 58 variables, and is stored in a CSV file called nhanes2.csv. Among these variables, we have each individual’s age and systolic blood pressure (bpsystol). Age has been categorized into six groups: 20-29, 30-39, 40-49, 50-59, 60-69, and 70+. The goal is to read the data into a pandas DataFrame and load the DataFrame into Stata. We want to use Stata’s features to fit a regression model of bpsystol as a function of age group (agegrp) and gender. Then we want to see how the average predicted systolic blood pressure varies across individuals in each age group, and across males and females in each age group.

First, we use the pandas function read_csv() to read the data from the .csv file into a DataFrame named nhanes2.

[5]:
import pandas as pd
import io
import requests

data = requests.get("https://www.stata.com/python/pystata18/misc/nhanes2.csv").content
nhanes2 = pd.read_csv(io.StringIO(data.decode('utf-8')))
nhanes2
[5]:
sampl strata psu region smsa location houssiz sex race age ... region4 smsa1 smsa2 smsa3 rural loglead agegrp highlead bmi highbp
0 1400 1 1 S 2 1 4 Male White 54 ... 0 0 1 0 0 NaN 50-59 NaN 20.495686 0
1 1401 1 1 S 2 1 6 Female White 41 ... 0 0 1 0 0 2.564949 40-49 lead<25 21.022337 0
2 1402 1 1 S 1 1 6 Female Other 21 ... 0 1 0 0 0 NaN 20-29 NaN 24.973860 0
3 1404 1 1 S 2 1 9 Female White 63 ... 0 0 1 0 0 NaN 60-69 NaN 35.728722 1
4 1405 1 1 S 1 1 3 Female White 64 ... 0 1 0 0 0 2.995732 60-69 lead<25 27.923803 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10346 48760 32 2 MW 4 48 5 Female White 35 ... 0 0 0 1 1 NaN 30-39 NaN 20.355173 0
10347 48763 32 2 MW 4 48 2 Female White 33 ... 0 0 0 1 1 1.945910 30-39 lead<25 41.645557 1
10348 48764 32 2 MW 4 48 1 Female White 60 ... 0 0 0 1 0 NaN 60-69 NaN 35.626114 0
10349 48768 32 2 MW 4 48 1 Female White 29 ... 0 0 0 1 0 NaN 20-29 NaN 19.204464 0
10350 48770 32 2 MW 4 48 1 Male White 31 ... 0 0 0 1 1 NaN 30-39 NaN 19.635565 0

10351 rows × 58 columns

Then, we load the DataFrame into Stata by specifying the -d argument of the %%stata magic. Here you could also use the API function pdataframe_to_data() of the stata module to load the pandas DataFrame into Stata. See Call Stata using API functions and Example 5 for more information.

Within Stata, we encode agegrp and sex and label the resulting variables, agegrp2 and sex2, along with bpsystol.

[6]:
%%stata -d nhanes2 -force
label define sex2 1 "Male" 2 "Female"
encode sex, generate(sex2) label(sex2)

label define agegrp 1 "20-29" 2 "30-39" 3 "40-49" 4 "50-59" 5 "60-69" 6 "70+"
encode agegrp, generate(agegrp2) label(agegrp)

label variable bpsystol "systolic blood pressure"
label variable agegrp2 "Age Group"
label variable sex2 "1=Male, 2=Female"

describe bpsystol agegrp2 sex2

. label define sex2 1 "Male" 2 "Female"

. encode sex, generate(sex2) label(sex2)

.
. label define agegrp 1 "20-29" 2 "30-39" 3 "40-49" 4 "50-59" 5 "60-69" 6 "70+"

. encode agegrp, generate(agegrp2) label(agegrp)

.
. label variable bpsystol "systolic blood pressure"

. label variable agegrp2 "Age Group"

. label variable sex2 "1=Male, 2=Female"

.
. describe bpsystol agegrp2 sex2

Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
bpsystol        long    %12.0g                systolic blood pressure
agegrp2         long    %8.0g      agegrp     Age Group
sex2            long    %8.0g      sex2       1=Male, 2=Female

.

We then fit a regression model of systolic blood pressure on age group and sex. We also push Stata’s estimation results, including the coefficient vector e(b) and variance–covariance matrix e(V), into a Python dictionary called steret.

[7]:
%%stata -eret steret
// fit a regression model
regress bpsystol agegrp2##sex2

// e() results
ereturn list

. // fit a regression model
. regress bpsystol agegrp2##sex2

      Source |       SS           df       MS      Number of obs   =    10,351
-------------+----------------------------------   F(11, 10339)    =    312.88
       Model |  1407229.28        11  127929.935   Prob > F        =    0.0000
    Residual |  4227440.75    10,339  408.882943   R-squared       =    0.2497
-------------+----------------------------------   Adj R-squared   =    0.2489
       Total |  5634670.03    10,350  544.412563   Root MSE        =    20.221

------------------------------------------------------------------------------
    bpsystol | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
     agegrp2 |
      30-39  |   .7956175   .9473117     0.84   0.401    -1.061297    2.652532
      40-49  |   5.117078   1.018176     5.03   0.000     3.121256      7.1129
      50-59  |   12.20018   1.022541    11.93   0.000      10.1958    14.20456
      60-69  |   16.85887   .8155092    20.67   0.000     15.26031    18.45742
        70+  |   22.50889   1.130959    19.90   0.000     20.29199    24.72579
             |
        sex2 |
     Female  |  -12.60132   .8402299   -15.00   0.000    -14.24833    -10.9543
             |
agegrp2#sex2 |
      30-39 #|
     Female  |   4.140156    1.31031     3.16   0.002     1.571695    6.708617
      40-49 #|
     Female  |   8.644866   1.412067     6.12   0.000     5.876941    11.41279
      50-59 #|
     Female  |   11.83134   1.406641     8.41   0.000     9.074051    14.58863
      60-69 #|
     Female  |     14.093   1.130882    12.46   0.000     11.87625    16.30975
 70+#Female  |   15.86608   1.542296    10.29   0.000     12.84288    18.88928
             |
       _cons |   123.8862   .6052954   204.67   0.000     122.6997    125.0727
------------------------------------------------------------------------------

.
. // e() results
. ereturn list

scalars:
                  e(N) =  10351
               e(df_m) =  11
               e(df_r) =  10339
                  e(F) =  312.8766723590026
                 e(r2) =  .2497447540773559
               e(rmse) =  20.22085415724865
                e(mss) =  1407229.279971501
                e(rss) =  4227440.746112915
               e(r2_a) =  .2489465330013186
                 e(ll) =  -45803.93060947768
               e(ll_0) =  -47291.06810807488
               e(rank) =  12

macros:
            e(cmdline) : "regress bpsystol agegrp2##sex2"
              e(title) : "Linear regression"
          e(marginsok) : "XB default"
                e(vce) : "ols"
             e(depvar) : "bpsystol"
                e(cmd) : "regress"
         e(properties) : "b V"
            e(predict) : "regres_p"
              e(model) : "ols"
          e(estat_cmd) : "regress_estat"

matrices:
                  e(b) :  1 x 21
                  e(V) :  21 x 21
               e(beta) :  1 x 20

functions:
             e(sample)

.

Below, we list the entire contents of the Python dictionary steret.

[8]:
steret
[8]:
{'e(N)': 10351.0,
 'e(df_m)': 11.0,
 'e(df_r)': 10339.0,
 'e(F)': 312.8766723590026,
 'e(r2)': 0.24974475407735586,
 'e(rmse)': 20.220854157248645,
 'e(mss)': 1407229.2799715009,
 'e(rss)': 4227440.746112915,
 'e(r2_a)': 0.24894653300131864,
 'e(ll)': -45803.930609477684,
 'e(ll_0)': -47291.06810807488,
 'e(rank)': 12.0,
 'e(cmdline)': 'regress bpsystol agegrp2##sex2',
 'e(title)': 'Linear regression',
 'e(marginsprop)': 'minus',
 'e(marginsok)': 'XB default',
 'e(vce)': 'ols',
 'e(_r_z_abs__CL)': '|t|',
 'e(_r_z__CL)': 't',
 'e(depvar)': 'bpsystol',
 'e(cmd)': 'regress',
 'e(properties)': 'b V',
 'e(predict)': 'regres_p',
 'e(model)': 'ols',
 'e(estat_cmd)': 'regress_estat',
 'e(b)': array([[  0.        ,   0.79561746,   5.11707797,  12.20017802,
          16.85886868,  22.50888857,   0.        , -12.601317  ,
           0.        ,   0.        ,   0.        ,   4.14015609,
           0.        ,   8.6448661 ,   0.        ,  11.83133884,
           0.        ,  14.09300146,   0.        ,  15.86607901,
         123.88620072]]),
 'e(V)': array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         -0.        ],
        [ 0.        ,  0.89739937,  0.36638257,  0.36638257,  0.36638257,
          0.36638257,  0.        ,  0.36638257,  0.        ,  0.        ,
          0.        , -0.89739937,  0.        , -0.36638257,  0.        ,
         -0.36638257,  0.        , -0.36638257,  0.        , -0.36638257,
         -0.36638257],
        [ 0.        ,  0.36638257,  1.03668247,  0.36638257,  0.36638257,
          0.36638257,  0.        ,  0.36638257,  0.        ,  0.        ,
          0.        , -0.36638257,  0.        , -1.03668247,  0.        ,
         -0.36638257,  0.        , -0.36638257,  0.        , -0.36638257,
         -0.36638257],
        [ 0.        ,  0.36638257,  0.36638257,  1.04559011,  0.36638257,
          0.36638257,  0.        ,  0.36638257,  0.        ,  0.        ,
          0.        , -0.36638257,  0.        , -0.36638257,  0.        ,
         -1.04559011,  0.        , -0.36638257,  0.        , -0.36638257,
         -0.36638257],
        [ 0.        ,  0.36638257,  0.36638257,  0.36638257,  0.66505528,
          0.36638257,  0.        ,  0.36638257,  0.        ,  0.        ,
          0.        , -0.36638257,  0.        , -0.36638257,  0.        ,
         -0.36638257,  0.        , -0.66505528,  0.        , -0.36638257,
         -0.36638257],
        [ 0.        ,  0.36638257,  0.36638257,  0.36638257,  0.36638257,
          1.27906771,  0.        ,  0.36638257,  0.        ,  0.        ,
          0.        , -0.36638257,  0.        , -0.36638257,  0.        ,
         -0.36638257,  0.        , -0.36638257,  0.        , -1.27906771,
         -0.36638257],
        [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         -0.        ],
        [ 0.        ,  0.36638257,  0.36638257,  0.36638257,  0.36638257,
          0.36638257,  0.        ,  0.70598634,  0.        ,  0.        ,
          0.        , -0.70598634,  0.        , -0.70598634,  0.        ,
         -0.70598634,  0.        , -0.70598634,  0.        , -0.70598634,
         -0.36638257],
        [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         -0.        ],
        [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         -0.        ],
        [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         -0.        ],
        [ 0.        , -0.89739937, -0.36638257, -0.36638257, -0.36638257,
         -0.36638257,  0.        , -0.70598634,  0.        ,  0.        ,
          0.        ,  1.7169127 ,  0.        ,  0.70598634,  0.        ,
          0.70598634,  0.        ,  0.70598634,  0.        ,  0.70598634,
          0.36638257],
        [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         -0.        ],
        [ 0.        , -0.36638257, -1.03668247, -0.36638257, -0.36638257,
         -0.36638257,  0.        , -0.70598634,  0.        ,  0.        ,
          0.        ,  0.70598634,  0.        ,  1.99393419,  0.        ,
          0.70598634,  0.        ,  0.70598634,  0.        ,  0.70598634,
          0.36638257],
        [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         -0.        ],
        [ 0.        , -0.36638257, -0.36638257, -1.04559011, -0.36638257,
         -0.36638257,  0.        , -0.70598634,  0.        ,  0.        ,
          0.        ,  0.70598634,  0.        ,  0.70598634,  0.        ,
          1.97863792,  0.        ,  0.70598634,  0.        ,  0.70598634,
          0.36638257],
        [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         -0.        ],
        [ 0.        , -0.36638257, -0.36638257, -0.36638257, -0.66505528,
         -0.36638257,  0.        , -0.70598634,  0.        ,  0.        ,
          0.        ,  0.70598634,  0.        ,  0.70598634,  0.        ,
          0.70598634,  0.        ,  1.27889308,  0.        ,  0.70598634,
          0.36638257],
        [ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         -0.        ],
        [ 0.        , -0.36638257, -0.36638257, -0.36638257, -0.36638257,
         -1.27906771,  0.        , -0.70598634,  0.        ,  0.        ,
          0.        ,  0.70598634,  0.        ,  0.70598634,  0.        ,
          0.70598634,  0.        ,  0.70598634,  0.        ,  2.37867695,
          0.36638257],
        [-0.        , -0.36638257, -0.36638257, -0.36638257, -0.36638257,
         -0.36638257, -0.        , -0.36638257, -0.        , -0.        ,
         -0.        ,  0.36638257, -0.        ,  0.36638257, -0.        ,
          0.36638257, -0.        ,  0.36638257, -0.        ,  0.36638257,
          0.36638257]]),
 'e(beta)': array([[ 0.        ,  0.01239614,  0.07200443,  0.17276986,  0.32311413,
          0.2832181 ,  0.        , -0.26970688,  0.        ,  0.        ,
          0.        ,  0.0487697 ,  0.        ,  0.090657  ,  0.        ,
          0.12640132,  0.        ,  0.21209664,  0.        ,  0.15095109]])}

You can also access specific elements of the dictionary. For example, you can access e(b) and e(V) by typing steret[‘e(b)’] and steret[‘e(V)’], respectively, in Python.

[9]:
steret['e(b)']
[9]:
array([[  0.        ,   0.79561746,   5.11707797,  12.20017802,
         16.85886868,  22.50888857,   0.        , -12.601317  ,
          0.        ,   0.        ,   0.        ,   4.14015609,
          0.        ,   8.6448661 ,   0.        ,  11.83133884,
          0.        ,  14.09300146,   0.        ,  15.86607901,
        123.88620072]])

Next, we use the margins command to estimate the average predicted systolic blood pressure for each age group, and we graph the results using marginsplot.

[10]:
%%stata
margins agegrp2

Predictive margins                                      Number of obs = 10,351
Model VCE: OLS

Expression: Linear prediction, predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
     agegrp2 |
      20-29  |   117.2684    .419845   279.31   0.000     116.4454    118.0914
      30-39  |   120.2383   .5020813   239.48   0.000     119.2541    121.2225
      40-49  |   126.9255     .56699   223.86   0.000     125.8141    128.0369
      50-59  |    135.682   .5628593   241.06   0.000     134.5787    136.7853
      60-69  |   141.5285   .3781197   374.30   0.000     140.7873    142.2696
        70+  |   148.1096   .6445073   229.80   0.000     146.8463     149.373
------------------------------------------------------------------------------
[11]:
%%stata
marginsplot

Variables that uniquely identify margins: agegrp2
../_images/notebook_Example2_12_1.svg

Now, we estimate those expected values separately for males and females in each age group, and we graph the results.

[12]:
%%stata
margins agegrp2#sex2
marginsplot

. margins agegrp2#sex2

Adjusted predictions                                    Number of obs = 10,351
Model VCE: OLS

Expression: Linear prediction, predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
agegrp2#sex2 |
 20-29#Male  |   123.8862   .6052954   204.67   0.000     122.6997    125.0727
      20-29 #|
     Female  |   111.2849   .5827553   190.96   0.000     110.1426    112.4272
 30-39#Male  |   124.6818    .728709   171.10   0.000     123.2534    126.1102
      30-39 #|
     Female  |   116.2207    .692755   167.77   0.000     114.8627    117.5786
 40-49#Male  |   129.0033   .8187185   157.57   0.000     127.3984    130.6081
      40-49 #|
     Female  |   125.0468   .7859058   159.11   0.000     123.5063    126.5874
 50-59#Male  |   136.0864   .8241405   165.13   0.000     134.4709    137.7019
      50-59 #|
     Female  |   135.3164   .7703532   175.66   0.000     133.8064    136.8264
 60-69#Male  |   140.7451   .5465096   257.53   0.000     139.6738    141.8163
      60-69 #|
     Female  |   142.2368   .5236736   271.61   0.000     141.2103    143.2633
   70+#Male  |   146.3951   .9553456   153.24   0.000     144.5224    148.2678
 70+#Female  |   149.6599   .8717829   171.67   0.000      147.951    151.3687
------------------------------------------------------------------------------

. marginsplot

Variables that uniquely identify margins: agegrp2 sex2

.
../_images/notebook_Example2_14_1.svg