Data (sfi.Data)

class sfi.Data

This class provides access to the current Stata dataset. All variable and observation numbering begins at 0. The allowed values for the variable index var and the observation index obs are

-nvar <= var < nvar

and

-nobs <= obs < nobs

Here nvar is the number of variables defined in the dataset currently loaded in Stata, which is returned by getVarCount(). nobs is the number of observations defined in the dataset currently loaded in Stata, which is returned by getObsTotal().

Negative values for var and obs are allowed and are interpreted in the usual way for Python indexing. In all functions that take var as an argument, var can be specified as either the variable index or the variable name. Note that passing the variable index will be more efficient because looking up the index for the specified variable name is avoided for each function call.
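
As a minimal sketch of the index rule above, the following plain-Python helper maps an allowed index to its nonnegative equivalent. The name normalize_index is hypothetical and the code is only an illustration of the stated range, not the way sfi resolves indices internally.

```python
def normalize_index(index, n):
    """Map an index in the range -n <= index < n to its nonnegative equivalent.

    Hypothetical helper: it only illustrates the rule stated above.
    """
    if not -n <= index < n:
        raise ValueError("index %d is out of range for size %d" % (index, n))
    # Negative indices count back from the end, as in Python sequences.
    return index + n if index < 0 else index


# With 74 observations, obs = -1 names the last observation (index 73).
print(normalize_index(-1, 74))
print(normalize_index(0, 74))
```

The same rule applies to var with nvar in place of nobs.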

Method Summary

addObs(n[, nofill]) Add n observations to the current Stata dataset.
addVarByte(name) Add a variable of type byte to the current Stata dataset.
addVarDouble(name) Add a variable of type double to the current Stata dataset.
addVarFloat(name) Add a variable of type float to the current Stata dataset.
addVarInt(name) Add a variable of type int to the current Stata dataset.
addVarLong(name) Add a variable of type long to the current Stata dataset.
addVarStr(name, length) Add a variable of type str to the current Stata dataset.
addVarStrL(name) Add a variable of type strL to the current Stata dataset.
allocateStrL(sc, size[, binary]) Allocate a strL so that a buffer can be stored using writeBytes(); the contents of the strL will not be initialized.
dropVar(var) Drop the specified variables from the current Stata dataset.
get([var, obs, selectvar, valuelabel, …]) Read values from the current Stata dataset.
getAsDict([var, obs, selectvar, valuelabel, …]) Read values from the current Stata dataset and store them in a dictionary.
getAt(var, obs) Read a value from the current Stata dataset.
getBestType(value) Get the best numeric data type for the specified value.
getFormattedValue(var, obs, bValueLabel) Read a value from the current Stata dataset, applying its display format.
getMaxStrLength() Get the maximum length of a Stata string variable of type str.
getMaxVars() Get the maximum number of variables Stata currently allows.
getObsTotal() Get the number of observations in the current Stata dataset.
getStrVarWidth(var) Get the width of the variable of type str.
getVarCount() Get the number of variables in the current Stata dataset.
getVarFormat(var) Get the format for the Stata variable.
getVarIndex(name) Look up the variable index for the specified name in the current Stata dataset.
getVarLabel(var) Get the label for the Stata variable.
getVarName(index) Get the name for the Stata variable.
getVarType(var) Get the storage type for the Stata variable, such as unknown, byte, int, long, float, double, strL, str18, etc.
isAlias(var) Test if a variable is an alias to a variable in another frame.
isVarTypeStr(var) Test if a variable is of type str.
isVarTypeString(var) Test if a variable is of type string.
isVarTypeStrL(var) Test if a variable is of type strL.
keepVar(var) Keep the specified variables.
list([var, obs]) List values from the current Stata dataset.
readBytes(sc, length) Read a sequence of bytes from a strL in the current Stata dataset.
renameVar(var, name) Rename a Stata variable.
setObsTotal(nobs) Set the number of observations in the current Stata dataset.
setVarFormat(var, format) Set the format for a Stata variable.
setVarLabel(var, label) Set the label for a Stata variable.
store(var, obs, val[, selectvar]) Store values in the current Stata dataset.
storeAt(var, obs, val) Store a value in the current Stata dataset.
storeBytes(sc, b, binary) Store a byte buffer to a strL in the current Stata dataset.
writeBytes(sc, b[, off, length]) Write length bytes from the specified byte buffer starting at offset off to a strL in the current Stata dataset; the strL must be allocated using allocateStrL() before calling this method.

Method Detail

static addObs(n, nofill=False)

Add n observations to the current Stata dataset. By default, the added observations are filled with the appropriate missing-value code. If nofill is specified and equal to True, the added observations are not filled, which speeds up the process. Setting nofill to True is not recommended. If you choose this setting, it is your responsibility to ensure that the added observations are ultimately filled in or removed before control is returned to Stata.

There need not be any variables defined to add observations. If you are attempting to create a dataset from nothing, you can add the observations first and then add the variables.

Parameters:
  • n (int) – Number of observations to add.
  • nofill (bool, optional) – Do not fill the added observations. Default is False.
Raises:

ValueError – If the number of observations to add, n, exceeds the limit of observations.
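
The "observations first, then variables" order described above can be sketched as follows. This is an illustration under assumptions: build_column is a hypothetical helper, data is expected to be sfi.Data (passed in so that the function can also be exercised outside Stata), and the dataset in memory is assumed to be empty so that the new observations are numbered from 0.

```python
def build_column(data, name, values):
    """Create len(values) observations, then a double variable, then fill it.

    `data` is expected to be sfi.Data; `build_column` is a hypothetical helper.
    """
    data.addObs(len(values))          # observations first ...
    data.addVarDouble(name)           # ... then the variable
    var = data.getVarIndex(name)      # resolve the name once
    for obs, value in enumerate(values):
        data.storeAt(var, obs, value)
```

Inside Stata, this would be called as build_column(Data, "x", [1.5, 2.0, 3.25]) after from sfi import Data.
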

static addVarByte(name)

Add a variable of type byte to the current Stata dataset.

Parameters:name (str) – Name of the variable to be created.
Raises:ValueError – If name is not a valid Stata variable name.
static addVarDouble(name)

Add a variable of type double to the current Stata dataset.

Parameters:name (str) – Name of the variable to be created.
Raises:ValueError – If name is not a valid Stata variable name.
static addVarFloat(name)

Add a variable of type float to the current Stata dataset.

Parameters:name (str) – Name of the variable to be created.
Raises:ValueError – If name is not a valid Stata variable name.
static addVarInt(name)

Add a variable of type int to the current Stata dataset.

Parameters:name (str) – Name of the variable to be created.
Raises:ValueError – If name is not a valid Stata variable name.
static addVarLong(name)

Add a variable of type long to the current Stata dataset.

Parameters:name (str) – Name of the variable to be created.
Raises:ValueError – If name is not a valid Stata variable name.
static addVarStr(name, length)

Add a variable of type str to the current Stata dataset.

Parameters:
  • name (str) – Name of the variable to be created.
  • length (int) – Initial size of the variable. If the length is greater than getMaxStrLength(), then a variable of type strL will be created.
Raises:

ValueError – This error can be raised if

  • name is not a valid Stata variable name.
  • length is not a positive integer.
static addVarStrL(name)

Add a variable of type strL to the current Stata dataset.

Parameters:name (str) – Name of the variable to be created.
Raises:ValueError – If name is not a valid Stata variable name.
static allocateStrL(sc, size, binary=True)

Allocate a strL so that a buffer can be stored using writeBytes(); the contents of the strL will not be initialized.

Parameters:
  • sc (StrLConnector) – The StrLConnector representing a strL.
  • size (int) – The size in bytes.
  • binary (bool, optional) – Mark the data as binary. Note that if the data are not marked as binary, Stata expects that the data be UTF-8 encoded. An alternate approach is to call storeAt(), where the encoding is automatically handled. Default is True.
static dropVar(var)

Drop the specified variables from the current Stata dataset.

Parameters:var (int, str, or list-like) – Variables to drop. It can be specified as a single variable index or name, or an iterable of variable indices or names.
Raises:ValueError – If any of the variable indices or names specified in var is out of range or not found.
static get(var=None, obs=None, selectvar=None, valuelabel=False, missingval=_DefaultMissing())

Read values from the current Stata dataset.

Parameters:
  • var (int, str, or list-like, optional) – Variables to access. It can be specified as a single variable index or name, or an iterable of variable indices or names. If var is not specified, all the variables are specified.
  • obs (int or list-like, optional) – Observations to access. It can be specified as a single observation index or an iterable of observation indices. If obs is not specified, all the observations are specified.
  • selectvar (int or str, optional) – Observations for which selectvar!=0 will be selected. If selectvar is an integer, it is interpreted as a variable index. If selectvar is a string, it should contain the name of a Stata variable. Specifying selectvar as “” has the same result as not specifying selectvar, which means no observations are excluded. Specifying selectvar as -1 means that observations with missing values for the variables specified in var are to be excluded.
  • valuelabel (bool, optional) – Use the value label when available. Default is False.
  • missingval (_DefaultMissing, optional) – If missingval is specified, all the missing values in the returned list are replaced by this value. If it is not specified, the numeric value of the corresponding missing value in Stata is returned.
Returns:

A list of lists containing the values from the dataset in memory. Each sublist contains values for one observation.

Return type:

List

Raises:

ValueError – This error can be raised if

  • any of the variable indices or names specified in var is out of range or not found.
  • any of the observation indices specified in obs is out of range.
  • selectvar is out of range or not found.

Notes

The definition of the utility class _DefaultMissing is as follows:

class _DefaultMissing:
    def __repr__(self):
        return "_DefaultMissing()"

This class is defined only for the purpose of specifying the default value for the parameter missingval of the above function. Using it for any other purpose is not recommended.

static getAsDict(var=None, obs=None, selectvar=None, valuelabel=False, missingval=_DefaultMissing())

Read values from the current Stata dataset and store them in a dictionary. The keys are the variable names. The values are the data values for the corresponding variables.

Parameters:
  • var (int, str, or list-like, optional) – Variables to access. It can be specified as a single variable index or name, or an iterable of variable indices or names. If var is not specified, all the variables are specified.
  • obs (int or list-like, optional) – Observations to access. It can be specified as a single observation index or an iterable of observation indices. If obs is not specified, all the observations are specified.
  • selectvar (int or str, optional) – Observations for which selectvar!=0 will be selected. If selectvar is an integer, it is interpreted as a variable index. If selectvar is a string, it should contain the name of a Stata variable. Specifying selectvar as “” has the same result as not specifying selectvar, which means no observations are excluded. Specifying selectvar as -1 means that observations with missing values for the variables specified in var are to be excluded.
  • valuelabel (bool, optional) – Use the value label when available. Default is False.
  • missingval (_DefaultMissing, optional) – If missingval is specified, all the missing values in the returned dictionary are replaced by this value. If it is not specified, the numeric value of the corresponding missing value in Stata is returned.
Returns:

A dictionary containing the data values from the dataset in memory, keyed by variable name.

Return type:

dictionary

Raises:

ValueError – This error can be raised if

  • any of the variable indices or names specified in var is out of range or not found.
  • any of the observation indices specified in obs is out of range.
  • selectvar is out of range or not found.
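
A dictionary keyed by variable name is convenient for column-wise work. When records are needed instead, it can be reshaped as in the sketch below. This assumes each dictionary value is a list with one entry per selected observation, which the description above suggests but does not spell out. The helper dict_to_rows is hypothetical, and the sample dictionary is a hand-written stand-in built from values shown in the Examples (observations 0 and 2 of auto.dta).

```python
def dict_to_rows(columns):
    """Turn {name: [v0, v1, ...]} into [{name: v0, ...}, {name: v1, ...}, ...]."""
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*(columns[n] for n in names))]


# Hand-written stand-in for a getAsDict() result; not produced by sfi here.
cols = {"make": ["AMC Concord", "AMC Spirit"], "price": [4099, 3799]}
print(dict_to_rows(cols))
```
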
static getAt(var, obs)

Read a value from the current Stata dataset.

Parameters:
  • var (int or str) – Variable to access. It can be specified as the variable index or name.
  • obs (int) – Observation to access.
Returns:

The value.

Return type:

float or str

Raises:

ValueError – This error can be raised if

  • var is out of range or not found.
  • obs is out of range.
static getBestType(value)

Get the best numeric data type for the specified value.

Parameters:value (float) – The value to test.
Returns:The best numeric data type for the specified value. It may be byte, int, long, float, or double.
Return type:str
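
One possible use of getBestType() is to pick the matching addVar method for a value. The sketch below assumes the returned string is exactly one of the five type names listed above. The helper add_var_for_value is hypothetical, and data is expected to be sfi.Data.

```python
def add_var_for_value(data, name, value):
    """Add a numeric variable whose storage type is the best type for `value`."""
    adders = {
        "byte": data.addVarByte,
        "int": data.addVarInt,
        "long": data.addVarLong,
        "float": data.addVarFloat,
        "double": data.addVarDouble,
    }
    best = data.getBestType(value)    # assumed to be one of the keys above
    adders[best](name)
    return best
```
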
static getFormattedValue(var, obs, bValueLabel)

Read a value from the current Stata dataset, applying its display format.

Parameters:
  • var (int or str) – Variable to access. It can be specified as the variable index or name.
  • obs (int) – Observation to access.
  • bValueLabel (bool) – Use the value label when available.
Returns:

The formatted value as a string.

Return type:

str

Raises:

ValueError – This error can be raised if

  • var is out of range or not found.
  • obs is out of range.
static getMaxStrLength()

Get the maximum length of a Stata string variable of type str.

Returns:The maximum length.
Return type:int
static getMaxVars()

Get the maximum number of variables Stata currently allows.

Returns:The maximum number of variables.
Return type:int
static getObsTotal()

Get the number of observations in the current Stata dataset.

Returns:The number of observations.
Return type:int
static getStrVarWidth(var)

Get the width of the variable of type str.

Parameters:var (int or str) – Variable to access. It can be specified as the variable index or name.
Returns:The width of the variable.
Return type:int
Raises:ValueError – If var is out of range or not found.
static getVarCount()

Get the number of variables in the current Stata dataset.

Returns:The number of variables.
Return type:int
static getVarFormat(var)

Get the format for the Stata variable.

Parameters:var (int or str) – Variable to access. It can be specified as the variable index or name.
Returns:The variable format.
Return type:str
Raises:ValueError – If var is out of range or not found.
static getVarIndex(name)

Look up the variable index for the specified name in the current Stata dataset.

Parameters:name (str) – Variable to access.
Returns:The variable index.
Return type:int
Raises:ValueError – If name is not found.
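
Because the class overview notes that passing an index avoids a name lookup on every call, a loop over observations can resolve the name once with getVarIndex() and reuse the result. A minimal sketch, with the hypothetical helper read_column and data expected to be sfi.Data:

```python
def read_column(data, name):
    """Read every observation of one variable, resolving its name only once."""
    var = data.getVarIndex(name)              # single name lookup
    return [data.getAt(var, obs) for obs in range(data.getObsTotal())]
```

For a whole column, get(var=name) reads the values in one call; the loop form is mainly useful when each observation needs its own handling.
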
static getVarLabel(var)

Get the label for the Stata variable.

Parameters:var (int or str) – Variable to access. It can be specified as the variable index or name.
Returns:The variable label.
Return type:str
Raises:ValueError – If var is out of range or not found.
static getVarName(index)

Get the name for the Stata variable.

Parameters:index (int) – Variable to access.
Returns:The variable name at the given index.
Return type:str
Raises:ValueError – If index is out of range.
static getVarType(var)

Get the storage type for the Stata variable, such as unknown, byte, int, long, float, double, strL, str18, etc.

Parameters:var (int or str) – Variable to access. It can be specified as the variable index or name.
Returns:The variable storage type.
Return type:str
Raises:ValueError – If var is out of range or not found.
static isAlias(var)

Test if a variable is an alias to a variable in another frame.

Parameters:var (int or str) – Variable to access. It can be specified as the variable index or name.
Returns:True if the variable is an alias.
Return type:bool
Raises:ValueError – If var is out of range or not found.
static isVarTypeStr(var)

Test if a variable is of type str.

Parameters:var (int or str) – Variable to access. It can be specified as the variable index or name.
Returns:True if the variable is of type str.
Return type:bool
Raises:ValueError – If var is out of range or not found.
static isVarTypeString(var)

Test if a variable is of type string.

Parameters:var (int or str) – Variable to access. It can be specified as the variable index or name.
Returns:True if the variable is of type str or strL.
Return type:bool
Raises:ValueError – If var is out of range or not found.
static isVarTypeStrL(var)

Test if a variable is of type strL.

Parameters:var (int or str) – Variable to access. It can be specified as the variable index or name.
Returns:True if the variable is of type strL.
Return type:bool
Raises:ValueError – If var is out of range or not found.
static keepVar(var)

Keep the specified variables.

Parameters:var (int, str, or list-like) – Variables to keep. It can be specified as a single variable index or name, or an iterable of variable indices or names.
Raises:ValueError – If any of the variable indices or names specified in var is out of range or not found.
static list(var=None, obs=None)

List values from the current Stata dataset. The values are displayed using their corresponding variable formats.

Parameters:
  • var (int, str, or list-like, optional) – Variables to display. It can be specified as a single variable index or name, or an iterable of variable indices or names. If var is not specified, all the variables are specified.
  • obs (int or list-like, optional) – Observations to display. It can be specified as a single observation index or an iterable of observation indices. If obs is not specified, all the observations are specified.
Raises:

ValueError – This error can be raised if

  • any of the variable indices or names specified in var is out of range or not found.
  • any of the observation indices specified in obs is out of range.
static readBytes(sc, length)

Read a sequence of bytes from a strL in the current Stata dataset.

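Parameters:
  • sc (StrLConnector) – The StrLConnector representing a strL.
  • length (int) – The number of bytes to read.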
Returns:

The array of bytes. An empty array of bytes is returned if there are no more data because the end has been reached.

Return type:

bytes

Raises:
  • ValueError – If length is not a positive integer.
  • IOError – If failure occurred when attempting to read a sequence of bytes.
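
Since an empty result signals that the end has been reached, a strL can be read in fixed-size chunks until nothing more comes back. The sketch below assumes that successive calls continue where the previous one stopped, which the Returns text implies but does not state outright. The helper read_all is hypothetical, data is expected to be sfi.Data, and sc is a StrLConnector created elsewhere.

```python
def read_all(data, sc, chunk_size=4096):
    """Read a strL in chunks until readBytes() returns an empty result."""
    parts = []
    while True:
        chunk = data.readBytes(sc, chunk_size)
        if not chunk:                 # empty: the end has been reached
            break
        parts.append(chunk)
    return b"".join(parts)
```
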
static renameVar(var, name)

Rename a Stata variable.

Parameters:
  • var (str or int) – Name or index of the variable to rename.
  • name (str) – New variable name.
Raises:

ValueError – This error can be raised if

  • var is not found or out of range.
  • name is not a valid Stata variable name.
static setObsTotal(nobs)

Set the number of observations in the current Stata dataset.

Parameters:nobs (int) – The number of observations to set.
Raises:ValueError – If the number of observations to set, nobs, exceeds the limit of observations.
static setVarFormat(var, format)

Set the format for a Stata variable.

Parameters:
  • var (int or str) – Index or name of the variable to format.
  • format (str) – New format.
Raises:

ValueError – This error can be raised if

  • var is out of range or not found.
  • format is not a valid Stata format.
static setVarLabel(var, label)

Set the label for a Stata variable.

Parameters:
  • var (int or str) – Index or name of the variable to label.
  • label (str) – New label.
Raises:

ValueError – If var is out of range or not found.

static store(var, obs, val, selectvar=None)

Store values in the current Stata dataset.

Parameters:
  • var (int, str, list-like, or None) – Variables to access. It can be specified as a single variable index or name, an iterable of variable indices or names, or None. If None is specified, all the variables are specified.
  • obs (int, list-like, or None) – Observations to access. It can be specified as a single observation index, an iterable of observation indices, or None. If None is specified, all the observations are specified.
  • val (array-like) – Values to store. The dimensions of val should match the dimensions implied by var and obs. Each of the values can be numeric or string based on the corresponding variable data types.
  • selectvar (int or str, optional) – Only store values for observations with selectvar!=0. If selectvar is an integer, it is interpreted as a variable index. If selectvar is a string, it should contain the name of a Stata variable. Specifying selectvar as “” has the same result as not specifying selectvar, which means values are stored for all observations specified. Specifying selectvar as -1 means that observations with missing values for the variables specified in var are to be skipped.
Raises:
  • ValueError – This error can be raised if
    • any of the variable indices or names specified in var is out of range or not found.
    • any of the observation indices specified in obs is out of range.
    • dimensions of val do not match the dimensions implied by var and obs.
    • selectvar is out of range or not found.
  • TypeError – If any of the values specified in val does not match the corresponding variable data type.
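
A common case is filling one new variable for every observation. The sketch below checks the length up front and then stores the whole column in a single call. It assumes that, for a single variable with obs given as None, a flat list with one value per observation satisfies the dimension requirement. The helper store_column is hypothetical, and data is expected to be sfi.Data.

```python
def store_column(data, name, values):
    """Add a double variable and fill it with one value per observation."""
    nobs = data.getObsTotal()
    if len(values) != nobs:
        raise ValueError("expected %d values, got %d" % (nobs, len(values)))
    data.addVarDouble(name)
    data.store(name, None, values)    # obs=None: all observations
```
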
static storeAt(var, obs, val)

Store a value in the current Stata dataset.

Parameters:
  • var (int or str) – Variable to access. It can be specified as the variable index or name.
  • obs (int) – Observation to access.
  • val (float or str) – Value to store. The value data type depends on the corresponding variable data type.
Raises:

ValueError – This error can be raised if

  • var is out of range or not found.
  • obs is out of range.
static storeBytes(sc, b, binary)

Store a byte buffer to a strL in the current Stata dataset. You do not need to call allocateStrL() before using this method.

Parameters:
  • sc (StrLConnector) – The StrLConnector representing a strL.
  • b (bytes or bytearray) – Bytes to store.
  • binary (bool) – Mark the data as binary.
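
For text, the note under allocateStrL() says that data not marked as binary are expected to be UTF-8 encoded. A minimal sketch of storing a Python string that way follows. The helper store_text is hypothetical, data is expected to be sfi.Data, and sc is a StrLConnector created elsewhere.

```python
def store_text(data, sc, text):
    """Store a Python string in a strL as UTF-8, not marked as binary."""
    payload = text.encode("utf-8")
    data.storeBytes(sc, payload, False)   # binary=False: contents are UTF-8 text
    return len(payload)                   # number of bytes handed to storeBytes()
```

As that note also points out, storeAt() handles the encoding automatically and may be the simpler choice for ordinary strings.
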
static writeBytes(sc, b, off=None, length=None)

Write length bytes from the specified byte buffer starting at offset off to a strL in the current Stata dataset; the strL must be allocated using allocateStrL() before calling this method.

Parameters:
  • sc (StrLConnector) – The StrLConnector representing a strL.
  • b (bytes or bytearray) – The buffer holding the data to store.
  • off (int, optional) – The offset into the buffer. If not specified, 0 is used.
  • length (int, optional) – The number of bytes to write. If not specified, the size of b is used.
Raises:

ValueError – This error can be raised if

  • off is negative.
  • length is not a positive integer.

Examples

The following provides a few quick examples illustrating how to use this class:

>>> from sfi import Data
>>> stata: sysuse auto, clear
(1978 Automobile Data)
>>> Data.getAt(0, 0)
'AMC Concord'
>>> Data.get(0, 0)
[['AMC Concord']]
>>>
>>> Data.get(var='price')
[4099, 4749, 3799, 4816, 7827, 5788, 4453, 5189, 10372, 4082, 11385, 14500, 15906, 3299, 5705,
 4504, 5104, 3667, 3955, 3984, 4010, 5886, 6342, 4389, 4187, 11497, 13594, 13466, 3829, 5379,
 6165, 4516, 6303, 3291, 8814, 5172, 4733, 4890, 4181, 4195, 10371, 4647, 4425, 4482, 6486,
 4060, 5798, 4934, 5222, 4723, 4424, 4172, 9690, 6295, 9735, 6229, 4589, 5079, 8129, 4296, 5799,
 4499, 3995, 12990, 3895, 3798, 5899, 3748, 5719, 7140, 5397, 4697, 6850, 11995]
>>>
>>> Data.get(obs=0)
['AMC Concord', 4099, 22, 3, 2.5, 11, 2930, 186, 40, 121, 3.5799999237060547, 0]
>>>
>>> Data.get([0,2,3], [0,2,4,6])
[['AMC Concord', 22, 3], ['AMC Spirit', 22, 8.98846567431158e+307], ['Buick Electra', 15, 4],
 ['Buick Opel', 26, 8.98846567431158e+307]]
>>> Data.get(var='foreign')
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
>>>
>>> Data.get(var='foreign', valuelabel=True)
['Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic',
 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic',
 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic',
 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic',
 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic',
 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic', 'Domestic',
 'Domestic', 'Domestic', 'Domestic', 'Foreign', 'Foreign', 'Foreign', 'Foreign', 'Foreign',
 'Foreign', 'Foreign', 'Foreign', 'Foreign', 'Foreign', 'Foreign', 'Foreign', 'Foreign', 'Foreign',
 'Foreign', 'Foreign', 'Foreign', 'Foreign', 'Foreign', 'Foreign', 'Foreign', 'Foreign', 'Foreign']
>>> Data.getVarLabel(0)
'Make and Model'
>>> Data.getVarLabel('price')
'Price'
>>> Data.setVarLabel(1, 'Retail Price')
>>> Data.setVarLabel('mpg', 'Mileage per Gallon')
>>> Data.renameVar(0, 'make2')
>>> Data.renameVar('price', 'price2')
>>> Data.dropVar("make2")
>>> Data.dropVar("price2 mpg rep78")
>>> Data.dropVar(0)
>>> Data.dropVar([0,2,3])
>>> stata: sysuse auto, clear
(1978 Automobile Data)
>>> Data.get(var='rep78')
[3, 3, 8.98846567431158e+307, 3, 4, 3, 8.98846567431158e+307, 3, 3, 3, 3, 2, 3, 3, 4, 3, 2, 2,
 3, 5, 2, 2, 2, 4, 3, 3, 3, 3, 4, 4, 3, 3, 4, 3, 4, 3, 3, 4, 3, 1, 3, 3, 5, 3, 8.98846567431158e+307,
 2, 4, 1, 3, 3, 8.98846567431158e+307, 2, 5, 3, 4, 4, 5, 4, 4, 3, 5, 4, 4, 8.98846567431158e+307,
 3, 5, 5, 5, 5, 4, 5, 4, 4, 5]
>>>
>>> Data.get(var='rep78', missingval=-100)
[3, 3, -100, 3, 4, 3, -100, 3, 3, 3, 3, 2, 3, 3, 4, 3, 2, 2, 3, 5, 2, 2, 2, 4, 3, 3, 3, 3, 4, 4,
 3, 3, 4, 3, 4, 3, 3, 4, 3, 1, 3, 3, 5, 3, -100, 2, 4, 1, 3, 3, -100, 2, 5, 3, 4, 4, 5, 4, 4, 3,
 5, 4, 4, -100, 3, 5, 5, 5, 5, 4, 5, 4, 4, 5]
>>>
>>> Data.get(var='rep78', missingval=None)
[3, 3, None, 3, 4, 3, None, 3, 3, 3, 3, 2, 3, 3, 4, 3, 2, 2, 3, 5, 2, 2, 2, 4, 3, 3, 3, 3, 4, 4,
 3, 3, 4, 3, 4, 3, 3, 4, 3, 1, 3, 3, 5, 3, None, 2, 4, 1, 3, 3, None, 2, 5, 3, 4, 4, 5, 4, 4, 3,
 5, 4, 4, None, 3, 5, 5, 5, 5, 4, 5, 4, 4, 5]
>>>
>>> import numpy as np
>>> Data.get(var='rep78', missingval=np.nan)
[3, 3, nan, 3, 4, 3, nan, 3, 3, 3, 3, 2, 3, 3, 4, 3, 2, 2, 3, 5, 2, 2, 2, 4, 3, 3, 3, 3, 4, 4, 3,
 3, 4, 3, 4, 3, 3, 4, 3, 1, 3, 3, 5, 3, nan, 2, 4, 1, 3, 3, nan, 2, 5, 3, 4, 4, 5, 4, 4, 3, 5, 4,
 4, nan, 3, 5, 5, 5, 5, 4, 5, 4, 4, 5]

Next we will show you a few advanced examples to illustrate how to communicate between Stata and Python using this class. Suppose we want to calculate the mean, standard deviation, minimum value, and maximum value of specified variables in the current dataset using Python, and then we want to create an output from the results in a similar format to that produced by the Stata command summarize.

We have data containing information on various automobiles, including the variables price, the price of the automobile; mpg, the mileage rating; rep78, the repair record in 1978; and headroom, the headroom size in inches. We can obtain summary statistics for those variables in Stata by typing the following:

. sysuse auto, clear
(1978 Automobile Data)
. local varlist price mpg rep78 headroom
. summarize `varlist'

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         74    6165.257    2949.496       3291      15906
         mpg |         74     21.2973    5.785503         12         41
       rep78 |         69    3.405797    .9899323          1          5
    headroom |         74    2.993243    .8459948        1.5          5

We see that there are only 69 observations for rep78, so some of the observations are missing. To get a similar output using Python, we can work with each variable using the following steps:

  • Determine whether the variable is a string or a numeric variable. If it is a string variable, skip it.
  • For each observation of the variable, determine whether it is missing or not. If it is missing, skip it.
  • Calculate the summary statistics using the nonmissing observations.

With this in mind, we construct a program in a Python script, say, dataex1.py, as follows:

from sfi import Data, Macro, Missing, SFIToolkit

#obtain list of the variables to summarize
varlist = Macro.getLocal("varlist")
vars = varlist.split(" ")
nobs = Data.getObsTotal()

#display the header
SFIToolkit.displayln("\n" +   "    " + "Variable {c |}        Obs        Mean    Std. Dev.       Min        Max")
SFIToolkit.displayln("{hline 13}{c +}{hline 57}")

for var in vars:
    total = 0
    maxv = float("-inf")
    minv = float("inf")
    avgv = 0
    stddev = 0
    count = 0

    #skip the variable if it is string
    if not Data.isVarTypeStr(var):

        #calculate sum, max, min over the nonmissing observations
        for obs in range(nobs):
            #obtain the observation value
            value = Data.getAt(var, obs)

            #skip the missing observations
            if Missing.isMissing(value):
                continue

            if value > maxv:
                maxv = value
            if value < minv:
                minv = value

            total += value
            count += 1

        #display the results
        out = "%12s {c |}%11s" % (var, SFIToolkit.formatValue(count, "%11.0gc"))
        if count>0:
            avgv = total / count

            #calculate std. dev.; it is missing for a single observation
            d2sum = 0
            for obs in range(nobs):
                value = Data.getAt(var, obs)
                if Missing.isMissing(value):
                    continue

                d2sum += pow(value-avgv,2)

            stddev = pow(d2sum/(count-1), 0.5) if count>1 else Missing.getValue()

            out += "   %9s" % (SFIToolkit.formatValue(avgv,  "%9.0g"))
            out += "   %9s" % (SFIToolkit.formatValue(stddev,"%9.0g"))
            out += "  %9s" % (SFIToolkit.formatValue(minv,   "%9.0g"))
            out += "  %9s" % (SFIToolkit.formatValue(maxv,   "%9.0g"))

        #a variable with no nonmissing observations still gets a row showing 0 Obs
        SFIToolkit.displayln(out)

We run the script file in Stata using python script, which creates the following output:

 . python script dataex1.py

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         74    6165.257    2949.496       3291      15906
         mpg |         74     21.2973    5.785503         12         41
       rep78 |         69    3.405797    .9899323          1          5
    headroom |         74    2.993243    .8459948        1.5          5

In the example above, we looped over each observation by using the getAt() method to get the observation values of each variable, and we operated on those values to calculate the summary statistics. We can also store all the values of each variable in a list and operate on the list to calculate those statistics. This can be done using the get() method. We reconstruct the program in a Python script file, say, dataex2.py, as follows:

from sfi import Data, Macro, Missing, SFIToolkit
import numpy as np
from math import sqrt

#obtain list of the variables to summarize
varlist = Macro.getLocal("varlist")
vars = varlist.split(" ")

#display the header
SFIToolkit.displayln("\n" +   "    " + "Variable {c |}        Obs        Mean    Std. Dev.       Min        Max")
SFIToolkit.displayln("{hline 13}{c +}{hline 57}")

for var in vars:

    #skip the variable if it is string
    if not Data.isVarTypeStr(var):

        #get the observation values in a list and construct a numpy array
        #using nonmissing observations
        vals = np.array(Data.get(var=var, selectvar=-1))

        #calculate summary statistics
        count = vals.size
        avgv = np.mean(vals)
        stddev = np.std(vals)*sqrt(count*1.0/(count-1))
        maxv = np.max(vals)
        minv = np.min(vals)

        #display the results
        out = "%12s {c |}%11s" % (var, SFIToolkit.formatValue(count, "%11.0gc"))
        if count>0:
            out += "   %9s" % (SFIToolkit.formatValue(avgv,  "%9.0g"))
            out += "   %9s" % (SFIToolkit.formatValue(stddev,"%9.0g"))
            out += "  %9s" % (SFIToolkit.formatValue(minv,   "%9.0g"))
            out += "  %9s" % (SFIToolkit.formatValue(maxv,   "%9.0g"))

            SFIToolkit.displayln(out)

Notice that we specified selectvar as -1 when we used get() to store the observation values of each variable in a list. This caused the missing values (if there were any) to be skipped, so as to avoid using them in further calculations. Otherwise, you would need to remove them from further calculations yourself.
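
Removing the missing values yourself could look like the sketch below. It assumes that missing values appear in the returned list as 8.98846567431158e+307 (the value printed for rep78 in the earlier examples) or larger, so a simple comparison filters them out. Inside Stata, Missing.getValue() would presumably supply that threshold. The helper drop_missing is hypothetical.

```python
SYSMISS = 8.98846567431158e+307   # value printed for missing rep78 in the examples


def drop_missing(values, threshold=SYSMISS):
    """Keep only the values strictly below the missing-value threshold."""
    return [v for v in values if v < threshold]


rep78_head = [3, 3, SYSMISS, 3, 4, 3, SYSMISS, 3]   # first eight rep78 values
print(drop_missing(rep78_head))
```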

Afterward, we created a numpy array using the list. Then we calculated the summary statistics using the array operations. Running the above script file produces the following output:

 . python script dataex2.py

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         74    6165.257    2949.496       3291      15906
         mpg |         74     21.2973    5.785503         12         41
       rep78 |         69    3.405797    .9899323          1          5
    headroom |         74    2.993243    .8459948        1.5          5
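
The script converts numpy's default population standard deviation (denominator n) into the sample standard deviation that summarize reports (denominator n-1) by multiplying by sqrt(n/(n-1)). The sketch below checks that identity on made-up numbers with Python's standard statistics module, so it runs without Stata or numpy. In numpy, np.std(vals, ddof=1) should give the same result directly.

```python
from math import sqrt, isclose
from statistics import pstdev, stdev

# made-up values, not taken from the auto dataset
vals = [12.0, 18.0, 21.0, 25.0, 41.0]
n = len(vals)

# population standard deviation rescaled to the n-1 denominator
rescaled = pstdev(vals) * sqrt(n / (n - 1))

# sample standard deviation computed directly
direct = stdev(vals)

print(isclose(rescaled, direct))
```

The rescaling divides by n-1, so it is undefined when a variable has only one nonmissing observation.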

Often, you will want summary statistics for a subsample, such as the average price of cars that have a mileage rating greater than 25 mpg or the minimum and maximum mileage ratings of foreign cars. We can support such criteria by writing an ado-file command that accepts the if and in qualifiers, just as summarize does. We write a command, pysumm, in the ado-file pysumm.ado as follows:

program pysumm
    version 16
    syntax varlist [if] [in]

    //mark the observations for use
    marksample touse, novarlist

    //call the Python function for calculation
    python: pysummarize("`varlist'", "`touse'")
end

version 16
python:
from sfi import Data, SFIToolkit, Missing, Scalar
import numpy as np
from math import sqrt

def pysummarize(varlist, touse):
    vars = varlist.split(" ")

    #display the header
    SFIToolkit.displayln("\n" +   "    " + "Variable {c |}        Obs        Mean    Std. Dev.       Min        Max")
    SFIToolkit.displayln("{hline 13}{c +}{hline 57}")

    #clear the r() results
    SFIToolkit.rclear()

    for var in vars:

        #skip the variable if it is string
        if not Data.isVarTypeStr(var):

            #get the filtered observation values in a list and
            #construct a numpy array
            vals = np.array(Data.get(var=var, selectvar=touse))

            #skip missing observations
            vals = vals[vals < Missing.getValue()]

            #count the nonmissing observations
            count = vals.size

            #display the results
            out = "%12s {c |}%11s" % (var, SFIToolkit.formatValue(count, "%11.0gc"))
            if count>0:
                #calculate summary statistics; the standard deviation uses the
                #n-1 denominator and is missing when there is only one observation
                avgv = np.mean(vals)
                if count>1:
                    stddev = np.std(vals)*sqrt(count*1.0/(count-1))
                else:
                    stddev = Missing.getValue()
                maxv = np.max(vals)
                minv = np.min(vals)

                out += "   %9s" % (SFIToolkit.formatValue(avgv,  "%9.0g"))
                out += "   %9s" % (SFIToolkit.formatValue(stddev,"%9.0g"))
                out += "  %9s" % (SFIToolkit.formatValue(minv,   "%9.0g"))
                out += "  %9s" % (SFIToolkit.formatValue(maxv,   "%9.0g"))

                #store the mean of each variable in r()
                Scalar.setValue("r(mean_of_"+var+")", avgv)

            #display the row even when there are no observations to summarize
            SFIToolkit.displayln(out)

end

The ado-file contains both ado-code and Python code. The ado-code handled all issues of parsing and identifying the subsample of the data to be used. The Python code defined a function, pysummarize(), to calculate the summary statistics. The ado-code called the Python function.

In the ado-code, we used marksample to create a 0/1 variable that records which observations are to be used in subsequent code. By default, marksample sets the to-use variable to 0 if any variable in varlist contains a missing value or if the observation does not satisfy the if and in qualifiers (if they were specified). In this command, we calculate the summary statistics for each variable independently, so we do not want to exclude an observation just because one of the variables in varlist is missing; otherwise, an observation with a nonmissing value for one variable would be skipped whenever it had a missing value for another variable in varlist. The novarlist option prevents this. With novarlist, the to-use variable is 0 only for observations that do not satisfy the if and in qualifiers.
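
To see why novarlist matters, the following plain-Python sketch mimics a to-use variable on a few made-up observations. It only illustrates the marking logic described above and is not how marksample is implemented. SYSMISS (assumed to equal Stata's system missing value, 2**1023), make_touse(), and count_nonmissing() are hypothetical names.

```python
# Assumption: stand-in for Stata's system missing value (.)
SYSMISS = 2.0 ** 1023

# made-up observations; the second one has a missing rep78
data = [
    {"price": 4099.0, "rep78": 3.0},
    {"price": 4749.0, "rep78": SYSMISS},
    {"price": 3799.0, "rep78": 4.0},
]

def make_touse(rows, condition, novarlist):
    """Return a 0/1 list in which 1 marks an observation to be used."""
    touse = []
    for row in rows:
        keep = condition(row)
        if not novarlist:
            # default marking: also exclude rows with any missing value
            keep = keep and all(v < SYSMISS for v in row.values())
        touse.append(1 if keep else 0)
    return touse

def count_nonmissing(rows, touse, var):
    """Count marked observations whose value for var is not missing."""
    return sum(1 for row, t in zip(rows, touse) if t and row[var] < SYSMISS)

def no_filter(row):
    # stands in for a command typed without if or in qualifiers
    return True

default_mark = make_touse(data, no_filter, novarlist=False)
novarlist_mark = make_touse(data, no_filter, novarlist=True)

print(default_mark, count_nonmissing(data, default_mark, "price"))
print(novarlist_mark, count_nonmissing(data, novarlist_mark, "price"))
```

With the default marking, price loses an observation only because rep78 is missing there. With the novarlist-style marking, each variable keeps all of its own nonmissing observations.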

All the Python code was defined within the python: and end block. Here we defined a few import statements and a function, pysummarize(). The ado-code called this function to perform the calculation in Python. As the connection between the ado-code and the Python code, the function received two strings as arguments: one contained the names of the variables in the Stata dataset to summarize, and the other contained the name of the to-use variable that identified the subsample of the data to be used.

Within the Python function, we stored the observation values of each variable that satisfied the qualifiers in a list and constructed a numpy array from the list. We then removed the missing values from the array. Afterward, we calculated the summary statistics and displayed them for each variable. In addition, we stored the mean of each variable in an r-class scalar. You can store other statistics in r() in the same way.

Running the command on the above variables produces the following output:

 . pysumm `varlist'

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         74    6165.257    2949.496       3291      15906
         mpg |         74     21.2973    5.785503         12         41
       rep78 |         69    3.405797    .9899323          1          5
    headroom |         74    2.993243    .8459948        1.5          5

 . return list

scalars:
      r(mean_of_price) =  6165.256756756757
        r(mean_of_mpg) =  21.2972972972973
      r(mean_of_rep78) =  3.405797101449275
   r(mean_of_headroom) =  2.993243243243243

Suppose we want to calculate the summary statistics for the first 50 cars, but only if they have a mileage rating greater than 25 mpg. We can type

 . summarize `varlist' in 1/50 if mpg > 25

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |          7    4526.143    973.8098       3299       6486
         mpg |          7    28.71429    2.751623         26         34
       rep78 |          5           4           1          3          5
    headroom |          7    2.142857    .5563486        1.5          3

 . pysumm `varlist' in 1/50 if mpg > 25

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |          7    4526.143    973.8098       3299       6486
         mpg |          7    28.71429    2.751623         26         34
       rep78 |          5           4           1          3          5
    headroom |          7    2.142857    .5563486        1.5          3

 . return list

scalars:
      r(mean_of_price) =  4526.142857142857
        r(mean_of_mpg) =  28.71428571428572
      r(mean_of_rep78) =  4
   r(mean_of_headroom) =  2.142857142857143
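
The in 1/50 qualifier counts observations from 1, whereas sfi.Data numbers observations from 0. The plain-Python sketch below shows how a qualifier pair such as in 1/3 if mpg > 25 translates into a 0/1 to-use list and then into 0-based observation indices. The data are made up, and make_touse() is a hypothetical helper, not part of Stata or the sfi module.

```python
def make_touse(mpg, first, last, threshold):
    """Return a 0/1 list marking observations first..last (1-based,
    inclusive) whose mpg exceeds threshold."""
    touse = []
    for index, value in enumerate(mpg):
        obs_number = index + 1  # Stata counts observations from 1
        in_range = first <= obs_number <= last
        touse.append(1 if in_range and value > threshold else 0)
    return touse

# made-up mileage ratings for four cars
mpg = [22.0, 30.0, 26.0, 35.0]

# equivalent of: in 1/3 if mpg > 25
touse = make_touse(mpg, first=1, last=3, threshold=25)

# 0-based indices of the selected observations
selected = [i for i, t in enumerate(touse) if t]
print(touse, selected)
```

The fourth car exceeds 25 mpg but falls outside the in range, so it is not marked.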