Stata Conference Chicago 2019

Chicago 2019

11–12 July

The Stata Conference was held 11-12 July 2019, but you can view the proceedings and presentation slides (below) and the conference photos.

Don't forget to save the date and join us next year at the 2020 Stata Conference in Philadelphia, Pennsylvania, on 30-31 July 2020!

Using Stata for data collection and management

ietoolkit: How DIME Analytics develops Stata code from primary data work Abstract: Over the years, the complexity of data work in development research has grown exponentially, and standardizations for workflows are needed for researchers and data analysts
to work simultaneously on multiple projects. ietoolkit was developed to standardize and simplify best practices for data management and analysis across the 100-plus members of the World Bank's Development Research Group, Impact Evaluations team (DIME). It includes a standardized project folder structure; standardized Stata "boilerplate" code; standardized balance tables, graphs, and matching procedures; and modified dropping and saving commands with built-in safety checks. The presentation will outline how the ietoolkit structure is meant to serve as a guide for projects to move their data through the analysis process in a standardized way, as well as offer a brief introduction to the other commands. The intent is for many projects within one organization to have a predictable workflow, such that researchers and data analysts can move between multiple projects and support other teams easily and rapidly without expending time relearning idiosyncratic project organization structures and standards. These tools are developed open-source on GitHub and available publicly.

Additional information:
chicago19_Bjarkefur.pdf

Kristoffer Bjarkefur

World Bank Group (DIME)

iefieldkit: Stata commands for primary data collection and cleaning Abstract: Data collection and cleaning workflows use highly repetitive but extremely important processes. iefieldkit was developed to standardize and simplify
best practices for high-quality primary data collection across the 100-plus members of the World Bank's Development Research Group, Impact Evaluations team (DIME). It automates error-checking for electronic ODK-based survey modules such as those implemented in SurveyCTO; duplicate checking and resolution; data cleaning, including renaming, labeling, recoding, and survey harmonization; and codebook creation. The presentation will outline how the iefieldkit package is intended to provide a data-collection workflow skeleton for nearly any type of primary data collection, from questionnaire design to data import. One feature of many iefieldkit commands is their utilization of spreadsheet-based workflows, which reduce repetitive coding in Stata and document corrections and cleaning in a human-readable format. This enables rapid review of data quality in a standardized process, with the goal of producing maximally clean primary data for the downstream data construction and analysis phases in a transparent and accessible manner. These tools are developed open-source on GitHub and available publicly.

Additional information:
chicago19_Daniels.pdf

Benjamin Daniels

World Bank Group (DIME)

Graphics development

Barrel-aged software development: brewscheme as a four-year-old Abstract: The term "software development" implies some type of change over time. While Stata goes through extraordinary steps to support backward compatibility, user-contributors may not always see a need to continue developing programs shared with the community.

How do you know if or when you should add additional programs or functionality to an existing package? Is it easy and practical to extend existing Stata code, or is it easier to refactor everything from the ground up? What can you do to make it easier to extend existing code? While brewscheme may have started as a relatively simple package with a couple of commands and limited functionality, in the four years since it was introduced, it has grown into a multifunctional library of tools to make it easier to create customized visualizations in Stata while being mindful of color sight impairments. I will share my experience, what I have learned, and strategies related to how I dealt with these questions in the context of the development of the brewscheme package. I will also show what the additional features do that the original brewscheme did not do.

Additional information:
chicago19_Buchanan (https:)

Billy Buchanan

Fayette County Public Schools

Substantive applications

Simulating baboon behavior using Stata Abstract: This presentation originated from a field study of the behavior of feral baboons in Tanzania. The field study used behavior sampling methods, including on-the-moment (instantaneous) and thru-the-moment (one-zero). Some primatologists critiqued behavioral sampling as not reflecting true frequency or duration. A Monte Carlo simulation study was performed to compare behavior sampling with actual frequency and duration.

Additional information:
chicago19_Ender.pdf

Phil Ender

UCLA (retired)

Using cluster analysis to understand complex datasets: Experience from a national nursing consortium Abstract: Cluster analysis is a type of exploratory data analysis for classifying observations and identifying distinct groups. It may be useful for complex datasets where commonly used regression modeling approaches may be inadequate because of outliers, complex interactions, or violation of assumptions.

In health care, the complex effect of nursing factors (including staffing levels, experience, and contract status), hospital size, and patient characteristics on patient safety (including pressure ulcers and falls) has not been well understood. In this presentation, I will explore the use of Stata cluster analysis (cluster) to describe five groups of hospital units that have distinct characteristics to predict patient pressure ulcers and hospital falls in relationship to employment of supplemental registered nurses (SRNs) in a national nursing database. The use of SRNs is a common practice among hospitals to fill gaps in nurse staffing. But the relationship between the use of SRNs and patient outcomes varies widely, with some groups reporting a positive relationship, while other groups report an adverse relationship. The purpose of this presentation is to identify the advantages and disadvantages of cluster analysis and other methods when analyzing nonnormally distributed, nonlinear data that have unpredictable interactions.

Additional information:
chicago19_Williams.pptx

Barbara Williams

Virginia Mason Medical Center

The individual process of neighborhood change and residential segregation in 1940: An implication of a discrete choice model Abstract: Using the 1940 restricted census microdata, this study develops discrete choice models to investigate how individual and household characteristics, along with the features of neighborhoods of residence, affect individual choices of residential outcomes in US cities.

This study will make several innovations: (1) We will take advantage of 100% census microdata on the whole population of the cities to establish discrete choice models estimating the attributes of alternatives (for example, neighborhoods) and personal characteristics simultaneously. (2) This study will set a routine of reconstructing personal records to the data structure eligible for discrete choice models and then test whether the assumptions are violated. (3) This study will assess the extent and importance of discrimination and residential preferences, respectively, through the model specification. The results suggest that both in-group racial and class preferences can explain the individual process of neighborhood changes. All groups somehow practice out-group avoidance based on race and social class. Such phenomena are more pronounced in multiracial cities.

Additional information:
chicago19_Zou.pptx

Karl X.Y. Zou

Texas A&M University

Featured presentation from StataCorp

Using Python within Stata Abstract: Users may extend Stata's features using other programming languages such as Java and C. Stata 16 adds tight integration with Python, which allows users to embed and execute Python code from within Stata. I will discuss how users can easily call Python from Stata, output Python results within Stata, and exchange data and results between Python and Stata, both interactively and as subroutines within do-files and ado-files. I will also show examples of the Stata Function Interface (sfi), a Python module provided with Stata that offers extensive facilities for accessing Stata objects from within Python.

Additional information:
chicago19_Peng (https:)

Hua Peng

StataCorp

Difference-in-differences

Extending the difference-in-differences (DID) to settings with many treated units and same intervention time: Model and Stata implementation Abstract: The difference-in-differences (DID) estimator is popular for estimating average treatment effects in causal inference studies. Under the common support assumption, DID overcomes the problem of unobservable selection using panel, time, or location fixed effects and the knowledge of the pre- or postintervention times.

New developments of DID have been recently proposed: (i) the synthetic control method (SCM) applies when a long pre- and postintervention time series is available, only one unit is treated, and intervention occurs at a specific time (implemented in Stata via SYNTH by Hainmueller, Abadie, and Diamond [2014]); (ii) an extension to binary time-varying treatment with many treated units has also been proposed and implemented in Stata via TVDIFF (Cerulli and Ventura, 2018). However, a command to accommodate a setting with many treated units and the same intervention time is still lacking. In this presentation, I propose a potential-outcome model to accommodate this latter setting and provide a Stata implementation via the new Stata routine FTMTDIFF (standing for fixed-time multiple treated DID). I will finally set some guidelines for future DID developments.

Additional information:
chicago19_Cerulli.pdf

Giovanni Cerulli

IRCrES-CNR, National Research Council of Italy

Bacon decomposition for understanding differences-in-differences with variation in treatment timing Abstract: In applications of a difference-in-differences (DD) model, researchers often exploit natural experiments with variation in onset, comparing outcomes across groups of units that receive treatment starting at different times. Goodman-Bacon (2019) shows that this DD estimator is a weighted average of all possible two-group or two-period DD estimators in the data. The bacon command performs this decomposition and graphs all two-by-two DD estimates against their weight, which displays all identifying variation for the overall DD estimate. Given the widespread use of the two-way fixed effects DD model, bacon has broad applicability across domains and will help researchers understand how much of a given DD estimate comes from different sources of variation.
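
As a rough illustration of the building block behind this decomposition, the sketch below computes a single two-group, two-period DD estimate from group means in plain Python. The function name, the data layout, and the toy numbers are assumptions made for this example; it is not the bacon command and it does not compute the decomposition weights.

```python
from statistics import mean


def dd_2x2(outcomes):
    """Two-by-two difference-in-differences from raw outcomes.

    `outcomes` maps (group, period) to a list of outcome values, with
    group in {"treated", "control"} and period in {"pre", "post"}.
    This layout is hypothetical and chosen only for the sketch.
    """
    change_treated = mean(outcomes[("treated", "post")]) - mean(outcomes[("treated", "pre")])
    change_control = mean(outcomes[("control", "post")]) - mean(outcomes[("control", "pre")])
    # The DD estimate is the treated group's change net of the control group's change.
    return change_treated - change_control


# Toy data (made up): treated mean moves 11 -> 16, control mean moves 9 -> 10.
toy = {
    ("treated", "pre"): [10.0, 12.0],
    ("treated", "post"): [15.0, 17.0],
    ("control", "pre"): [8.0, 10.0],
    ("control", "post"): [9.0, 11.0],
}
dd_estimate = dd_2x2(toy)
```

With staggered treatment timing, one such estimate exists for every pair of timing groups, and the overall two-way fixed effects estimate averages them with weights that this sketch leaves out.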

Additional information:
chicago19_Goodman-Bacon.pdf

Andrew Goodman-Bacon

Vanderbilt University

Stata programming

The matching problem using Stata Abstract: The main purpose of this presentation is to discuss an algorithm for the matching problem. As an example, the K-cycle kidney exchange problem is defined and solved using a user-written Stata program.

Additional information:
chicago19_Lee.pdf

Choonjoo Lee

Korea National Defense University

Mata implementation of Gauss-Legendre quadrature in the M-estimation context: Correcting for sample-selection bias in a generic nonlinear setting Abstract: Many contexts in empirical econometrics require nonclosed form integration for appropriate modeling and estimation design. Applied researchers often avoid such correct but computationally demanding specifications and opt for simpler misspecified modeling designs.

The presentation will detail a newly developed Mata implementation of a relatively simple numerical integration technique – Gauss-Legendre quadrature. Although this Mata code is applicable in a variety of circumstances, it was mainly written for use in M-estimation when the relevant objective function (for example, the likelihood function) involves integration at the observation level. As inputs, the user supplies a vector-valued integrand function (for example, a vector of sample log-likelihood integrands) and a matrix of upper and lower integration limits. The code outputs the corresponding vector of integrals (for example, the vector of observation-specific log-likelihood values). To illustrate the use of this Mata implementation, we conduct an empirical analysis of classical sample-selection bias in the estimation of wage offer regressions. We estimate a nonlinear version of the model based on the modeling approach suggested by Terza (Econometric Reviews 2009) which requires numerical integration. This model is juxtaposed with the classical linear sample-selection specification of Heckman (Annals of Economic and Social Measurement 1976), for which numerical integration is not required.
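
The input and output shape described above (an integrand evaluated per observation, a matrix of integration limits, and a vector of integrals) can be sketched in plain Python. This is only an illustration under stated assumptions: the function names, the `integrand(i, t)` calling convention, and the default of 20 nodes are hypothetical, and the code is not the Mata implementation from the talk.

```python
import math


def gauss_legendre_rule(n):
    """Nodes and weights of the n-point Gauss-Legendre rule on [-1, 1].

    Nodes are roots of the Legendre polynomial P_n, found by Newton's method.
    """
    nodes, weights = [], []
    for i in range(1, n + 1):
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))  # standard starting guess
        for _ in range(100):
            # Three-term recurrence: after the loop p = P_n(x), p_prev = P_{n-1}(x).
            p_prev, p = 1.0, x
            for k in range(2, n + 1):
                p_prev, p = p, ((2 * k - 1) * x * p - (k - 1) * p_prev) / k
            dp = n * (x * p - p_prev) / (x * x - 1.0)  # derivative P_n'(x)
            step = p / dp
            x -= step
            if abs(step) < 1e-15:
                break
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, weights


def integrate_by_observation(integrand, limits, n_nodes=20):
    """Return one integral per row of `limits`.

    `integrand(i, t)` gives the integrand for observation i at point t;
    `limits` is a list of (lower, upper) pairs, one per observation.
    """
    nodes, weights = gauss_legendre_rule(n_nodes)
    integrals = []
    for i, (a, b) in enumerate(limits):
        half, mid = 0.5 * (b - a), 0.5 * (a + b)  # map [-1, 1] onto [a, b]
        total = sum(w * integrand(i, mid + half * x) for x, w in zip(nodes, weights))
        integrals.append(half * total)
    return integrals


# Toy use: observation-specific decay rates (made-up values).
rates = [1.0, 2.0]
integrals = integrate_by_observation(lambda i, t: math.exp(-rates[i] * t),
                                     [(0.0, 1.0), (0.0, 2.0)])
```

In an M-estimation setting, the returned list would play the role of the observation-level likelihood contributions that are then logged and summed.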

Additional information:
chicago19_Terza.pdf

Joseph Terza

Indiana University Purdue University Indianapolis

Economic applications

A practical application of the mvport package: CAPM-based optimal portfolios Abstract: The mvport package has commands for financial portfolio optimization and portfolio backtesting. I present a practical implementation of a CAPM-based strategy to select stocks, and then apply different optimization settings, and evaluate the resulting portfolios.

The presentation illustrates how to automate the process through a simple do-file that makes it easy to change parameters (for example, stock list, market index, risk-free rate) using an Excel interface. The program automates the following: a) data collection, b) CAPM model estimation for all stocks, c) selection of stocks based on CAPM parameters, d) portfolio optimization with different configurations, and e) portfolio backtesting. For data collection, the getsymbols and freduse commands are used to get online price data for all the S&P 500 stocks and the risk-free rate. For each stock, two competing CAPM models are estimated: one using a simple regression and one using an autoregressive conditional heteroskedasticity (ARCH) model. The CAPM parameters are used to select stocks. Then the mvport package is used to optimize different configurations of the portfolio. Finally, the performance of each portfolio configuration is calculated and compared with the market portfolio.

Additional information:
chicago19_Dorantes.pdf
chicago19_Dorantes.xlsx

Carlos Dorantes

Tec de Monterrey

Tools to analyze interest rates and value bonds Abstract: Bond markets contain a wealth of information about investor preferences and expectations. However, extracting such information from market interest rates can be computationally burdensome. I introduce a suite of new Stata commands to aid finance professionals and researchers in using Stata to analyze the term structure of interest rates and value bonds. The genspot command uses a bootstrap methodology to construct a spot rate curve from a yield curve of market interest rates under a no-arbitrage assumption. The genfwd command generates a forward rate curve from a spot rate curve, allowing researchers to infer market participants’ expectations of future interest rates. Finally, the pricebond command uses forward rates to value a bond with user-specified terms.
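
The chain from market yields to spot rates, forward rates, and a bond price can be sketched in a few lines of Python. The sketch assumes annual compounding, annual coupons, and a par yield curve with one yield per whole-year maturity; these conventions and all function names are assumptions made for illustration and may differ from what genspot, genfwd, and pricebond actually do.

```python
def bootstrap_spot_rates(par_yields):
    """Bootstrap spot rates from par yields for maturities 1, 2, ..., N years.

    A par bond with coupon y_n is priced at 1, so
    1 = y_n * (d_1 + ... + d_n) + d_n, which is solved for d_n in turn.
    """
    spots = []
    annuity = 0.0  # sum of discount factors for shorter maturities
    for n, y in enumerate(par_yields, start=1):
        d_n = (1.0 - y * annuity) / (1.0 + y)
        spots.append(d_n ** (-1.0 / n) - 1.0)
        annuity += d_n
    return spots


def forward_rates(spots):
    """One-year forward rates implied by the spot curve under no arbitrage."""
    forwards = []
    previous_growth = 1.0
    for n, s in enumerate(spots, start=1):
        growth = (1.0 + s) ** n
        forwards.append(growth / previous_growth - 1.0)
        previous_growth = growth
    return forwards


def price_bond(face, coupon_rate, forwards):
    """Value an annual-coupon bond by discounting along the forward curve."""
    price, discount = 0.0, 1.0
    last = len(forwards)
    for t, f in enumerate(forwards, start=1):
        discount /= 1.0 + f
        cash_flow = face * coupon_rate + (face if t == last else 0.0)
        price += cash_flow * discount
    return price


# Toy upward-sloping par curve (made-up yields).
par_curve = [0.02, 0.03, 0.04]
spot_curve = bootstrap_spot_rates(par_curve)
forward_curve = forward_rates(spot_curve)
par_bond_price = price_bond(100.0, par_curve[-1], forward_curve)
```

A useful sanity check on any such implementation is that a bond paying the par coupon for its maturity reprices to its face value.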

Additional information:
chicago19_Schmidt.pdf

Tim Schmidt

Discover Financial Services

Panel stochastic frontier models with endogeneity in Stata Abstract: I introduce xtsfkk, a new Stata command for fitting panel stochastic frontier models with endogeneity. The advantage of xtsfkk is that it can control for the endogenous variables in the frontier or the inefficiency term in a longitudinal setting. Hence, xtsfkk performs better than standard panel frontier methodologies such as xtfrontier that overlook endogeneity by design.

Additional information:
chicago19_Karakaplan.pptx

Mustafa Karakaplan

Statistical topics

Recentered influence functions (RIF) in Stata: RIF-regression and RIF-decomposition Abstract: Recentered influence functions (RIF) are statistical tools that have been popularized by Firpo, Fortin, and Lemieux (2009) for analyzing unconditional partial effects (UPE) on quantiles in a regression analysis framework (unconditional quantile regressions). The flexibility and simplicity of this tool, however, have opened the possibility of extending the analysis to other distributional statistics, using linear regressions or decomposition approaches. In this presentation, I introduce three Stata commands to facilitate the use of RIFs in the analysis of outcome distributions: rifvar() is an egen extension used to create RIFs for a large set of distributional statistics; rifhdreg facilitates the estimation of RIF regressions enabling the use of high-dimensional fixed effects; and oaxaca_rif implements Oaxaca-Blinder-type decomposition analysis.
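
For readers new to the idea, the RIF of a quantile can be written as RIF(y; q_tau) = q_tau + (tau - 1{y <= q_tau}) / f(q_tau), where f is the density of the outcome. The Python sketch below builds that variable for a sample. The function name, the simple empirical quantile rule, and the Gaussian kernel with a normal-reference bandwidth are all assumptions made for illustration; they are not taken from rifvar() or the other commands in the talk.

```python
import math


def rif_quantile(y, tau):
    """Return (q_tau, list of RIF values) for the tau-th quantile of y."""
    n = len(y)
    ordered = sorted(y)
    # Simple empirical quantile: smallest value with at least tau*n observations at or below it.
    q = ordered[max(math.ceil(tau * n) - 1, 0)]
    # Gaussian kernel density estimate at q, normal-reference bandwidth (an assumption).
    avg = sum(ordered) / n
    sd = math.sqrt(sum((v - avg) ** 2 for v in ordered) / (n - 1))
    h = 1.06 * sd * n ** (-0.2)
    density = sum(math.exp(-0.5 * ((q - v) / h) ** 2) for v in ordered) / (n * h * math.sqrt(2.0 * math.pi))
    rif = [q + (tau - (1.0 if v <= q else 0.0)) / density for v in y]
    return q, rif


# Toy outcome: the integers 1..100 as floats.
outcome = [float(v) for v in range(1, 101)]
median, rif_values = rif_quantile(outcome, 0.5)
```

Regressing the generated values on covariates by ordinary least squares is what the literature calls an unconditional quantile regression; the sample mean of the RIF recovers the quantile itself.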

Additional information:
chicago19_Rios-Avila.pdf

Fernando Rios-Avila

Verifying the existence of maximum likelihood estimates in generalized linear models Abstract: There has been considerable ambiguity over how to verify whether estimates from nonlinear models "exist" and what can be done if they do not. This is the so-called separation problem. We characterize the problem in detail across a wide range of generalized linear models and introduce a novel method for dealing with it in the presence of high-dimensional fixed effects, as are often recommended for gravity models of international trade and in other common panel-data settings. We have included these methods in a new Stata command for HDFE-Poisson estimation called PPMLHDFE. We have also created a suite of test cases developers may use in the future for testing whether their estimation packages are correctly identifying instances of separation. These projects are joint with Sergio Correia and Paulo Guimaraes. We have written two papers related to these topics and also created a website with example code and data illustrating the separation issue and how we solve it. Please see our GitHub repository for more details: https://github.com/sergiocorreia/ppmlhdfe/.

Additional information:
chicago19_Zylkin.pdf

Thomas Zylkin

University of Richmond

Unbiased IV in Stata Abstract: A well-known result is that exactly identified IV has no moments, including in the ideal case of an experimental design (that is, a randomized control trial with imperfect compliance). This result no longer holds when the sign of the first stage is known, however. I describe a Stata implementation of an unbiased estimator for instrumental-variable models with a single endogenous regressor where the sign of one or more first‐stage coefficients is known (due to Andrews and Armstrong 2017) and its finite sample properties under alternative error structures.

Additional information:
chicago19_Nichols.pdf

Austin Nichols

Abt Associates

Using Stata for reproducible research

Connecting Stata and Microsoft Word using StatTag for collaborative reproducibility Abstract: Although Stata can render output and reports to Microsoft Word, PDF, and HTML files, Stata users must sometimes transcribe statistical content into separate Microsoft Word documents (for example, documents drafted by colleagues in Word or documents that must be prepared in Word), a process that is error prone, irreproducible, and inefficient.

This talk will illustrate how StatTag (www.stattag.org), an open source, free, and user-friendly program that we developed, addresses this problem. Since its introduction in 2016, StatTag has undergone substantial improvements and refinements. StatTag establishes a bidirectional link between Stata files and a Word document and supports a reproducible pipeline even when (1) statistical results must be included and updated in Word documents that were never generated from Stata; and (2) text in Word files generated from Stata has departed substantially from original content, for example, through tracked changes or comments. We will demonstrate how to use StatTag to connect Stata and Word files so that all files can be edited separately, but statistical content—values, tables, figures, and verbatim output—can be updated automatically in Word. Using practical examples, we will also illustrate how to use StatTag to view, edit, and rerun Stata code directly from Word.

Additional information:
chicago19_Baldridge.pptx

Abigail S. Baldridge

Northwestern University

Featured presentations from StataCorp

Using lasso and related estimators for prediction

Abstract: Lasso and elastic net are two popular machine-learning methods. In this presentation, I will discuss Stata 16's new features for lasso and elastic net, and I will demonstrate how they can be used for prediction with linear, binary, and count outcomes. We will discover why these methods are effective and how they work.

Additional information:
chicago19_Liu.pdf

Di Liu

StataCorp

Inference after lasso model selection Abstract: The increasing availability of high-dimensional data and increasing interest in more realistic functional forms have sparked a renewed interest in automated methods for selecting the covariates to include in a model. I discuss the promises and perils of model selection and pay special attention to estimators that provide reliable inference after model selection. I will demonstrate how to use Stata 16's new features for double selection, partialing out, and cross-fit partialing out to estimate the effects of variables of interest while using lasso methods to select control variables.

Additional information:
chicago19_Drukker.pdf

David Drukker

StataCorp

Topics in biostatistics

Uncovering the true variability in meta-analysis results using resampling methods Abstract: Traditionally, meta-analyses are performed using a single effect estimate from each included study, resulting in a single combined effect estimate and confidence interval. However, there are a number of processes that could give rise to multiple effect estimates from each study, such as multiple individuals extracting study data, the use of different analysis methods for dealing with missing data or dropouts, and the use of different types of endpoints for measuring the same outcome.

Depending on the number of studies and the number of possible estimates per study, the number of combinations of studies for which a meta-analysis could be performed could be in the thousands. Accordingly, meta-analysts need a tool that can iterate through all of these possible combinations (or a reasonably sized sample thereof), compute an effect estimate for each, and summarize the distribution of the effect estimates and standard errors for all combinations. We have developed a Stata command, resmeta, for this purpose that can generate results for 10,000 combinations in a few seconds. This command can handle both continuous and categorical data, can handle a variable number of estimates per study, and has options to compute a variety of different estimates and standard errors. In the presentation, we will cover case studies where this approach was applied, considerations for more general application of the approach, command syntax and options, and different ways of summarizing the results and evaluating different sources of variability in the results.
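
The idea of iterating over every way of choosing one estimate per study can be sketched in Python as follows. The sketch assumes a fixed-effect (inverse-variance) pooled estimate and exhaustive enumeration; the function names and the toy numbers are hypothetical, and this is not the resmeta command or its syntax.

```python
import itertools
import math
import statistics


def pooled_fixed_effect(pairs):
    """Inverse-variance pooled estimate and standard error.

    `pairs` is a sequence of (estimate, standard_error), one per study.
    """
    weights = [1.0 / se ** 2 for _, se in pairs]
    total = sum(weights)
    estimate = sum(w * b for w, (b, _) in zip(weights, pairs)) / total
    return estimate, math.sqrt(1.0 / total)


def pool_all_combinations(studies):
    """Pool every combination that takes exactly one candidate per study.

    `studies` is a list with one entry per study; each entry lists the
    candidate (estimate, standard_error) pairs extracted for that study.
    """
    return [pooled_fixed_effect(combo) for combo in itertools.product(*studies)]


# Toy input (made up): study 1 has two candidate estimates, study 2 has one.
toy_studies = [
    [(0.2, 0.1), (0.4, 0.1)],
    [(0.6, 0.1)],
]
results = pool_all_combinations(toy_studies)
pooled_estimates = [est for est, _ in results]
summary = {
    "combinations": len(results),
    "min": min(pooled_estimates),
    "median": statistics.median(pooled_estimates),
    "max": max(pooled_estimates),
}
```

The number of combinations is the product of the candidate counts across studies, which is why a random sample of combinations becomes attractive once that product runs into the thousands.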

Additional information:
chicago19_Canner.pptx

Joseph Canner

Johns Hopkins University School of Medicine

Comparing treatments in the presence of competing risks based on life years lost Abstract: Competing risks are frequently encountered in medical research. Examples are clinical trials in head-and-neck and prostate cancer where deaths from cancer and deaths from other causes are competing risks.

Andersen (Stat in Med 2013) showed that the area under the cause j cumulative incidence curve from 0 to t* can be interpreted as the number of life years lost (LYL) due to cause j before time t*. LYL can be estimated and compared in Stata using either the pseudo-observations approach described in Overgaard, Andersen, and Parner (Stata Journal 2015) or by modification of a routine by Pepe and Mori (Stat in Med 1993) for testing the equality of cumulative incidence curves. We describe an application of the method to the DeCIDE trial, a phase III randomized clinical trial of induction chemotherapy plus chemoradiotherapy versus chemoradiotherapy alone in patients with locally advanced head-and-neck cancer. We present simulation results demonstrating that the pseudo-observations and Pepe-Mori approaches yield similar results. We also evaluate the power obtained from comparing life years lost relative to standard procedures for analyzing competing risks data, including cause-specific logrank tests (Freidlin and Korn; Stat in Med 2005) and the Fine-Gray model (Fine and Gray; JASA 1999).
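
The interpretation of LYL as the area under a cumulative incidence curve can be made concrete with a short Python sketch. It uses the standard nonparametric (Aalen-Johansen type) cumulative incidence estimate and integrates the resulting step function up to t*. The function name, the coding of censoring as cause 0, and the toy data are assumptions for illustration; this is neither the pseudo-observations approach nor the Pepe-Mori routine mentioned above.

```python
def life_years_lost(times, causes, cause, t_star):
    """Area under the cumulative incidence curve for `cause` on [0, t_star].

    `causes[i]` is 0 for a censored observation and a positive cause label
    otherwise (hypothetical coding chosen for this sketch).
    """
    data = sorted(zip(times, causes))
    at_risk = len(data)
    survival = 1.0  # all-cause survival just before the current event time
    lyl = 0.0
    i = 0
    while i < len(data) and data[i][0] <= t_star:
        t = data[i][0]
        j = i
        deaths_cause = deaths_any = 0
        while j < len(data) and data[j][0] == t:  # gather ties at time t
            if data[j][1] != 0:
                deaths_any += 1
                if data[j][1] == cause:
                    deaths_cause += 1
            j += 1
        jump = survival * deaths_cause / at_risk  # increment of the incidence curve at t
        lyl += jump * (t_star - t)                # that increment persists until t_star
        survival *= 1.0 - deaths_any / at_risk
        at_risk -= j - i
        i = j
    return lyl


# Toy data (made up): four subjects, two causes, no censoring.
toy_times = [1.0, 2.0, 3.0, 4.0]
toy_causes = [1, 2, 1, 2]
lyl_cause1 = life_years_lost(toy_times, toy_causes, cause=1, t_star=5.0)
lyl_cause2 = life_years_lost(toy_times, toy_causes, cause=2, t_star=5.0)
```

Summed over causes, the LYL values equal t* minus the restricted mean survival time, which gives a quick consistency check on the output.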

Additional information:
chicago19_Karrison.pptx

Theodore Karrison

University of Chicago and NRG Oncology

Hierarchical summary ROC analysis: A frequentist-Bayesian colloquy in Stata Abstract: Meta-analysis of diagnostic accuracy studies requires the use of more advanced methods than meta-analysis of intervention studies. Hierarchical or multilevel modeling accounts for the bivariate nature of the data, both within- and between-study heterogeneity and threshold variability. The hierarchical summary receiver operating characteristic (HSROC) and the bivariate random-effects models are currently recommended by the Cochrane Collaboration.

The bivariate model is focused on estimating summary sensitivity and specificity and as a generalized linear mixed model is estimable in most statistical software, including Stata. The HSROC approach models the implicit threshold and diagnostic accuracy for each study as random effects and includes a shape or scale parameter that enables asymmetry in the SROC by allowing accuracy to vary with implicit threshold. As a generalized nonlinear mixed model, it has not been previously or directly estimable in Stata, though possible with WinBUGS and SAS Proc NLMIXED or indirectly extrapolating its parameters from the bivariate model in Stata. This talk will demonstrate for the first time how the HSROC model can be fit in Stata using ML programming and the recently introduced bayesmh command. Using a publicly available dataset, I will show the comparability of Stata results with those obtained with WinBUGS and SAS Proc NLMIXED.

Additional information:
chicago19_Dwamena.pdf

Ben Adarkwa Dwamena

University of Michigan Medical School

Using Stata to solve problems in applied work

kmr: A command to correct survey weights for unit nonresponse using a group's response rates Abstract: This article describes kmr, a Stata command to estimate a micro compliance function using group-level nonresponse rates (2007, Journal of Econometrics 136: 213-235), which can be used to correct survey weights for unit nonresponse. We illustrate the use of kmr with an empirical example using the Current Population Survey and state-level nonresponse rates.

Additional information:
chicago19_Munoz.pdf

Ercio Munoz

CUNY Graduate Center and Stone Center on Socio-economic Inequality

tesensitivity: A Stata package for assessing the unconfoundedness assumption Abstract: This talk will discuss a new set of methods for quantifying the robustness of treatment effects estimated under the unconfoundedness assumption (also known as selection on observables or conditional ignorability). Specifically, we estimate bounds on the ATE, the ATT, and the QTE under nonparametric relaxations of unconfoundedness indexed by a scalar sensitivity parameter c. These deviations allow for limited selection on unobservables, depending on the value of c. For large enough c, these bounds equal the no assumptions bounds. Our methods allow for both continuous and discrete outcomes but require discrete treatments. We implement these methods in a new Stata package, tesensitivity, for easy use in practice. We illustrate how to use this package and these methods with an empirical application to the National Supported Work Demonstration program.

Additional information:
chicago19_Masten.pdf

Matthew Masten

Duke University

Poster session

Post-estimation analysis with Stata by SPost13 commands of survey data analyzed by MNLM

Additional information:
chicago19_Giovannelli.pdf

Debora Giovannelli

The causal effects of wages on labor supply for married women: Evidence from American couples

Additional information:
chicago19_Wen.pdf

Bob Wen

Clemson University

Fitting generalized linear models when the data exceeds available memory

Additional information:
chicago19_Canner.pdf

Joseph Canner

Johns Hopkins University School of Medicine

Estimation of varying coefficient models in Stata

Additional information:
chicago19_Rios-Avila.pdf

Fernando Rios-Avila

Levy Economics Institute

Psychiatric morbidity in physically injured children and adolescents: A national evaluation

Additional information:
chicago19_Tennakoon.pdf

Lakshika Tennakoon

Stanford University

Scientific committee

Phil Schumm (Chair)
Department of Public Health Sciences
The University of Chicago

Richard Williams
Department of Sociology
University of Notre Dame

Scott Long
Department of Sociology
Indiana University

Matias Cattaneo
Department of Economics
University of Michigan

To stay up to date with future Conference announcements, sign up for an alert now.

#Stata2019
